Scientific computing with Python

by Conor Lawless (email: conor.lawless@ncl.ac.uk)

Workshop Overview

These notes are for a half-day workshop introducing Python to Newcastle University Medical School postgraduate research students. This workshop is next scheduled to run on Tuesday 23rd April 2013, but you are free to read these notes at any time and get in touch with me if you have any questions.

Introduction & Motivation

Modern biological research involves a lot of data, usually stored on a computer. Scientists want to extract the maximum amount of information possible from their data. Writing and executing computer code is an extremely flexible and powerful way to do this, making good use of the impressive computing power and fast network connections currently at our disposal.

Unfortunately, programming has not been part of many biological researchers' training. All research scientists should have some computational tools at their disposal to help with data handling, processing and analysis. Relying solely on a spreadsheet program such as Microsoft Excel for analysis is restrictive and expensive, and it constrains the way we think about research. Spreadsheet software like Excel is poorly suited to advanced analysis and offers little help with time-consuming tasks such as file formatting, image manipulation or text manipulation, which are often important parts of the research workflow.

Many biological research scientists spend hours or days on repetitive, computer-based tasks which could easily be automated if they had a few basic programming skills. Worse still, some easily automatable tasks are not even attempted because of the perceived amount of manual computer work involved. Getting to grips with a little bit of programming will help you become more efficient at a vast range of tasks and improve your ability to do science.
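As a rough illustration of the kind of repetitive task meant here, the sketch below reads every CSV file in a folder and reports the mean of one column in each. It is only a minimal example under stated assumptions: the function name mean_per_file, the folder name and the column name are all hypothetical, and each file is assumed to have a header row and purely numeric values in the chosen column.

```python
import csv
import pathlib
import statistics


def mean_per_file(folder, column):
    """Return {file name: mean of `column`} for every .csv file in `folder`.

    Assumes each file has a header row containing `column` and at least
    one data row whose value in that column can be read as a number.
    """
    results = {}
    # sorted() just makes the order of processing predictable
    for path in sorted(pathlib.Path(folder).glob("*.csv")):
        with open(path, newline="") as handle:
            # DictReader uses the header row to name each column
            values = [float(row[column]) for row in csv.DictReader(handle)]
        results[path.name] = statistics.mean(values)
    return results
```

Calling, for example, mean_per_file("plate_reads", "od600") (both names invented for this sketch) would do in one step what might otherwise mean opening each file by hand in a spreadsheet and typing a formula.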

The Python programming language

Python is a friendly, powerful, flexible open-source programming language with many freely available add-ons which allow it to easily handle an incredibly diverse range of data types in a consistent manner.

Python's functionality overlaps with that of many other tools, including Mathematica, Matlab, R, C++ and Java, but Python has some advantages over each of these. Compared to Mathematica and Matlab, Python is a general-purpose programming language which is capable of doing more than just mathematical analysis (although those tools are designed specifically for mathematical analysis, and are sometimes preferable). R is similarly open-source and freely available, and although powerful and useful, it is designed primarily for statistical analysis rather than as a general-purpose language. R syntax (or language structure) is not as clean, simple and easy to read and learn as that of Python. C++ and Java code typically executes faster, but it is much more difficult to write and debug. Python code is clean and simple, quite powerful, and it is relatively difficult to make mistakes when writing it.

Both Python and R are distributed under open-source licenses, which is important for sharing scientific results. Open-source means that anyone, anywhere with an internet connection can access and install the tools necessary to either use or test published code. As computer code is an increasingly important part of biological research, universal, free access greatly increases the reproducibility of experimental work. Reproducibility is a fundamental component of the scientific method. Universal access, enabled by open-source software, is also convenient for code developers (you and me), allowing us to reuse code on our personal machines, or on colleagues' machines, without the need for expensive licenses or specific permissions.

R can be a good alternative to Python

R is a programming environment designed for statistical analysis. It shares several of Python's best features; in particular, it is an open-source programming tool. It is specifically designed for handling spreadsheet-like numerical data and for statistical analysis, and in many ways is preferable to Python for pure data analysis. However, Python syntax is cleaner, simpler and better structured, making it easier to learn. Python is also more flexible, adaptable and powerful (and therefore much more fun). For these two important reasons, learning Python is a much better way to begin programming than attempting to learn R.

Having said that, if you already have a little programming experience, or if after this course you have got to grips with some programming concepts and are interested in learning about another amazing tool, I thoroughly recommend the introductory R courses run by the School of Maths & Stats here at Newcastle: http://www.ncl.ac.uk/maths/rcourse/

Objectives

After completing this course on Scientific Computing with Python, you should be able to:

All of these steps will be motivated by practical (and hopefully useful) example code which you can download from this site. By the end of the workshop you will see that it is easy to write simple code and that writing code is a powerful and flexible way to make efficient use of computers. Feel free to contact me to ask about any specific aspects of installation, script-writing or script execution which are presented here but do not seem clear.

Some further tools and resources are highlighted in the Other Resources section. In particular, this page includes links to more advanced Python tutorials for continued learning.




Last updated: April 2013