Learning IPython for Interactive Computing and Data Visualization

by Cyrille Rossant

Python 2.x and 3.x

The 3.x branch of Python is not backward compatible with the 2.x branch, which explains why the 2.7 version is still maintained. Although most external Python packages used in this book are compatible with Python 3.x, some packages are still not compatible with this branch. At this time, the choice between Python 2.x and Python 3.x for a new project is typically dictated by the Python 3 support of the required external Python packages. The setup of the targeted users is also an important point to consider. In this book, we will use Python 2.7 and try to minimize the incompatibilities with Python 3.x. This issue is beyond the scope of this book, and we encourage you to search for information about how to write code for Python 2.x that is as compatible with Python 3.x as possible. This official web page is a good starting point: http://wiki.python.org/moin/Python2orPython3

Useful commands

The standard Unix commands pwd, ls, cd, cp, ps, and so on are available in IPython.

Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system.
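
For example, here is a hypothetical session illustrating shadowing and the explicit prefix:

ls = 3       # 'ls' is now a Python variable shadowing the alias
%ls          # the % prefix still invokes the magic/alias explicitly
%automagic   # toggles the automagic system on or off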

Appending ? or ?? to any command or variable gives you all the information you need about it: var? is equivalent to %pinfo var.

In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input (for example, _5 and _i5).
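
As a minimal illustration (the prompt numbers are hypothetical):

In [1]: 10 + 5
Out[1]: 15
In [2]: _ * 2        # _ is the last output, Out[1]
Out[2]: 30
In [3]: Out[1] + _2  # Out[1] and _2 both refer to a numbered output
Out[3]: 45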

Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell.

To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key.

A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution: %run script.py.
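
For instance, with a hypothetical script script.py that reads a variable a from the surrounding namespace:

# script.py (hypothetical)
# total = a + 1     # 'a' is not defined in the script itself

a = 10
%run -i script.py   # -i makes the interactive 'a' visible to the script
total               # defined by the script, now available interactively: 11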

You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes.

In [1]:
%timeit [x*x for x in range(100000)]
100 loops, best of 3: 8.97 ms per loop

Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised.

The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features.

Importing methods

Python provides two main ways to import modules:

  • import library -- If you import a library, you get a module object named library. If you try to access its contents directly, you get an error; you need to qualify them with the module name. Example:
In [2]:
import math
print(math.pi)
3.14159265359

As an alternative, you can import an object from a module and avoid specifying the module name. Example:

In [3]:
from math import pi
print(pi)
3.14159265359

Or you can use the star operator to import everything from the module. The advantage of importing everything from the math module is that your code can be more concise. The disadvantage is that there might be conflicts between names defined in different modules, or between a name from a module and one of your variables. Example:

In [4]:
from math import *
cos(pi)
Out[4]:
-1.0

Interactive Work with IPython

IPython is not only an extended Python console; it also provides several ways to interact with the operating system during a Python interactive session without quitting the console. The list of aliases of Unix commands can be obtained with the %alias magic command.

The native Python module urllib2 can be used to download a file, and the zipfile module to extract it.

In [5]:
import urllib2, zipfile
url = 'http://camcat.df.unicam.it/camcat_files/'
filename = 'TemplateLaTex.zip'
downloaded = urllib2.urlopen(url + filename)

A new folder in the current directory can be made by using mkdir, which is a particular IPython alias redirecting a magic command to a shell command; the dollar ($) sign allows us to use a Python variable within a system or magic command.

In [6]:
folder = 'data'
In [7]:
mkdir $folder
mkdir: data: File exists
In [8]:
%cd $folder
/Users/gianx/data
In [9]:
with open(filename, 'wb') as f:
    f.write(downloaded.read())

and extract it in the current folder with the extractall method of a ZipFile object:

In [10]:
with zipfile.ZipFile(filename) as zip:
    zip.extractall('.')

The current directory can be saved as a bookmark using the command %bookmark my_data. Now, in any future session with the same IPython profile, we can type cd my_data to enter this directory, whichever directory we call this command from. The -l and -d options respectively list all defined bookmarks and delete a specified bookmark. Typing %bookmark? displays the list of all options.
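
A minimal sketch of the bookmark workflow (the bookmark name is arbitrary):

%bookmark my_data      # bookmark the current directory
%bookmark -l           # list all defined bookmarks
cd my_data             # in any later session: jump back to the bookmarked directory
%bookmark -d my_data   # delete the bookmark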

Accessing the system shell from IPython

We can also launch commands using the system shell directly from IPython, and retrieve the result as a list of strings in a Python variable. To do this, we need to prefix shell commands with !. For example:

In [11]:
files = !ls -1 -S | grep Info

The Unix command ls -1 -S lists all files in the current directory, sorted by decreasing size, with one file per line. The pipe | grep Info keeps only those filenames that contain Info. The Python variable files then contains the list of matching filenames.

We can also use Python variables in the system command, using either the $ syntax for single variables, or {} for any Python expression, as follows:

In [12]:
!head -n5 {files[0]}
head: {files[0]}: No such file or directory

The head -n5 {files[0]} command displays the first five lines of the first file in the files list.

Exploring the history

The %history magic command (and %hist, which is an alias) accepts multiple convenient options to display the part of the input history we are interested in. By default, %history displays all our input history in the current session. We can specify a specific line range with a simple syntax, for example, hist 4-6 8 for lines four to six and line eight. We can also choose to display our history from previous sessions with the syntax hist 243/4-8 for lines four to eight of session 243. Finally, we can number the sessions relative to the current session using the syntax %hist ~1/7, which shows line seven of the previous session. Other useful options for %history include -o, which displays the output in addition to the input; -n, which displays the line numbers; -f, which saves the history to a file; and -p, which displays the classic >>> prompt. For example:

In [13]:
 %hist -nop 1-2
   1: >>> %timeit [x*x for x in range(100000)]
   2:
>>> import math
... print(math.pi)
...

displays the history of the first two lines with the line number, the output, and the default Python prompt.

Finally, a related command is %store, which saves the content of any Python variable for later use in any future interactive session. The %store name command saves the variable name, and %store -d name deletes it. To recover the stored variables, we need to use %store -r.
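
A minimal sketch of this workflow:

v = [1, 2, 3]
%store v       # save v on disk for future sessions
# ... later, in a new session with the same profile:
%store -r      # restore all stored variables, including v
%store -d v    # delete the stored copy when it is no longer needed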

Importing code in IPython

When using the IPython console, the %paste magic command can be used to import and execute the code contained in the clipboard. IPython automatically dedents the code and removes the > and + characters at the beginning of the lines, allowing you to paste diff and doctest files directly from e-mails. In addition, the %run magic command executes a Python script in the console, by default, in an empty namespace. This means that any variable defined in the interactive namespace is not available within the executed script. However, at the end of the execution, the control returns to IPython's prompt, and the variables defined in the script are imported into the interactive namespace. If we want the script to use a variable defined in the interactive namespace, we need to use the -i option.

Exporting code to a file

By default, %edit opens the system's text editor and executes the code when we close the editor. If we supply an argument to %edit, this command will try to open the text editor with the code we supplied.
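
For example, all of the following are valid arguments to %edit:

%edit               # empty buffer; the code is executed when the editor closes
%edit myfunction    # open the editor with the source of myfunction
%edit script.py     # open the editor on a file
%edit 4-6           # open the editor with input lines 4 to 6 of the session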

Source code introspection

IPython can also display information about the internals of a variable, in particular about the source code when it is defined in a file. First, typing ? before or after a variable name prints useful information about it. Typing ?? gives more detailed information, in particular, the source code of the object, if it is a function defined in a file. In addition, several magic commands display specific information about a variable, such as the source code of the function (%psource) or of the file (%pfile) where it is defined, the docstring (%pdoc), or the definition header (%pdef). The %pfile magic command also accepts a Python filename, in which case, it prints the file's contents with syntax highlighting. With this function, IPython can then act as a code viewer with syntax highlighting.
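
As an illustration with a throwaway function (the name is arbitrary):

def square(x):
    """Return x squared."""
    return x * x

square?          # basic information and the docstring
square??         # more details, including the source code when available
%psource square  # source code only
%pdef square     # definition header: square(x)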

Using the interactive debugger

First, when we encounter an exception, we can use the %debug magic command to launch the IPython debugger at the exact point where the exception was raised. If we activate the %pdb magic command, the debugger will be automatically launched upon the very next exception. We can also start IPython with ipython --pdb for the same behavior. Finally, we can run a whole script under the control of the debugger with the %run -d command. We can also specify explicitly where to put the first breakpoint; typing %run -d -b29 script.py pauses the program execution on line 29 of script.py. We first need to type c to start the script execution.

When the debugger launches, the prompt becomes ipdb>. The program execution is then paused at a given point in the code. We can use the w command to display the line and the location in the stack traceback where the debugger has paused. At this point, we have access to all local variables and we can control precisely how we want to resume the execution. Within the debugger, several commands are available to navigate into the traceback:

  • u/d for going up/down into the call stack
  • s to step into the next statement
  • n to continue execution until the next line in the current function
  • r to continue execution until the current function returns
  • c to continue execution until the next breakpoint or exception

Other useful commands include:
  • p to evaluate and print any expression
  • a to obtain the arguments of the current function
  • The ! prefix to execute any Python command within the debugger

The entire list of commands can be found in the documentation of the pdb module in Python.
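
A typical (hypothetical) exchange at the ipdb> prompt looks as follows:

ipdb> w         # show where we are in the call stack
ipdb> p x       # print the current value of x
ipdb> !x = 0    # execute a Python statement: reset x
ipdb> n         # run the next line in the current function
ipdb> c         # resume execution until the next breakpoint or exception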

Controlling the execution time of a command

First, the %timeit magic function uses Python's timeit module to estimate the execution time of any Python statement. If we have defined a function fun(x), %timeit fun(x) executes this command multiple times and returns an average of the execution time. The number of calls is determined automatically: there are r loops of n executions each, and these numbers can be specified with the -r and -n options of %timeit. Also, we can easily estimate the execution time of a script with the %run -t command.
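
For instance, to time a statement with 3 loops of 10 executions each, and to time a whole script (the script name is hypothetical):

%timeit -n10 -r3 [x * x for x in range(100000)]
%run -t script.py    # prints the total execution time of script.py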

To obtain much more detailed information about the execution time of a program, we can execute it under the control of a profiler, like the one provided natively by the profile Python module. Profiling is a complex topic, and more details about the profile module can be found in the official Python documentation. To run a script under the control of the profiler, we can execute it from IPython with %run -p or with the equivalent %prun magic command.
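
A sketch of both profiling entry points (the script and function names are hypothetical):

%run -p script.py            # run a whole script under the profiler
%prun -s cumulative fun(x)   # profile one statement, sorted by cumulative time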

Using the IPython notebook

The IPython notebook is increasingly used in the Python community, in particular for scientific research and education. It brings both a powerful HTML user interface to IPython and a way of saving a whole interactive session in a notebook file in a JSON format. The latter functionality brings reproducibility to interactive computing, a crucial feature notably in scientific research. Notebooks run in the browser and can contain not only Python code, but also text in a markup language such as Markdown, as well as images, videos, or rich media content. Notebooks can be converted into other formats, such as Python scripts, HTML, or PDF.

The notebook dashboard

Typing ipython notebook in a shell launches a local web server on port 8888 (by default), and the page http://127.0.0.1:8888/ can be opened in a browser. This page is the notebook dashboard; it lists all notebooks in the directory where ipython notebook was launched from. An IPython notebook file has a .ipynb extension; it is a text file containing structured data in JSON.

At the top of the user interface of the notebook, the menu and the toolbar offer access to all commands. The main area below them shows, by default, an empty input cell. Python code can be typed into this input cell. An important feature of an input cell is that pressing the Enter key does not execute the cell, but rather inserts a new line. Writing code into a cell is then closer to what a standard text editor offers, compared to the classic IPython console.

An input cell can be executed in two ways. By pressing Shift + Enter, all the code within the cell is executed in the current IPython interactive namespace. The output then appears in an output area right below the input cell, and a new input cell is created below. By pressing Ctrl + Enter, no new input cell is created and only the output is shown.

A very useful feature of the notebook is the possibility to insert rich text in cells using a popular markup language called Markdown (described at http://daringfireball.net/projects/markdown/syntax). Formatting features such as bold, italic, headers, and bullet points can be inserted with a simple syntax. To do this, we need to convert a cell into a Markdown cell with the Cell > Markdown command.

The plotting capabilities of the notebook are initialized by launching the notebook with the command ipython notebook --pylab inline. This allows figures to be inserted within the notebook, thanks to the Matplotlib library. NetworkX, for example, offers several Matplotlib-based commands to plot graphs.

In [14]:
%pylab
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['cosh', 'ldexp', 'hypot', 'tan', 'isnan', 'log', 'fabs', 'floor', 'modf', 'sqrt', 'frexp', 'degrees', 'pi', 'log10', 'sin', 'exp', 'copysign', 'cos', 'ceil', 'isinf', 'sinh', 'trunc', 'expm1', 'e', 'tanh', 'f', 'radians', 'fmod', 'log1p', 'gamma']
`%matplotlib` prevents importing * from pylab and numpy
In [15]:
x = linspace(-10., 10., 1000)
In [16]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [17]:
sg = plot(x, sin(x))

Numerical Computing with IPython

Although IPython's powerful shell and extended console can be advantageously used by any Python programmer, this package was originally a tool designed for scientists by scientists. It was indeed meant to provide a convenient way of doing interactive scientific computing with Python.

IPython does not really provide scientific computing features per se, but rather offers an interactive interface to powerful external libraries, such as NumPy, SciPy, Pandas, Matplotlib, and the like. Together, these tools provide a framework for scientific computing, which can compete with widespread commercial tools in the scientific community such as Matlab or Mathematica.

  • NumPy provides a multidimensional array object that supports optimized vector operations.
  • SciPy offers a wide variety of scientific algorithms (signal processing, optimization, and so on) that are based on NumPy.
  • Pandas proposes convenient data structures for tabular data coming from real-world data sets.
  • Matplotlib allows us to plot graphical figures easily, so as to visualize any form of data interactively and to generate publication-quality figures.

IPython provides the adequate interactive framework for using all these tools in a streamlined way.

Arrays

NumPy provides multidimensional arrays. Loops should be avoided every time it is possible to use a NumPy operation instead. An array is a block of data organized into several dimensions. A one-dimensional array is a vector, which is an ordered sequence of elements (typically numbers) that are indexed with a single integer. A two-dimensional array is a matrix, containing elements indexed by a pair of integers, that is, the row index and the column index. More generally, an n-dimensional array is a set of elements with the same data type that are indexed by a tuple of n integers.

All elements in an array must have the same type: it is called the data type (dtype). There are multiple possible types in NumPy: Booleans, signed/unsigned integers, single-precision/double-precision floating point numbers, complex numbers, strings, and so on. Custom data types can also be defined.

The advantage of arrays compared with native Python types is that it is possible to perform very efficient computations on arrays instead of relying on Python loops. The difference is that the loop is implemented internally in C by NumPy instead of Python, so that there is no longer the cost of interpretation within the loop. Indeed, Python being an interpreted, dynamically-typed language, each iteration involves various low-level operations performed by Python (type checking and so on). If NumPy is compiled with the adequate options, array computations can also benefit from vectorized CPU instructions (SSE, AVX, XOP, and so on, which use large 128-bit or 256-bit registers) and can be two to four times faster, or more.

To import NumPy in IPython, we can use the %pylab magic command (or start IPython with ipython --pylab), which loads NumPy and Matplotlib within the interactive namespace (available as np for numpy and plt for matplotlib.pyplot). The other possibility is to import NumPy with import numpy (or import numpy as np for the lazy ones) or from numpy import *. The former syntax is to be preferred in a script, while the latter can be used in an interactive session.

Important attributes of an array (for example, array_name) include:

  • shape: the array shape as a tuple of integers.
  • ndim: the number of dimensions, which is also len(array_name.shape).
  • size: the total number of elements (the product of all values in array_name.shape).
  • itemsize: the size in bytes of each element (four for an int32 data type, eight for float64, and so on).
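
A quick check of these attributes, assuming the %pylab namespace is loaded:

a = zeros((3, 4))
a.shape      # (3, 4)
a.ndim       # 2
a.size       # 12
a.itemsize   # 8, since the default dtype is float64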

In Python/NumPy, the brackets [] give access to the elements of a container object. Inside the brackets, the notation :,0 refers to all pairs of indices with any first element (the colon :) and a second element equal to zero. Since, in NumPy, the first dimension always refers to the row and the second dimension to the column, this refers precisely to the first column.

Example. First, we need to load the libraries with %pylab.

We generate some random data:

In [18]:
positions = rand(10000000,2)

We now look at some properties of the array:

In [19]:
type(positions)
Out[19]:
numpy.ndarray
In [20]:
positions.ndim, positions.shape
Out[20]:
(2, (10000000, 2))

We assign to x and y the first and second column, respectively:

In [21]:
x, y = positions[:,0], positions[:,1]

We compute the vector of distances from the position of interest (0.5, 0.5) to all positions:

In [22]:
distances = (x - .5) ** 2 + (y - .5) ** 2

and determine the closest element

In [23]:
%timeit -n10 -r3 exec(In[19])
10 loops, best of 3: 7.41 µs per loop
In [24]:
%timeit -n10 -r3 ibest = distances.argmin()
10 loops, best of 3: 11.9 ms per loop

Creating arrays

There are several ways of creating an array.

  • From scratch, element by element -- First, we can create an array by manually specifying its coefficients. This is the most direct way of creating an array, but it is not used very often in practice. The NumPy function array takes a list of elements and returns a corresponding NumPy array. For example, we can generate a two-dimensional array and force its elements to be 64-bit floats:
In [25]:
x = array([[1, 2, 3],[4, 5, 6]], dtype=float64)
In [26]:
x.shape
Out[26]:
(2, 3)
In [27]:
x.dtype
Out[27]:
dtype('float64')
  • From scratch, using predefined templates -- Creating arrays by specifying the individual coefficients manually is rarely practical. One can use any of the several convenient functions defined in NumPy to create typical arrays with the desired shape. For example, we can generate:
In [28]:
zeros((2,4))
Out[28]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
In [29]:
arange(2, 10, 2)
Out[29]:
array([2, 4, 6, 8])
In [30]:
%precision 2
linspace(-1.,1,10)
Out[30]:
array([-1.  , -0.78, -0.56, -0.33, -0.11,  0.11,  0.33,  0.56,  0.78,  1.  ])
  • From random values -- NumPy provides various random sampling routines for generating arrays with independent random values following different probability distributions.
    • rand() returns floating-point numbers uniformly sampled between 0 and 1,
    • randn() returns values sampled from a standard Gaussian distribution,
    • randint() returns random integers,
    • exponential() returns values sampled from an exponential distribution.

Related functions include shuffle and permutation, which randomly permute existing arrays.

In [31]:
%precision 4
rand(2,5)
Out[31]:
array([[ 0.8746,  0.8872,  0.3742,  0.3923,  0.8589],
       [ 0.3818,  0.5875,  0.5919,  0.8038,  0.7573]])
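
A minimal sketch of the permutation routines mentioned above (np is available after %pylab):

a = arange(10)
np.random.shuffle(a)           # permutes a in place
b = np.random.permutation(10)  # returns a new permuted copy of arange(10)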

Loading arrays

NumPy provides efficient and convenient ways of loading multidimensional arrays from text (Python strings or text/CSV files) or from binary buffers or files. In addition, the Pandas package is particularly useful when loading tabular data, that is, tables that contain heterogeneous data types instead of just numbers.

  • From a native Python object -- It is quite common to have data in some native Python object and to want to convert it into a NumPy array. The standard method is to use the array function. When we created arrays by directly specifying their values, we actually converted Python lists of numbers into arrays.
  • From a buffer or an external file -- Another common way of creating an array is to load data from a memory buffer or from a file, with either binary or string elements. If we know the exact data type of the elements, we can obtain a NumPy array with:
    • frombuffer() from a Python buffer object,
    • fromstring() accepts either ASCII text with values separated by any delimiter or binary data in any data type,
In [32]:
np.fromstring('1 2 5 10', dtype=int, sep=' ')
Out[32]:
array([ 1,  2,  5, 10])

Other functions allow us to load data from text files or binary files and convert them into NumPy arrays:

  • fromfile()
  • loadtxt()
  • genfromtxt()

The function fromfile() is highly efficient with binary data. loadtxt() is a simplified version of genfromtxt(), useful when the file format is straightforward:

In [33]:
cd /Users/gianx
/Users/gianx
In [34]:
np.loadtxt('Info_iPython_GDG.txt',delimiter=',')
Out[34]:
array([[ 1.,  2.],
       [ 1.,  3.],
       [ 1.,  4.],
       [ 1.,  5.],
       [ 1.,  6.]])

Saving arrays

Saving arrays to files is as easy as loading NumPy arrays. The main functions, sketched below, are:

  • save() saves an array into a binary .npy file,
  • savetxt() saves an array into a text file,
  • savez() saves several variables of any type (including NumPy arrays) into a single .npz archive, which can be read back with load().
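
A minimal sketch (the filenames are arbitrary):

a = rand(3, 3)
save('a.npy', a)                 # binary .npy file, reloaded with load('a.npy')
savetxt('a.txt', a)              # plain-text file, reloaded with loadtxt('a.txt')
savez('arrays.npz', a=a, b=2*a)  # several arrays in a single .npz archive
arrays = load('arrays.npz')      # arrays['a'] and arrays['b'] are available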

Working with arrays

Once NumPy arrays are created or loaded, there are basically three things that we can do with them.

  • Selection -- consists of accessing one or several elements within an array. It can be done with NumPy or Pandas.

    Example in NumPy. We will select all elements in an array whose value is not NaN, that is, filter the array with a Boolean condition. We can use the NumPy function isnan() and negate its result with the ~ operator (applied here to _, the last output), as follows:

In [35]:
x = array([ 1,  NaN,  5, 10])
In [36]:
isnan(x)
Out[36]:
array([False,  True, False, False], dtype=bool)
In [37]:
y = x[~_]
In [38]:
len(y), len(y) / float(len(x))
Out[38]:
(3, 0.7500)

We can also directly specify ranges of indices to keep, using slices. For instance, if x is a one-dimensional NumPy array, x[i:j:k] represents a view on x with only those elements having indices between i (included) and j (excluded) with a step of k. These conventions, with i included and j excluded, are convenient when working with consecutive portions of an array. For example, the first and second halves of x, assuming a size 2n, are simply x[:n] and x[n:]. In addition, the length of x[i:j] is simply j - i. In the end, there should not be +1 or -1 values hanging around in indices in general. If i is omitted, it is assumed to be zero. If j is omitted, it is assumed to be the length of the array in that dimension. Negative values mean we count from the end. Finally, the default value for k is one. This notation is also valid in multiple dimensions; for example, M[i:j,k:l] creates a submatrix view on a 2D array M. Also, we can use x[::-1] to get x in the reverse order.

IMPORTANT - An important point to consider with array views is that they point to the same location in memory. So a view on a large array does not imply memory allocation, and changing the values of elements in the view also changes the corresponding values in the original array. If this behavior is unwanted, it is possible to force the creation of a new array with y = x.copy() or y = array(x). In the latter case, it is also possible to change the data type of x, with the dtype keyword argument.
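
A short sketch of these view semantics:

x = zeros(5)
y = x[::2]     # a view on x: no data is copied
y[0] = 1.      # also sets x[0] to 1.
z = x.copy()   # an independent copy; modifying z leaves x unchanged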

Another way of selecting a portion of an array consists in passing an array with explicit integer values for indices. This is called fancy indexing. If x is a one-dimensional vector, and indices is another one-dimensional vector (or a list) with positive integers, then x[indices] returns a vector containing x[indices[0]], x[indices[1]], and so on. Therefore, the length of x[indices] is equal to the length of indices and not the length of x.
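
For example (the values are chosen arbitrarily):

x = arange(10, 20)
indices = [0, 3, 3, 7]
x[indices]     # array([10, 13, 13, 17]): same length as indices, not as x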

  • Manipulation -- Arrays can be manipulated and reshaped, which can sometimes be useful when performing vectorized computations. It is also possible to construct a new array from identical copies of an original array.
  • Reshaping
    • reshape() changes the shape of an array if the total number of elements is kept constant. It is possible to use -1 in at most one dimension in the argument of reshape to specify that its value must be automatically inferred.
    • ravel() removes all multidimensional structure in an array and returns a flattened vector;
    • squeeze() removes all single-dimensional entries from the shape of an array;
    • expand_dims() inserts a new axis in an array.
In [39]:
x = rand(6)
In [40]:
x.reshape((2, 3))
Out[40]:
array([[ 0.1497,  0.3954,  0.1432],
       [ 0.1131,  0.6682,  0.5531]])
In [41]:
x.reshape((2, -1))
Out[41]:
array([[ 0.1497,  0.3954,  0.1432],
       [ 0.1131,  0.6682,  0.5531]])
  • Repeating and concatenating
    • tile() and repeat() create copies of an array, either by concatenating identical copies of it along a specified axis, or by copying every coefficient any number of times. The second argument of repeat can also be a list reps, in which case the coefficient x[i] is repeated reps[i] times.
    • hstack(), vstack(), dstack(), or concatenate() join several arrays into a single array along the second (columns), first (rows), third dimension, or any specified axis, respectively.
    • hsplit(), vsplit(), dsplit(), or split() allow us to split an array into several consecutive subarrays along any dimension. The second argument of split is either an integer, n, in which case the array is split into n equal arrays, or a list with the indices where the array should be split (that is, the indices of the first element in each subarray except the first); see the sketch after the following example.
In [42]:
n = 10
M = arange(1, n + 1).reshape((-1, 1))
M
Out[42]:
array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])
In [43]:
M = tile(M, (1, n))
M
Out[43]:
array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
       [ 6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
       [ 7,  7,  7,  7,  7,  7,  7,  7,  7,  7],
       [ 8,  8,  8,  8,  8,  8,  8,  8,  8,  8],
       [ 9,  9,  9,  9,  9,  9,  9,  9,  9,  9],
       [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]])
In [44]:
N = arange(1, n + 1).reshape((1, -1))
N
Out[44]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
In [45]:
N = tile(N, (n, 1))
N
Out[45]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
In [46]:
M*N
Out[46]:
array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])
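
The splitting functions listed above work symmetrically to the stacking ones; a minimal sketch:

x = arange(8)
split(x, 2)        # [array([0, 1, 2, 3]), array([4, 5, 6, 7])]
split(x, [2, 5])   # [x[:2], x[2:5], x[5:]]
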
  • Broadcasting

In the previous multiplication table example, we had to repeat identical copies of a row and a column so that we could multiply the two arrays with identical shapes (n, n). Actually, the repeat step is unnecessary, as arrays with different shapes can still be compatible under specific conditions; this is called broadcasting. The general rule is that two dimensions are compatible when they are equal, or when one of them is 1.

In [47]:
M = arange(1, n + 1).reshape((-1, 1))
M
Out[47]:
array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])
In [48]:
N = arange(1, n + 1).reshape((1, -1))
N
Out[48]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
In [49]:
M*N
Out[49]:
array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])
  • Permuting
    • transpose() permutes the dimensions of an array. The indices describing the permutation can be provided in the axes keyword argument.
    • fliplr() and flipud() flip an array in the left/right or up/down direction,
    • roll() performs a circular permutation of the elements along a given axis,
    • rot90() rotates an array by 90 degrees in the counter-clockwise direction.
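
A quick tour of these routines on a small matrix:

a = arange(6).reshape((2, 3))
transpose(a)         # shape (3, 2); same as a.T
fliplr(a)            # reverses the order of the columns
flipud(a)            # reverses the order of the rows
roll(a, 1, axis=1)   # circular shift of the columns by one position
rot90(a)             # rotates the array by 90 degrees counter-clockwise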

Computation

If A and B are two NumPy arrays with compatible shapes, A + B, A - B, A * B, and A / B are element-wise operations. In particular, when A and B are two-dimensional matrices, A * B is not the matrix product. The matrix product is rather provided by the dot function, which more generally computes the dot product of two arrays.
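
To make the distinction concrete:

A = array([[1, 2], [3, 4]])
B = array([[5, 6], [7, 8]])
A * B       # element-wise: array([[ 5, 12], [21, 32]])
dot(A, B)   # matrix product: array([[19, 22], [43, 50]])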

Common unary operations include

  • -A,
  • A**x raises each coefficient to the power x,
  • abs(A) absolute value,
  • sign(A) an array with -1, 0, or 1 depending on the sign of each element,
  • floor(A) floor of each element,
  • sqrt(A) square root,
  • log(A) natural logarithm,
  • exp(A) exponential,
  • a lot of other mathematical functions (trigonometric, hyperbolic, arithmetic functions, and so on).
  • sum computes the sum of all elements in an array or in a given dimension. The axis keyword argument specifies the dimension over which the sum is performed. This function returns an array with one dimension less than the original array.
  • prod computes the product of all elements in an array or in a given dimension, with the same axis keyword argument and the same behavior regarding dimensions.
  • max and min return the largest and smallest values in an array or in a given dimension.
  • argmin and argmax functions return the index of the smallest or largest element of the array.
  • Statistical functions such as mean, median, std, and var compute the mean, median, standard deviation, and variance of the elements along a given dimension or across the whole array.
  • Related functions that can be useful when simulating mathematical models include diff (discrete difference), cumsum (cumulative sum), and cumprod (cumulative product). The diff function computes a discrete derivative of a signal (up to a scalar coefficient), whereas cumsum computes a discrete indefinite integral of a signal.
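
For example:

a = array([1, 2, 4, 8])
diff(a)      # array([1, 2, 4]): discrete difference
cumsum(a)    # array([ 1,  3,  7, 15]): cumulative sum
cumprod(a)   # array([ 1,  2,  8, 64]): cumulative product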

Advanced computing

NumPy provides all necessary types and routines for doing efficient numerical computations with Python. SciPy is built on top of NumPy and implements a large variety of higher-level mathematical processing algorithms. Linear algebra routines are provided by the scipy.linalg subpackage: solvers of linear equations, matrix routines, eigenvalue problems, matrix decomposition, and so on.

  • Optimization routines are provided by the scipy.optimize subpackage: unconstrained or constrained minimization of real-valued functions, global optimization, curve fitting, and so on.
  • Numerical integrators are provided by the scipy.integrate subpackage. It can be used to solve differential equations, for example, in physics simulation engines.
  • Signal processing algorithms are implemented in the scipy.signal subpackage: convolutions, linear filters, wavelets, and so on. The scipy.fftpack subpackage implements Fourier transform routines, and the scipy.ndimage subpackage implements several image processing algorithms. Finally, other image processing packages of interest include scikit-image, PIL, and OpenCV (computer vision).
  • Statistical routines are provided by the scipy.stats subpackage: probability distributions, descriptive statistics and statistical tests, and so on. The scipy.cluster subpackage implements clustering algorithms that can be useful to find categories in unstructured data. Other statistical packages of interest include Pandas and scikit-learn (machine learning).
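
As a tiny taste of SciPy, here is a sketch of numerical integration with scipy.integrate (the integral of x**2 over [0, 1] is 1/3):

from scipy import integrate
value, error = integrate.quad(lambda x: x ** 2, 0., 1.)
# value is approximately 0.3333; error is an estimate of the numerical error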

Interactive Plotting and Graphical Interfaces

There are a lot of Python packages for curve plotting, but the most widely used one, by far, is Matplotlib, whose interface is much like Matlab's. It is one of the most complete and powerful graphical libraries: it can be used both for interactive visualization and for generating high-quality figures from NumPy data that can be readily used in scientific publications. In addition, its high-level interface makes it particularly easy to use.

Figures can be displayed interactively in IPython using event loop integration. Then, they can be updated dynamically from the command-line interface. The %pylab magic command (or the --pylab option to the ipython shell command) activates this integration automatically.

When using Matplotlib from a script instead of from IPython, we can put the from pylab import * command at the top of the script. In a Python module, it might be a better idea to use import matplotlib.pyplot as plt so that the Matplotlib objects stay within their namespace.

Also, the way of generating plots is slightly different in a script compared to IPython. In a script, the figure is displayed only when the function show() is called, typically, at the very end of the script, whereas, in the IPython command-line interface, the figure is shown and updated at each plot function.

Matplotlib can also be used in the notebook. When launching the notebook with ipython notebook --pylab inline, the plots appear in the output cells as images and are saved as base64 strings within the IPYNB files. Without this inline option, figures are displayed in separate windows as usual. It is also possible to activate this option within the notebook by using the command %pylab inline.

A wide variety of figure examples can be found in the Matplotlib Gallery on the official website (http://matplotlib.org/gallery.html) and in Nicolas Rougier's tutorial (http://www.loria.fr/~rougier/teaching/matplotlib/).

  • Plot -- When the plot function receives a single vector as an argument, it assumes that this vector contains values on the y axis, whereas values on the x axis are automatically generated as integers from 0 to len(y) - 1. To explicitly specify the values on the x axis, we can use the following command: plot(x,y).
In [50]:
y = rand(1000)
In [51]:
plot(y)
Out[51]:
[<matplotlib.lines.Line2D at 0x12634c710>]
  • Scatter plot -- Scatter plots represent sets of points in two dimensions, using pixels or any other marker.
In [52]:
x = randn(10000)
y = randn(10000)
In [53]:
plot(x,y,',')
Out[53]:
[<matplotlib.lines.Line2D at 0x12667cb90>]
  • Bar graphs -- A bar graph is typically used for histograms, representing the distribution of values at different intervals. The hist function in Matplotlib accepts a vector of values and plots a histogram. The bins keyword allows us to specify either the number of bins or the list of bins.
In [54]:
hist(x,bins=20);
  • Plot customization -- Matplotlib offers a lot of customization options. Here, we will see how to change styles and colors in figures.
    • xticks and yticks set the exact positions of the ticks,
    • grid adds a grid,
    • xlim and ylim specify the extent of the x and y coordinates,
    • xlabel and ylabel set the axes labels,
    • legend displays the legend; the label of each line is set with the label keyword argument of the plot function,
    • title displays the name of the figure.
In [55]:
x = linspace(-10.,10.,501)
In [56]:
plot(x,sin(x), '-r', label='sine')
plot(x,cos(x),'--g', label='cosine')
xticks([-10.,0,10.])
yticks([-2.,0,2.])
xlabel('x axis')
ylabel('y axis')
legend()
grid()

Multiple independent plots can be displayed on the same figure. We can define a grid with an arbitrary number of rows and columns and plot figures inside each box. Boxes can even span several rows or columns (using subplot2grid).

In [57]:
x = linspace(0, 2 * pi, 1000)
y = 1 + 2 * cos(5 * x)
subplot(1,2,1)
plot(x, y)
subplot(1,2,2, polar=True)
polar(x, y)
Out[57]:
[<matplotlib.lines.Line2D at 0x1267b4b90>]
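
subplot2grid takes the grid shape and the box location, plus optional spans; a minimal sketch:

ax1 = subplot2grid((3, 3), (0, 0), colspan=3)  # first row, spanning all columns
ax2 = subplot2grid((3, 3), (1, 0), rowspan=2)  # first column of the last two rows
ax3 = subplot2grid((3, 3), (1, 1), colspan=2, rowspan=2)
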
  • 3D plot -- Matplotlib includes a 3D toolkit called mplot3d that can be used for basic 3D plots, such as 3D curves, surface plots, and likewise.
In [58]:
from mpl_toolkits.mplot3d import Axes3D
In [59]:
# we create a (X, Y) grid
X = linspace(-5, 5, 50)
Y = X
X, Y = meshgrid(X, Y)
# we compute the Z values 
R = sqrt(X**2 + Y**2)
Z = sin(R)
In [60]:
ax = gca(projection='3d')
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=mpl.cm.winter, linewidth=0)

Matplotlib has the option to use LaTeX to manage all text layout. The LaTeX option is activated by setting text.usetex : True in your rc settings. Text handling with matplotlib’s LaTeX support is slower than matplotlib’s very capable mathtext, but is more flexible, since different LaTeX packages (font packages, math packages, etc.) can be used. The results can be striking, especially when you take care to use the same fonts in your figures as in the main document.

There are a couple of options to mention, which can be changed using rc settings. Here is an example matplotlibrc file:

* font.family        : serif
* font.serif         : Times, Palatino, New Century Schoolbook, Bookman, Computer Modern Roman
* font.sans-serif    : Helvetica, Avant Garde, Computer Modern Sans serif
* font.cursive       : Zapf Chancery
* font.monospace     : Courier, Computer Modern Typewriter
* text.usetex        : true

The first valid font in each family is the one that will be loaded. If the fonts are not specified, the Computer Modern fonts are used by default. All of the other fonts are Adobe fonts. Times and Palatino each have their own accompanying math fonts, while the other Adobe serif fonts make use of the Computer Modern math fonts.

To use LaTeX and select Helvetica as the default font, without editing matplotlibrc, use:

In [61]:
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica']})
rc('text', usetex=True)

Here is the standard example, tex_demo.py:

In [62]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['rc']
`%matplotlib` prevents importing * from pylab and numpy
In [63]:
"""
Demo of TeX rendering.

You can use TeX to render all of your matplotlib text if the rc
parameter text.usetex is set.  This works currently on the agg and ps
backends, and requires that you have tex and the other dependencies
described at http://matplotlib.sf.net/matplotlib.texmanager.html
properly installed on your system.  The first time you run a script
you will see a lot of output from tex and associated tools.  The next
time, the run may be silent, as a lot of the information is cached in
~/.tex.cache

"""
import numpy as np
import matplotlib.pyplot as plt


# Example data
t = np.arange(0.0, 1.0 + 0.01, 0.01)
s = np.cos(4 * np.pi * t) + 2

plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.plot(t, s)

plt.xlabel(r'\textbf{time} (s)')
plt.ylabel(r'\textit{voltage} (mV)',fontsize=16)
plt.title(r"\TeX\ is Number "
          r"$\displaystyle\sum_{n=1}^\infty\frac{-e^{i\pi}}{2^n}$!",
          fontsize=16, color='gray')
# Make room for the ridiculously large title.
plt.subplots_adjust(top=0.8)

plt.savefig('tex_demo')
plt.show()

Data fitting

A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit(), which is a wrapper around scipy.optimize.leastsq(). Since lmfit's minimize() is also a high-level wrapper around scipy.optimize.leastsq(), it can be used for curve-fitting problems, but requires more effort than using scipy.optimize.curve_fit().

Here we discuss lmfit’s Model class. This takes a model function – a function that calculates a model for some data – and provides methods to create parameters for that model and to fit data using that model function. This is closer in spirit to scipy.optimize.curve_fit(), but with the advantages of using Parameters and lmfit.

In addition to allowing you to turn any model function into a curve-fitting method, lmfit also provides canonical definitions for many known line shapes, such as Gaussian or Lorentzian peaks and exponential decays, that are widely used in many scientific domains.

Model

The Model class allows us to easily wrap a model function. It automatically generates the appropriate residual function and determines the corresponding parameter names from the function signature itself.

The Model knows the names of the parameters and the independent variables. By default, the first argument of the function is taken as the independent variable, held in independent_vars, and the rest of the function's positional arguments (and, in certain cases, keyword arguments – see below) are used for Parameter names.

On creation of the model, parameters are not created yet. The model knows what the parameters should be named, but nothing about the scale and range of your data. You will normally have to make these parameters and assign initial values and other attributes. To help you do this, each model has a make_params() method that generates parameters with the expected names. This creates the Parameters but doesn't necessarily give them initial values – again, the model has no idea what the scale should be. You can set initial values for parameters with keyword arguments to make_params(), or assign them (and other parameter properties) after the Parameters object has been created.

A Model has several methods associated with it. For example, one can use the eval() method to evaluate the model or the fit() method to fit data to this model with a Parameter object. Both of these methods can take explicit keyword arguments for the parameter values. The fit() method returns a ModelFit object of results, which has many components, including a fit_report() method, an init_fit attribute for the fit with the initial parameter values, and a best_fit attribute for the fit with the best-fit parameter values.
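
Before the full composite example below, here is a minimal sketch of wrapping a simple function with Model (the function and data are hypothetical):

import numpy as np
from lmfit import Model

def decay(t, amp, tau):
    "single exponential decay"
    return amp * np.exp(-t / tau)

mod = Model(decay)                        # 't' becomes the independent variable
pars = mod.make_params(amp=5.0, tau=2.0)  # Parameters named 'amp' and 'tau'

t = np.linspace(0, 10, 101)
y = decay(t, 4.3, 1.8) + np.random.normal(size=t.size, scale=0.1)

result = mod.fit(y, pars, t=t)            # returns a ModelFit object
print(result.fit_report())                # best-fit values and uncertainties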

One of the more interesting features of the Model class is that Models can be added together or combined with basic algebraic operations (add, subtract, multiply, and divide) to give a composite model. The composite model will have parameters from each of the component models, with all parameters being available to influence the whole model. Each component can be evaluated after the fit using the ModelFit.eval_components() method of the result (for example, comps = result.eval_components()), which returns a dictionary of the components, keyed by the model name (or prefix if that is set). This uses the parameter values in result.params and the independent variables (x) used during the fit. Note that while the ModelFit held in result does store the best parameters and the best estimate of the model in result.best_fit, the original model and parameters in pars are left unaltered.

The combination works if the argument names for the model functions do not overlap. If they did, the prefix argument to Model would allow us to identify which parameter goes with which component model. Using composite models with the built-in models provides a simple way to build up complex models. If you want to use a binary operator other than add, subtract, multiply, or divide, which are supported through normal Python syntax, you'll need to explicitly create a CompositeModel with the appropriate binary operator.

Example:

In [64]:
import numpy as np
from lmfit import Model, CompositeModel
from lmfit.lineshapes import step, gaussian

import matplotlib.pyplot as plt

# create data from broadened step
npts = 201
x = np.linspace(0, 10, npts)
y = step(x, amplitude=12.5, center=4.5, sigma=0.88, form='erf')
y = y + np.random.normal(size=npts, scale=0.35)

def jump(x, mid):
    "heaviside step function"
    o = np.zeros(len(x))
    imid = max(np.where(x<=mid)[0])
    o[imid:] = 1.0
    return o

def convolve(arr, kernel):
    # simple convolution of two arrays
    npts = min(len(arr), len(kernel))
    pad  = np.ones(npts)
    tmp  = np.concatenate((pad*arr[0], arr, pad*arr[-1]))
    out  = np.convolve(tmp, kernel, mode='valid')
    noff = int((len(out) - npts)/2)
    return out[noff:noff+npts]

# create Composite Model using the custom convolution operator
mod  = CompositeModel(Model(jump), Model(gaussian), convolve)

pars = mod.make_params(amplitude=1, center=3.5, sigma=1.5, mid=5.0)

# 'mid' and 'center' should be completely correlated, and 'mid' is
# used as an integer index, so a very poor fit variable:
pars['mid'].vary = False

# fit this model to data array y
result =  mod.fit(y, params=pars, x=x)

print(result.fit_report())

# Use 'False' to plot as usual, while 'True' for plotting each components
plot_components = False 

# plot results
plt.plot(x, y,         'bo')
if plot_components:
    # generate components
    comps = result.eval_components(x=x)
    plt.plot(x, 10*comps['jump'], 'k--')
    plt.plot(x, 10*comps['gaussian'], 'r-')
else:
    plt.plot(x, result.init_fit, 'k--')
    plt.plot(x, result.best_fit, 'r-')
plt.show()
[[Model]]
    (Model(jump) <function convolve at 0x127a46b90> Model(gaussian))
[[Fit Statistics]]
    # function evals   = 21
    # data points      = 201
    # variables        = 3
    chi-square         = 21.282
    reduced chi-square = 0.107
[[Variables]]
    sigma:       0.61213790 +/- 0.012764 (2.09%) (init= 1.5)
    mid:         5 (fixed)
    amplitude:   0.62152923 +/- 0.001768 (0.28%) (init= 1)
    center:      4.52045222 +/- 0.009217 (0.20%) (init= 3.5)
[[Correlations]] (unreported correlations are <  0.100)
    C(amplitude, center)         =  0.335 
    C(sigma, amplitude)          =  0.273 

High-Performance and Parallel Computing

A recurring argument against using Python for high-performance numerical computing is that this language is slow, because it is dynamic and interpreted. A compiled lower-level language such as C can often be orders of magnitude faster.

Operations on NumPy arrays can be almost as fast as C because slow Python loops are transparently replaced with fast C loops. Sometimes, though, vectorization is impossible or difficult to implement for some complex algorithms. In these cases, there are fortunately solutions other than throwing away all Python code and coding everything again in C.

First, one can take advantage of the multiple cores that are now present in any computer. A standard Python process normally runs on a single core, but it is possible to distribute tasks across multiple cores and even multiple computers in parallel. This is particularly easy to do with IPython, as sketched below. MPI can also be easily used with a few lines of code.
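
For instance, with IPython's parallel machinery (this sketch assumes an ipcluster has already been started, for example with ipcluster start -n 4):

from IPython.parallel import Client

rc = Client()      # connect to the running cluster
view = rc[:]       # a direct view on all engines
squares = view.map_sync(lambda n: n ** 2, range(10))  # distribute the calls
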

Another popular solution is to first detect the time-critical section of a Python algorithm and then replace it with C code. Typically, only a very small section of the Python code is responsible for most of the algorithm's duration, so that it is possible to keep the rest of the code in Python. Cython is an external package which makes this task easier than it sounds: it offers a superset of Python that is compiled and that can be seamlessly integrated within Python code. It is particularly convenient to use it with IPython.