6. numpy arrays

This section is about the main package for handling numerical data in Python: numpy.

In the previous section, we already used numpy to hold a 2D array of numbers. This is what numpy is best at: storing and working with arrays of any number of dimensions whose entries all have the same type. We only use 1D and 2D numpy arrays in this course, so I will not talk about higher dimensions here.

First, let’s import numpy:

import numpy as np

6.1. Creating numpy arrays

In addition to loading a dataset from a file on disk, there are several functions to create arrays directly in Python. For example, we can write the values as a list and then convert it to a numpy array using the np.array function:

a = np.array([0, 1, 2, 3,4,5,6,7,8,9,10])
a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

2D arrays can be created using lists of lists:

b = np.array([[0, 1,2,3], [4,5,6, 7],[8,9,10,11],[12,13,14,15]])
b
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

When creating arrays, numpy tries to guess the data type, and the guess is not always the one you want. For example, since a and b were created from lists that only contain integers, numpy created arrays of int64 (the default integer type on most platforms). We can check the type of an array using the .dtype attribute:

b.dtype
dtype('int64')

If you want to make sure an array has a specific type, you can pass an optional dtype argument to the array creation functions:

b_floats = np.array([[0, 1,2,3], [4,5,6, 7],[8,9,10,11],[12,13,14,15]], dtype=float)
b_floats
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.]])
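To see how the guessing works, here is a small sketch. The variable names ints, mixed and forced are only made up for this illustration. It shows that a single float in the input list is enough for numpy to choose a float type for the whole array:

```python
import numpy as np

# Only integers: numpy picks an integer dtype (int64 on most platforms).
ints = np.array([1, 2, 3])

# One float in the list makes the whole array a float array.
mixed = np.array([1, 2.5, 3])

# Requesting the dtype explicitly, as in the example above.
forced = np.array([1, 2, 3], dtype=float)

print(ints.dtype.kind)   # i  (signed integer)
print(mixed.dtype)       # float64
print(forced.dtype)      # float64
```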

There are also functions for creating arrays full of zeros or ones, np.zeros and np.ones. These take a tuple argument that sets the shape (rows, columns) of the output array:

np.zeros((2,3))
array([[0., 0., 0.],
       [0., 0., 0.]])
np.ones((4,5))
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

Other useful ways to generate arrays are np.linspace, which creates an array with a given number of equally spaced values between two numbers

np.linspace(-1,1,25)
array([-1.        , -0.91666667, -0.83333333, -0.75      , -0.66666667,
       -0.58333333, -0.5       , -0.41666667, -0.33333333, -0.25      ,
       -0.16666667, -0.08333333,  0.        ,  0.08333333,  0.16666667,
        0.25      ,  0.33333333,  0.41666667,  0.5       ,  0.58333333,
        0.66666667,  0.75      ,  0.83333333,  0.91666667,  1.        ])

and np.arange, which creates an array with a given step size between two numbers (similar to the range function, so the end value is not included):

np.arange(-2,3,.2)
array([-2.0000000e+00, -1.8000000e+00, -1.6000000e+00, -1.4000000e+00,
       -1.2000000e+00, -1.0000000e+00, -8.0000000e-01, -6.0000000e-01,
       -4.0000000e-01, -2.0000000e-01, -4.4408921e-16,  2.0000000e-01,
        4.0000000e-01,  6.0000000e-01,  8.0000000e-01,  1.0000000e+00,
        1.2000000e+00,  1.4000000e+00,  1.6000000e+00,  1.8000000e+00,
        2.0000000e+00,  2.2000000e+00,  2.4000000e+00,  2.6000000e+00,
        2.8000000e+00])
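The difference between the two functions is easiest to see side by side. The following sketch uses smaller, made-up ranges than the examples above. Note also that the value -4.4408921e-16 in the np.arange output above is simply floating-point rounding error where you would expect 0.

```python
import numpy as np

# np.linspace takes the NUMBER of points and includes both end points.
points = np.linspace(0, 1, 5)

# np.arange takes the STEP size and excludes the stop value, like range.
steps = np.arange(0, 1, 0.25)

print(points)  # [0.   0.25 0.5  0.75 1.  ]
print(steps)   # [0.   0.25 0.5  0.75]
```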

6.2. Indexing numpy arrays

Indexing into numpy arrays uses the [] indexing syntax. For the 1D array a that we created before, this behaves as it would for a list. When we type in a[0], we get the zeroth element back:

a[0]
0

When we type in a[-1], we get the last element back:

a[-1]
10

We can also slice the array, selecting a range of values using the colon notation:

a[1:3]
array([1, 2])

For the 2D array b we have another dimension to index. If we use only the first index, then we get numpy arrays back. b[0] returns a numpy array that corresponds to the first row:

b[0]
array([0, 1, 2, 3])

Again, slicing works here too, returning a 2D array:

b[:3]
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

If we want to select a column, we need to replace the first index with a colon (meaning “select all”) and then use the column number as the second index. b[:,1] returns a 1D numpy array containing the contents of column 1 (the second column, since counting starts at 0):

b[:,1]
array([ 1,  5,  9, 13])

Here, too, slicing can be used to select multiple columns:

b[:,:2]
array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]])

You can also combine row and column indexes to select a rectangular part of the array. For example, indexing b[:3,2] will select rows 0 to 2 from column 2:

b[:3,2]
array([ 2,  6, 10])

b[1:3,:3] selects rows 1 and 2 from columns 0, 1 and 2:

b[1:3,:3]
array([[ 4,  5,  6],
       [ 8,  9, 10]])
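As a compact recap, the sketch below puts the different kinds of 2D indexing next to each other and checks what comes back. It also shows the case of two plain numbers as indexes, which returns a single element. The names element, row, column and block are only chosen for this illustration.

```python
import numpy as np

# Same values as the array b used above.
b = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])

element = b[1, 2]    # row 1, column 2 -> a single number
row = b[1]           # all of row 1 -> 1D array
column = b[:, 2]     # all of column 2 -> 1D array
block = b[1:3, :3]   # rows 1 and 2, columns 0 to 2 -> 2D array

print(element)       # 6
print(row.shape)     # (4,)
print(column.shape)  # (4,)
print(block.shape)   # (2, 3)
```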

We can also use boolean indexing with numpy arrays. This allows us to select specific elements by indexing with an array that contains bools instead of numbers: an element is selected wherever the boolean array is True. We can type in those arrays by hand (lists also work for this purpose). In this example, we select elements 0, 1 and 3 from our array a:

a[[True, True, False, True, False, False, False, False, False, False, False]]
array([0, 1, 3])

We can also create boolean arrays from numpy arrays using comparison operators, for example, to select all positive values in an array. First we create a new array c using np.linspace:

c = np.linspace(-10,10, 5)
c
array([-10.,  -5.,   0.,   5.,  10.])

Then, we use c>0 to create an array that contains True wherever the entry in c is larger than 0:

c>0
array([False, False, False,  True,  True])

If we use this expression c>0 in the indexing brackets, we select all elements that are larger than 0:

c[c>0]
array([ 5., 10.])

We can also combine boolean arrays using logic operators. These calculate a new bool array by applying the logic operation element-wise. We have previously encountered logic operators when we were talking about if blocks. The same operators are available for arrays as well, but with a slightly different notation. The following table shows the logic operators we’ve already seen in the left column and the corresponding operators for arrays in the right column:

operator    array
and         &
or          |
not         ~

So, by combining (c<5) and (c>-5) with &, we can select all elements with an absolute value less than 5:

c[(c<5)&(c>-5)]
array([0.])
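The other two operators from the table work in the same way. Here is a short sketch with the same values as c; the names inside, outside and not_inside are made up for this example. The parentheses around each comparison are needed because & and | bind more tightly than < and > in Python.

```python
import numpy as np

c = np.linspace(-10, 10, 5)      # [-10., -5., 0., 5., 10.]

inside = (c < 5) & (c > -5)      # element-wise "and"
outside = (c <= -5) | (c >= 5)   # element-wise "or"
not_inside = ~inside             # element-wise "not"

print(c[inside])      # [0.]
print(c[outside])     # [-10.  -5.   5.  10.]
print(c[not_inside])  # [-10.  -5.   5.  10.]
```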

6.3. Array assignments

To assign a value to a specific element of an array, we just have to put the indexing expression on the left side of an assignment. Before the assignment, our array looks like this:

c
array([-10.,  -5.,   0.,   5.,  10.])

Now we select the zeroth element and replace it with -11:

c[0] = -11 
c
array([-11.,  -5.,   0.,   5.,  10.])

When we are slicing or using boolean indexing on the left side and write a single element on the right side, then all selected values are replaced by the same element:

c[c<0] = -1
c
array([-1., -1.,  0.,  5., 10.])

On the other hand, we can also use a list or array on the right side to replace each selected element with its own value. It has to contain as many values as there are selected elements:

c[c>0] = [2,3]
c
array([-1., -1.,  0.,  2.,  3.])

This is also where the dtype of the array can lead to behavior that looks strange at first glance:

int_array = np.array([1,2,3,4])
int_array
array([1, 2, 3, 4])

This array was automatically created with dtype int64:

int_array.dtype
dtype('int64')

When we assign a float value to one of the elements

int_array[2] = 1.5

it is automatically converted to int64, which drops the fractional part:

int_array
array([1, 2, 1, 4])
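If you need to store fractional values, the array has to have a float dtype before the assignment. The sketch below shows two possible ways to get there. The astype method is not covered elsewhere in this section; it returns a converted copy of the array. The names float_array and converted are made up for this example.

```python
import numpy as np

int_array = np.array([1, 2, 3, 4])
int_array[2] = 1.5             # the fractional part is dropped
print(int_array)               # [1 2 1 4]

# Option 1: create the array with a float dtype from the start.
float_array = np.array([1, 2, 3, 4], dtype=float)
float_array[2] = 1.5
print(float_array)             # [1.  2.  1.5 4. ]

# Option 2: convert an existing array; astype returns a new array.
converted = int_array.astype(float)
converted[2] = 1.5
print(converted.dtype)         # float64
```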

6.4. Arithmetic with numpy arrays

We can use the operators +, -, *, / and ** with numpy arrays.

If two arrays of the same size are on both sides of the operator, the operation is performed element-wise. If they have different sizes, “array broadcasting” happens: in every dimension, the arrays either need to have the same size or one of them has to have size 1. The smaller array is then replicated so that its size matches that of the bigger one, and the operation is again performed element-wise. An extreme case is the combination with a single number, which is treated as having size 1 in all dimensions. Let’s look at some examples.

First, we will create an array:

just_ones = np.ones((3,2))
just_ones
array([[1., 1.],
       [1., 1.],
       [1., 1.]])

We multiply it by 2. The number is broadcast to the size of the array before the operation is applied:

just_ones * 2
array([[2., 2.],
       [2., 2.],
       [2., 2.]])

Let’s create another array with different numbers:

different_numbers = np.array([[1,2],[3,4],[5,6]])
different_numbers
array([[1, 2],
       [3, 4],
       [5, 6]])

Adding two arrays of the same size is done element-wise and results in a third array of that size:

different_numbers + different_numbers
array([[ 2,  4],
       [ 6,  8],
       [10, 12]])

When we combine arrays of different sizes, the smaller one is replicated to fit:

different_numbers, different_numbers[1]
(array([[1, 2],
        [3, 4],
        [5, 6]]),
 array([3, 4]))
different_numbers + different_numbers[1]
array([[ 4,  6],
       [ 6,  8],
       [ 8, 10]])
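To make the broadcasting rule a bit more concrete, here is a sketch with one array that is replicated along the rows and one that is replicated along the columns. The arrays row and column are made up for this example. The last part shows a combination of shapes that does not fit the rule and therefore raises an error.

```python
import numpy as np

different_numbers = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

row = np.array([10, 20])                   # shape (2,): one value per column
column = np.array([[100], [200], [300]])   # shape (3, 1): one value per row

print(different_numbers + row)
# [[11 22]
#  [13 24]
#  [15 26]]
print(different_numbers + column)
# [[101 102]
#  [203 204]
#  [305 306]]

# Shapes (3, 2) and (3,) disagree in the last dimension (2 vs. 3),
# and neither of them is 1, so broadcasting is not possible.
try:
    different_numbers + np.array([1, 2, 3])
except ValueError as error:
    print("broadcasting failed:", error)
```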

For 2D arrays, the @ operator is used for matrix multiplication (here, the usual restrictions on the sizes of the multiplied matrices apply). .T transposes an array:

just_ones.T @ just_ones
array([[3., 3.],
       [3., 3.]])
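Because * and @ are easy to mix up, the following sketch contrasts the two on a small array m, which is made up for this example. It also shows what happens when the sizes do not fit for a matrix product.

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

elementwise = m * m   # element-wise product, shape stays (3, 2)
gram = m.T @ m        # matrix product: (2, 3) @ (3, 2) -> shape (2, 2)

print(elementwise.shape)   # (3, 2)
print(gram)
# [[35 44]
#  [44 56]]

# (3, 2) @ (3, 2) does not work: the inner sizes 2 and 3 differ.
try:
    m @ m
except ValueError as error:
    print("matrix product failed:", error)
```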

6.5. numpy functions

The numpy package includes many mathematical functions that are optimized for numpy arrays. When working with numpy and pandas arrays, use these functions instead of those from the math package.

Some useful functions:

  • General functions: np.abs, np.sqrt, np.exp, np.log

  • Trigonometric functions: np.sin, np.cos, np.tan

  • Statistical functions: np.mean, np.std, np.max, np.min

  • Sums and products: np.sum, np.prod
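A quick sketch of why the numpy versions are the ones to use: they are applied to every element of an array, while the functions from the math package expect a single number. The array values is made up for this example.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# The numpy function works element-wise on the whole array.
print(np.sqrt(values))   # [1. 2. 3.]

# The math function only accepts a single number and fails for this array.
try:
    math.sqrt(values)
except TypeError as error:
    print("math.sqrt failed:", error)
```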

Functions that reduce an array down to a single value, like np.mean or np.sum, also have an optional argument axis with default value None. If you do not want, e.g., the sum of all elements of the array, but the sums along a specific axis, you can specify the axis there. For example, without the axis argument, all elements are summed:

np.sum(just_ones)
6.0

With axis=1, the sum runs over the columns, giving one value per row:

np.sum(just_ones, axis=1)
array([2., 2., 2.])
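Since just_ones only contains ones, it is hard to tell the axes apart in that result. Here is the same idea sketched with the values from different_numbers, where the row sums and column sums differ. The names total, per_column and per_row are made up for this example.

```python
import numpy as np

different_numbers = np.array([[1, 2], [3, 4], [5, 6]])   # 3 rows, 2 columns

total = np.sum(different_numbers)                # all elements
per_column = np.sum(different_numbers, axis=0)   # sums down the rows
per_row = np.sum(different_numbers, axis=1)      # sums across the columns

print(total)        # 21
print(per_column)   # [ 9 12]
print(per_row)      # [ 3  7 11]
```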

6.6. Arrays are “mutable”

numpy arrays are a mutable data type. And we know what that means:

When we assign the array to another variable

just_ones_2 = just_ones
just_ones
array([[1., 1.],
       [1., 1.],
       [1., 1.]])

And then change that variable

just_ones_2[0,0]  = 2

then the first array is also affected

just_ones
array([[2., 1.],
       [1., 1.],
       [1., 1.]])

If you want to be on the safe side, use .copy() when assigning to another variable:

just_ones_2 = just_ones.copy()

That way, when you change the array in the second variable

just_ones_2[:,1] = 0

It only affects that variable

just_ones_2
array([[2., 0.],
       [1., 0.],
       [1., 0.]])

And not the original one

just_ones
array([[2., 1.],
       [1., 1.],
       [1., 1.]])
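If you are unsure whether two variables refer to the same array, one way to check is Python's is operator. The sketch below repeats the steps from above in compact form; the names original, alias and independent are made up for this example.

```python
import numpy as np

original = np.ones((3, 2))
alias = original                # a second name for the SAME array
independent = original.copy()   # a new array with the same values

alias[0, 0] = 2                 # also changes original

print(alias is original)        # True
print(independent is original)  # False
print(original[0, 0])           # 2.0
print(independent[0, 0])        # 1.0
```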

6.7. Summary

1. Create arrays using np.array, np.zeros, np.ones or np.genfromtxt

Don’t forget that dtype might matter

2. Operations are applied element-wise

With broadcasting used to increase the size of the smaller array

3. Functions

  1. Use functions from numpy for arrays, not from math

  2. Use the axis keyword argument if you want row-wise/column-wise results

4. numpy arrays are mutable