9. Fits and stuff

When writing this course, one of the problems I ran into was that this is not actually a programming course, but a crash course to get you started. Hence, in this section (and, to a lesser degree, in the previous one) the functions used and problems shown are not there because they are fundamental or especially important, but because they are one example of the many different uses of the tools in the Python ecosystem.

When we are done with this section, I would judge the crash course a success if you have learned how to start a Jupyter notebook and where to look up documentation for the packages we have used so far. Learning by doing will get you really, really far. Typically, code doesn’t run on the first try - that’s what error messages are for. Your code doesn’t need to be perfect or super efficient. And it’s not just me saying that:

[…] premature optimization is the root of all evil (or at least most of it) in programming

Donald effing Knuth

And with that, we will once again load our default tools:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

9.1. Calibration curve: a polynomial fit

I kinda ran out of nice datasets in the previous section, so for this section, we are mostly looking at data from the “Instru” lab. This first example is an F-AAS determination of copper. The students prepare standards of known copper concentration and establish a calibration curve. For the calibration, each standard is measured in five repeat measurements. The data is stored in an Excel document: the first sheet contains the calibration, the second the sample measurements. Luckily, the calibration data is loaded correctly right away with pandas

9.1.1. Loading data and clean up

Cu_calibration = pd.read_excel("data/Cu.xls", 
                               sheet_name="calibration")
Cu_calibration
concentration [mg/L] Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5
0 0.1 0.001 0.002 0.002 0.001 0.001
1 0.3 0.002 0.002 0.002 0.003 0.003
2 1.0 0.009 0.009 0.009 0.009 0.009
3 3.0 0.028 0.027 0.027 0.028 0.027
4 10.0 0.089 0.088 0.089 0.089 0.089
5 30.0 0.255 0.257 0.253 0.258 0.256

as are the sample measurements

Cu_samples = pd.read_excel("data/Cu.xls", sheet_name="sample")
Cu_samples
Unnamed: 0 Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5
0 Ansatz 1 0.003 0.004 0.004 0.003 0.003
1 Ansatz 2 0.003 0.003 0.003 0.003 0.003
2 Ansatz 3 0.004 0.004 0.004 0.004 0.004
3 Ansatz 4 0.004 0.004 0.004 0.004 0.004

I don’t like that one of the columns is named “Unnamed: 0”. I also don’t like that it contains mostly redundant strings. That can be changed using .apply and a lambda function that splits each string on the space, selects the last substring and converts it to an integer. The result is then stored in the new column “Ansatz”:

Cu_samples["Ansatz"] = Cu_samples["Unnamed: 0"].apply(lambda s: int(s.split()[-1]))
Cu_samples
Unnamed: 0 Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5 Ansatz
0 Ansatz 1 0.003 0.004 0.004 0.003 0.003 1
1 Ansatz 2 0.003 0.003 0.003 0.003 0.003 2
2 Ansatz 3 0.004 0.004 0.004 0.004 0.004 3
3 Ansatz 4 0.004 0.004 0.004 0.004 0.004 4

After checking that this step worked, we can set “Ansatz” as the index and drop the “Unnamed: 0” column.

Cu_samples = Cu_samples.set_index("Ansatz").drop(columns="Unnamed: 0")
Cu_samples
Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5
Ansatz
1 0.003 0.004 0.004 0.003 0.003
2 0.003 0.003 0.003 0.003 0.003
3 0.004 0.004 0.004 0.004 0.004
4 0.004 0.004 0.004 0.004 0.004

To fit our calibration curve we need one vector of x values and one vector of y values. Unfortunately, our calibration dataset is not in the right shape yet:

Cu_calibration
concentration [mg/L] Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5
0 0.1 0.001 0.002 0.002 0.001 0.001
1 0.3 0.002 0.002 0.002 0.003 0.003
2 1.0 0.009 0.009 0.009 0.009 0.009
3 3.0 0.028 0.027 0.027 0.028 0.027
4 10.0 0.089 0.088 0.089 0.089 0.089
5 30.0 0.255 0.257 0.253 0.258 0.256

Fortunately, there is the DataFrame.melt method (which I did not just learn about yesterday, when I was googling “pandas make many columns one column”. I swear.). What .melt does is the following: you tell it to take these columns (the value_vars) and stack them on top of each other in a new column called value_name. The names of the value_vars go into a new column called var_name, so that each row still shows which column its value was taken from:

This code also includes a list comprehension, a short way of writing a for loop that fills a list (a comparison is shown below):
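For illustration, the list of column names passed to value_vars could also be built with an explicit for loop; the comprehension is simply shorter:

# list comprehension
["Repeat {}".format(i) for i in range(1, 6)]

# the equivalent for loop
repeats = []
for i in range(1, 6):
    repeats.append("Repeat {}".format(i))
repeats  # ['Repeat 1', 'Repeat 2', 'Repeat 3', 'Repeat 4', 'Repeat 5']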

Cu_calibration = Cu_calibration.melt(id_vars="concentration [mg/L]", 
                    value_vars=["Repeat {}".format(i) for i in range(1,6)], 
                    var_name="Repeat",
                   value_name="Absorption")
Cu_calibration
concentration [mg/L] Repeat Absorption
0 0.1 Repeat 1 0.001
1 0.3 Repeat 1 0.002
2 1.0 Repeat 1 0.009
3 3.0 Repeat 1 0.028
4 10.0 Repeat 1 0.089
5 30.0 Repeat 1 0.255
6 0.1 Repeat 2 0.002
7 0.3 Repeat 2 0.002
8 1.0 Repeat 2 0.009
9 3.0 Repeat 2 0.027
10 10.0 Repeat 2 0.088
11 30.0 Repeat 2 0.257
12 0.1 Repeat 3 0.002
13 0.3 Repeat 3 0.002
14 1.0 Repeat 3 0.009
15 3.0 Repeat 3 0.027
16 10.0 Repeat 3 0.089
17 30.0 Repeat 3 0.253
18 0.1 Repeat 4 0.001
19 0.3 Repeat 4 0.003
20 1.0 Repeat 4 0.009
21 3.0 Repeat 4 0.028
22 10.0 Repeat 4 0.089
23 30.0 Repeat 4 0.258
24 0.1 Repeat 5 0.001
25 0.3 Repeat 5 0.003
26 1.0 Repeat 5 0.009
27 3.0 Repeat 5 0.027
28 10.0 Repeat 5 0.089
29 30.0 Repeat 5 0.256

We can now plot all measurements (using crosses instead of lines)

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(Cu_calibration["concentration [mg/L]"],
        Cu_calibration["Absorption"], "x")
ax.set_xlabel("concentration [mg/L]")
ax.set_ylabel("Absorption");
[Figure: calibration measurements, Absorption vs. concentration [mg/L]]

The numpy library contains functions for polynomial fitting. Let’s import it (there is also an older polynomial-fitting interface in numpy, np.polyfit, that you will still see in many examples):

from numpy.polynomial import polynomial

polynomial.polyfit returns an array of coefficients. The leftmost coefficient corresponds to the constant offset, the next one is the coefficient of the linear term, and so on.

fit = polynomial.polyfit(Cu_calibration["Absorption"],
                 Cu_calibration["concentration [mg/L]"],deg=1)
fit
array([ -0.12553488, 117.34201998])

We evaluate the polynomial using polynomial.polyval. Note that we fitted the concentration as a function of the absorption, so evaluating the polynomial at an absorption value directly returns a concentration. The array minmax contains the smallest and the largest absorption value of the calibration; we use it to plot the calibration curve on top of the measurements:

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(Cu_calibration["concentration [mg/L]"],
        Cu_calibration["Absorption"], "x",label="measurements")

minmax = np.array([Cu_calibration["Absorption"].min(),Cu_calibration["Absorption"].max()])


ax.plot(polynomial.polyval(minmax, fit), minmax, label="fit")

ax.set_xlabel("concentration [mg/L]")
ax.set_ylabel("Absorption");
ax.legend();
[Figure: calibration measurements with the fitted calibration line]

To use our calibration to quantify the copper amount in each sample, we use .applymap. This method of the DataFrame is similar to .apply, but it applies a function to each element individually (in recent pandas versions this method has been renamed to DataFrame.map):

conc = Cu_samples.applymap(lambda val:polynomial.polyval(val,fit))
conc
Repeat 1 Repeat 2 Repeat 3 Repeat 4 Repeat 5
Ansatz
1 0.226491 0.343833 0.343833 0.226491 0.226491
2 0.226491 0.226491 0.226491 0.226491 0.226491
3 0.343833 0.343833 0.343833 0.343833 0.343833
4 0.343833 0.343833 0.343833 0.343833 0.343833
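To report a single concentration per Ansatz, we might average over the repeats and look at the spread (a suggestion on top of the original analysis, not part of the lab script):

summary = pd.DataFrame({"mean [mg/L]": conc.mean(axis=1),
                        "std [mg/L]": conc.std(axis=1)})
summary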

9.2. Multivariate models

Keeping with the theme of “Instru” datasets, we next look at a multivariate calibration to determine the concentration of sucrose, glucose and fructose in water. The dataset consists of a .csv file containing IR spectra (the first column is the wavenumber axis, the remaining columns are absorption values) and another csv file that contains concentrations of the three sugars.

We load the spectra using numpy (large dataset, all numbers!) and split it into a wavenumber axis and the absorption block:

spec_data = np.genfromtxt("data/sugar_spectra.csv", unpack=True)
wn = spec_data[0]
absorption = spec_data[1:]

For the sugar concentrations we use pandas. index_col makes pandas use the first column as index

sugars = pd.read_csv("data/sugar concentration.csv",index_col=0)

This is what the dataset looks like. Since this measurement was taken on a diamond ATR using water as background, there are several regions in the spectrum showing high noise.

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(wn, absorption.T);
ax.invert_xaxis()
ax.set_xlabel("wavenumber")
ax.set_ylabel("absorption / 1");
[Figure: raw IR spectra of the sugar solutions, absorption vs. wavenumber]

These regions probably won’t be useful to us anyway. Actually - because this is a student lab example - I know that most of the information is between 850 and 1500 wavenumbers. We select that range using boolean indexing. Specifically, we create an array called mask with the same size as our wavenumber axis wn, which is True for all elements where wn is larger than 850 and less than 1500.

mask = (wn>850)&(wn<1500)

Hint

The & here is an element-wise boolean “and”. It applies a boolean and operation to all elements of two arrays (like a + or * applied to arrays). We have also seen “or” | and “not” ~ in the section on numpy arrays. A small demonstration follows below.
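A minimal demonstration of these element-wise operators on a small array:

a = np.array([1, 2, 3, 4])
(a > 1) & (a < 4)   # array([False,  True,  True, False])
~(a > 2)            # array([ True,  True, False, False])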

We use this mask to select a part of our wavenumber and absorptions arrays and store those in new variables called wn_cut and absorption_cut.

absorption_cut = absorption[:,mask]
wn_cut = wn[mask]

This is the set of spectra we will use for the multivariate calibration:

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(wn_cut, absorption_cut.T);
ax.invert_xaxis()
ax.set_xlabel("wavenumber")
ax.set_ylabel("absorption / 1");
[Figure: sugar spectra cut to the 850-1500 wavenumber range]

We start by using the old workhorse of infrared spectroscopists, partial least squares regression (PLS). This type of multivariate regression handles the multicollinearity found in infrared datasets well. In contrast to many other types of regression that would require some pretreatment or variable selection before we chuck in our data, PLS don’t care.

We load the regression from sklearn:

from sklearn.cross_decomposition import PLSRegression

And then the typical pattern for sklearn that we’ve already seen for the PCA: first, instantiate and set the parameters. (By default, scale would be set to True, which would mean that PLS scales the absorptions at each wavenumber by their standard deviation. Usually a good idea when each column in a dataset has a different unit or scale. However, all our columns are absorption values and don’t need to be rescaled.)

pls = PLSRegression(scale=False)

Then use the fit method to fit the dataset. We pass in our cut absorption values as X and the sugar concentrations as y:

pls.fit(absorption_cut, sugars)
PLSRegression(scale=False)

And once the pls object is fit, it can be used to predict. The prediction is returned as a numpy array

prediction_arr = pls.predict(absorption_cut)
prediction_arr
array([[ 5.84846538e+00,  2.96571538e+00,  3.23604482e+00],
       [ 7.09753198e+00,  9.32157352e+00,  8.41382842e+00],
       [ 1.83520949e+01,  2.49563678e+00,  7.29927395e+00],
       [ 3.47357995e+00,  1.50973216e+01,  1.14391758e+01],
       [ 2.10969760e+01,  6.34315557e+00,  1.11356432e+01],
       [ 1.66167271e+01,  1.24629760e+01,  1.41151621e+01],
       [ 5.50820299e+00,  2.26918564e+01,  1.78174065e+01],
       [ 2.05124497e+01,  2.02876543e+01,  2.13218503e+01],
       [ 5.27577992e+00,  3.51926832e+00,  3.44644769e+00],
       [ 9.75463973e+00,  1.13941190e+01,  1.08963658e+01],
       [ 2.34532498e+01, -3.35485683e-01,  6.98992018e+00],
       [ 1.71309645e+00,  1.54462596e+01,  1.10778110e+01],
       [ 5.84787152e+00,  2.43604838e+01,  1.91808958e+01],
       [ 2.55996999e+01,  8.27659333e+00,  1.41659778e+01],
       [ 1.61952237e+01,  1.17262548e+01,  1.34173161e+01],
       [ 1.93057331e+01,  2.08971943e+01,  2.13501770e+01],
       [ 1.41059197e-01,  2.19290643e+00,  6.45469615e-01],
       [ 5.27032709e+00,  1.02199989e+01,  8.43843324e+00],
       [ 1.86094954e+01, -1.51324738e+00,  4.40239682e+00],
       [-2.68873605e+00,  1.34280145e+01,  8.01988438e+00],
       [ 2.08183156e+01,  7.04021603e+00,  1.15567851e+01],
       [ 7.39482240e-01,  2.08904293e+01,  1.47915643e+01],
       [ 1.52687699e+01,  9.95209166e+00,  1.17680480e+01],
       [ 1.76754254e+01,  1.79068083e+01,  1.85460357e+01],
       [ 8.06807905e-01,  2.39796012e+00,  1.03328969e+00],
       [ 4.08557982e+00,  1.01068547e+01,  7.93591368e+00],
       [ 1.56284955e+01,  2.24204779e-02,  4.49465476e+00],
       [-2.01947661e+00,  1.41317958e+01,  8.78063460e+00],
       [ 1.94102245e+01,  6.69195159e+00,  1.08001988e+01],
       [ 1.42394590e+00,  2.21376228e+01,  1.59626748e+01],
       [ 1.44628073e+01,  1.03754253e+01,  1.17990585e+01],
       [ 1.73849548e+01,  1.80481744e+01,  1.85488616e+01]])

We convert that into a DataFrame with the same columns and index as our training data.

prediction = pd.DataFrame(prediction_arr, 
                          columns=sugars.columns, 
                          index=sugars.index)
prediction
Fructose Sucrose Glucose
0 5.848465 2.965715 3.236045
1 7.097532 9.321574 8.413828
2 18.352095 2.495637 7.299274
3 3.473580 15.097322 11.439176
4 21.096976 6.343156 11.135643
5 16.616727 12.462976 14.115162
6 5.508203 22.691856 17.817406
7 20.512450 20.287654 21.321850
8 5.275780 3.519268 3.446448
9 9.754640 11.394119 10.896366
10 23.453250 -0.335486 6.989920
11 1.713096 15.446260 11.077811
12 5.847872 24.360484 19.180896
13 25.599700 8.276593 14.165978
14 16.195224 11.726255 13.417316
15 19.305733 20.897194 21.350177
16 0.141059 2.192906 0.645470
17 5.270327 10.219999 8.438433
18 18.609495 -1.513247 4.402397
19 -2.688736 13.428015 8.019884
20 20.818316 7.040216 11.556785
21 0.739482 20.890429 14.791564
22 15.268770 9.952092 11.768048
23 17.675425 17.906808 18.546036
24 0.806808 2.397960 1.033290
25 4.085580 10.106855 7.935914
26 15.628495 0.022420 4.494655
27 -2.019477 14.131796 8.780635
28 19.410225 6.691952 10.800199
29 1.423946 22.137623 15.962675
30 14.462807 10.375425 11.799058
31 17.384955 18.048174 18.548862

However, when we compare the predicted values to the actual concentrations, the result doesn’t look great. There are large differences between the two; sometimes the deviation is three or four times the size of the actual value:

prediction - sugars
Fructose Sucrose Glucose
0 3.845265 0.965715 1.236045
1 5.094332 7.321574 -11.586172
2 -1.679905 0.495637 5.299274
3 1.470380 -4.902678 9.439176
4 1.064976 4.343156 -8.864357
5 -3.415273 -7.537024 12.115162
6 3.505003 2.691856 -2.182594
7 0.480450 0.287654 1.321850
8 3.275780 1.535268 1.432448
9 7.754640 9.410119 -9.243634
10 3.453250 -2.319486 4.975920
11 -0.286904 -4.393740 9.063811
12 3.847872 4.520484 -0.959104
13 5.599700 6.292593 -5.974022
14 -3.804776 -8.113745 11.403316
15 -0.694267 1.057194 1.210177
16 -1.858941 0.192906 -1.354530
17 3.270327 8.219999 -11.561567
18 -1.390505 -3.513247 2.402397
19 -4.688736 -6.571985 6.019884
20 0.818316 5.040216 -8.443215
21 -1.260518 0.890429 -5.208436
22 -4.731230 -10.047908 9.768048
23 -2.324575 -2.093192 -1.453964
24 -1.205192 0.404960 -0.971510
25 2.073580 8.113855 -12.112086
26 -4.491505 -1.970580 2.489855
27 -4.031477 -5.796204 6.775835
28 -0.709775 4.698952 -9.247801
29 -0.588054 2.209623 -4.085325
30 -5.657193 -9.552575 9.794258
31 -2.735045 -1.879826 -1.499138

Fortunately, the PLS has one parameter that we can optimize to get a better fit: the number of components (components here mean orthogonal directions in the original dataset, akin to the components in the PCA). More components allow closer modeling of the data.

To see how the number of components improves how well the PLS model predicts sugar concentrations, we need a better way to check the quality of the model than looking at a table of all errors. sklearn provides an assortment of metrics that do just that: they give a single value that states how close the predictions are to the actual values. We are using the mean_squared_error here. A mean_squared_error of 0 describes a perfect fit.

from sklearn.metrics import mean_squared_error
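As a quick sanity check, we could compute the training-set error of the two-component PLS we fitted above (by default, mean_squared_error averages over all samples and all three sugars):

mean_squared_error(sugars, pls.predict(absorption_cut))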

When we want to evaluate and improve the quality of a model, there is one more thing we need to understand: if we use the metric to evaluate how well our model predicts the data we trained it on (the training set), we will tend to end up with a model that is really good at predicting values from the training set but miserable for all other data. We get a more realistic idea of the performance of our model if we split the dataset into two parts and use only one of them to train and the other to test the performance. This is called cross validation. sklearn provides several methods for splitting the dataset in sklearn.model_selection. We will use KFold. This strategy splits the dataset into \(k\) batches. In each iteration, one of the batches is used to test the model, and the others are used to train.

from sklearn.model_selection import KFold
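To get a feeling for what KFold produces, we can look at the size of each split (the exact indices depend on shuffle and random_state):

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kfold.split(absorption_cut):
    # each of the 5 iterations holds out a different part of the 32 spectra
    print(len(train_index), "training spectra,", len(test_index), "test spectra")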

Let’s see how the train and test set metrics change as we increase the number of components. We create two lists to hold the data:

metric_train = []
metric_test = []

To test one number of components, we need to iterate over all \(k\) splits, fit the model each time and determine the score. We will package that in a function that takes the number of components and the X and y matrices as input and performs a 5-fold cross validation.

def cv_components(n_components, X, y):
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    pls = PLSRegression(n_components=n_components, scale=False)
    train = []
    test = []
    for train_index, test_index in kfold.split(X):
        X_train, y_train = X[train_index], y.loc[train_index]
        X_test, y_test = X[test_index], y.loc[test_index]
        pls = pls.fit(X_train,y_train)
        train.append(mean_squared_error(y_train, 
                              pls.predict(X_train)))
        test.append(mean_squared_error(y_test, 
                              pls.predict(X_test)))
    return train, test

We loop the function over n_components between 1 and 10 and capture the results in the lists we created before:

all_n_components = list(range(1,11))
for n_components in all_n_components:
    train, test = cv_components(n_components, 
                                absorption_cut, 
                                sugars)
    metric_train.append(train)
    metric_test.append(test)
    

To make them easier to work with, the lists are now converted to numpy arrays:

metric_train = np.array(metric_train)
metric_test = np.array(metric_test)    

And finally, we can plot the mean values over the five splits:

    
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(all_n_components,  np.mean(metric_train,axis=1), label="train")
ax.plot(all_n_components,  np.mean(metric_test,axis=1), label="test")
plt.legend()
ax.set_xlabel("components")
ax.set_ylabel(r"mean squared error");
[Figure: mean squared error vs. number of components for train and test set]

As we can see, the metrics for training and test set improve rapidly at the beginning. However, at around 5-7 components the error for the test set actually starts to creep back up a bit (this is the point where our model starts to get worse at generalizing to data it hasn’t seen yet). In any case, a PLS with 7 components will perform much better than what we had before.
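Instead of reading the optimum off the plot, we could also let numpy pick the number of components with the smallest mean test error:

mean_test = np.mean(metric_test, axis=1)
best_n = all_n_components[int(np.argmin(mean_test))]
best_n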

This procedure of optimizing a parameter of the model, performing cross validation and selecting the best model is also available in the form of GridSearchCV in sklearn:

from sklearn.model_selection import GridSearchCV

GridSearchCV takes a model as input, then a dictionary param_grid that specifies all values of the parameters that should be tested (in our case only “n_components”), a scorer, and a cross validation strategy. A scorer is just a metric with a bit of a wrapper around it. We can convert the mean_squared_error into a scorer using make_scorer:

from sklearn.metrics import make_scorer
scorer = make_scorer(mean_squared_error, greater_is_better=False)

Then we select the cross validation strategy - KFold

cv = KFold(n_splits=5, shuffle=True, random_state=42)

And create an instance of GridSearchCV

pls = PLSRegression(scale=False)

gcv = GridSearchCV(pls, 
             param_grid={"n_components":list(range(1,21))}, 
             scoring=scorer,
            cv=cv)
gcv
GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
             estimator=PLSRegression(scale=False),
             param_grid={'n_components': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                          13, 14, 15, 16, 17, 18, 19, 20]},
             scoring=make_scorer(mean_squared_error, greater_is_better=False))

Finally, we use gcv to .fit our dataset

gcv.fit(absorption_cut, y=sugars)
GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
             estimator=PLSRegression(scale=False),
             param_grid={'n_components': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                          13, 14, 15, 16, 17, 18, 19, 20]},
             scoring=make_scorer(mean_squared_error, greater_is_better=False))

The best model is available as gcv.best_estimator_:

gcv.best_estimator_
PLSRegression(n_components=5, scale=False)

The best estimator for our dataset has 5 components.
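The optimal parameter values and the corresponding score are also available directly (the score has its sign flipped because we built the scorer with greater_is_better=False):

gcv.best_params_    # {'n_components': 5}
-gcv.best_score_    # mean squared error of the best candidate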

It often makes sense to check all the cross validation results to see if the “best” estimator is the result of some fluke or a general trend. We convert the gcv.cv_results_ into a DataFrame to make them easier to handle:

gcv_results = pd.DataFrame(gcv.cv_results_)

gcv_results
mean_fit_time std_fit_time mean_score_time std_score_time param_n_components params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.002977 0.000655 0.001625 0.000037 1 {'n_components': 1} -48.545094 -58.066011 -63.409419 -64.216270 -63.156637 -59.478686 5.881527 20
1 0.003228 0.000183 0.001594 0.000008 2 {'n_components': 2} -41.332714 -38.451381 -32.150809 -31.454395 -29.110431 -34.499946 4.607413 19
2 0.003567 0.000159 0.001632 0.000022 3 {'n_components': 3} -13.463327 -7.979911 -6.900035 -4.014426 -4.065309 -7.284601 3.460228 18
3 0.003835 0.000181 0.012798 0.022345 4 {'n_components': 4} -1.130712 -3.307423 -4.412787 -1.399914 -2.296724 -2.509512 1.219565 9
4 0.004169 0.000149 0.001688 0.000034 5 {'n_components': 5} -0.844244 -1.778812 -2.841950 -0.786899 -1.376876 -1.525756 0.752188 1
5 0.005023 0.000307 0.001706 0.000081 6 {'n_components': 6} -0.969771 -2.036483 -2.807773 -0.936962 -2.149403 -1.780078 0.724671 2
6 0.005409 0.000305 0.001699 0.000046 7 {'n_components': 7} -1.063460 -2.363377 -3.053623 -1.327435 -2.696548 -2.100889 0.775358 3
7 0.006141 0.000492 0.001693 0.000035 8 {'n_components': 8} -1.231006 -2.367161 -3.142022 -1.542247 -2.917713 -2.240030 0.747534 4
8 0.006612 0.000386 0.001682 0.000020 9 {'n_components': 9} -1.356007 -2.408202 -3.131088 -1.825171 -2.944411 -2.332976 0.667657 5
9 0.007283 0.000601 0.001731 0.000027 10 {'n_components': 10} -1.446502 -2.416616 -3.179046 -1.672533 -3.370647 -2.417069 0.772820 6
10 0.008347 0.000630 0.001802 0.000045 11 {'n_components': 11} -1.425645 -2.390100 -3.270158 -1.766760 -3.571948 -2.484922 0.830065 7
11 0.008866 0.000470 0.001790 0.000038 12 {'n_components': 12} -1.415156 -2.454737 -3.275856 -1.784956 -3.575774 -2.501296 0.830597 8
12 0.009851 0.000834 0.001744 0.000024 13 {'n_components': 13} -1.449223 -2.492004 -3.318894 -1.796965 -3.627127 -2.536843 0.840556 10
13 0.010429 0.000736 0.001800 0.000052 14 {'n_components': 14} -1.438591 -2.501979 -3.376938 -1.795340 -3.581789 -2.538927 0.843190 11
14 0.011031 0.000882 0.001842 0.000080 15 {'n_components': 15} -1.422812 -2.496264 -3.394167 -1.838856 -3.630443 -2.556508 0.855445 12
15 0.013136 0.002752 0.002207 0.000499 16 {'n_components': 16} -1.415170 -2.512702 -3.378424 -1.883139 -3.618698 -2.561627 0.843998 13
16 0.012654 0.001259 0.001853 0.000103 17 {'n_components': 17} -1.413881 -2.520264 -3.391660 -1.871239 -3.620335 -2.563476 0.849172 14
17 0.012361 0.000993 0.001764 0.000026 18 {'n_components': 18} -1.413481 -2.525459 -3.406376 -1.858064 -3.635066 -2.567689 0.857932 17
18 0.013039 0.001267 0.001728 0.000038 19 {'n_components': 19} -1.419565 -2.516107 -3.398196 -1.855727 -3.647709 -2.567461 0.858362 16
19 0.013483 0.001162 0.001793 0.000031 20 {'n_components': 20} -1.415083 -2.513809 -3.400481 -1.851188 -3.654173 -2.566947 0.862411 15

The plot of the test score below looks identical to the one we produced before. I used an .errorbar plot to also show the standard deviation of the scores. Overall, it looks like the chosen number of components is reasonable: the scores do have a minimum at that point.

fig = plt.figure()
ax = fig.add_subplot()
ax.errorbar(gcv_results["param_n_components"], -gcv_results["mean_test_score"], gcv_results["std_test_score"]);
[Figure: mean test score vs. number of components with standard deviations as error bars]

We can also check more than one parameter with GridSearchCV. We just add that parameter to the param_grid dictionary. Maybe we want to check if scaling actually improves the results in our case:

gcv = GridSearchCV(pls, 
             param_grid={"n_components":list(range(1,21)), "scale":[True, False]}, 
             scoring=scorer,
            cv=cv)
gcv.fit(absorption_cut, sugars)
gcv.best_estimator_
PLSRegression(n_components=5, scale=False)

And the answer is “no”, the best estimator is still the one with 5 components and no scaling.

We will have another look at the cv_results_ from this new run.

gcv_results = pd.DataFrame(gcv.cv_results_)
gcv_results
mean_fit_time std_fit_time mean_score_time std_score_time param_n_components param_scale params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.003719 0.000669 0.002001 0.000359 1 True {'n_components': 1, 'scale': True} -54.698458 -62.046749 -68.166944 -70.095853 -69.120858 -64.825772 5.790862 40
1 0.002519 0.000029 0.001579 0.000033 1 False {'n_components': 1, 'scale': False} -48.545094 -58.066011 -63.409419 -64.216270 -63.156637 -59.478686 5.881527 39
2 0.003284 0.000208 0.001589 0.000022 2 True {'n_components': 2, 'scale': True} -52.097918 -54.574401 -55.639906 -39.294977 -35.817363 -47.484913 8.261372 38
3 0.003182 0.000216 0.001634 0.000017 2 False {'n_components': 2, 'scale': False} -41.332714 -38.451381 -32.150809 -31.454395 -29.110431 -34.499946 4.607413 36
4 0.004255 0.000353 0.001655 0.000023 3 True {'n_components': 3, 'scale': True} -53.143825 -34.523533 -37.614435 -26.322474 -30.598067 -36.440467 9.171286 37
5 0.003607 0.000137 0.001702 0.000073 3 False {'n_components': 3, 'scale': False} -13.463327 -7.979911 -6.900035 -4.014426 -4.065309 -7.284601 3.460228 35
6 0.004655 0.000423 0.001712 0.000087 4 True {'n_components': 4, 'scale': True} -1.724367 -4.231388 -5.844991 -3.397973 -3.141838 -3.668111 1.355298 34
7 0.003961 0.000223 0.001698 0.000035 4 False {'n_components': 4, 'scale': False} -1.130712 -3.307423 -4.412787 -1.399914 -2.296724 -2.509512 1.219565 9
8 0.005026 0.000368 0.001770 0.000036 5 True {'n_components': 5, 'scale': True} -1.568588 -2.471745 -4.012460 -3.293607 -2.497108 -2.768701 0.827609 21
9 0.004258 0.000230 0.001718 0.000039 5 False {'n_components': 5, 'scale': False} -0.844244 -1.778812 -2.841950 -0.786899 -1.376876 -1.525756 0.752188 1
10 0.005596 0.000355 0.001758 0.000048 6 True {'n_components': 6, 'scale': True} -1.377297 -2.398350 -3.862859 -3.046570 -3.051359 -2.747287 0.827667 20
11 0.005504 0.000949 0.001860 0.000118 6 False {'n_components': 6, 'scale': False} -0.969771 -2.036483 -2.807773 -0.936962 -2.149403 -1.780078 0.724671 2
12 0.005986 0.000280 0.001778 0.000025 7 True {'n_components': 7, 'scale': True} -1.239088 -2.421301 -3.558917 -2.821641 -3.215091 -2.651208 0.802233 18
13 0.005585 0.000299 0.001828 0.000059 7 False {'n_components': 7, 'scale': False} -1.063460 -2.363377 -3.053623 -1.327435 -2.696548 -2.100889 0.775358 3
14 0.006611 0.000250 0.001914 0.000054 8 True {'n_components': 8, 'scale': True} -1.317916 -2.397072 -3.718937 -2.707687 -3.594688 -2.747260 0.875159 19
15 0.006402 0.000414 0.001912 0.000081 8 False {'n_components': 8, 'scale': False} -1.231006 -2.367161 -3.142022 -1.542247 -2.917713 -2.240030 0.747534 4
16 0.007335 0.000502 0.001946 0.000064 9 True {'n_components': 9, 'scale': True} -1.382980 -2.490851 -3.625133 -2.668780 -3.707565 -2.775062 0.851134 22
17 0.006888 0.000517 0.001920 0.000056 9 False {'n_components': 9, 'scale': False} -1.356007 -2.408202 -3.131088 -1.825171 -2.944411 -2.332976 0.667657 5
18 0.008268 0.000373 0.002009 0.000022 10 True {'n_components': 10, 'scale': True} -1.440215 -2.685522 -3.853001 -2.470417 -3.491591 -2.788149 0.843464 23
19 0.007590 0.000570 0.001991 0.000044 10 False {'n_components': 10, 'scale': False} -1.446502 -2.416616 -3.179046 -1.672533 -3.370647 -2.417069 0.772820 6
20 0.008992 0.000209 0.002035 0.000050 11 True {'n_components': 11, 'scale': True} -1.257625 -2.834232 -3.788406 -2.391889 -3.684424 -2.791315 0.927946 24
21 0.008591 0.000637 0.002030 0.000058 11 False {'n_components': 11, 'scale': False} -1.425645 -2.390100 -3.270158 -1.766760 -3.571948 -2.484922 0.830065 7
22 0.009493 0.000453 0.002081 0.000080 12 True {'n_components': 12, 'scale': True} -1.517545 -2.842888 -3.724798 -2.398101 -3.774634 -2.851593 0.848517 25
23 0.009092 0.000518 0.002018 0.000040 12 False {'n_components': 12, 'scale': False} -1.415156 -2.454737 -3.275856 -1.784956 -3.575774 -2.501296 0.830597 8
24 0.010039 0.000542 0.002088 0.000030 13 True {'n_components': 13, 'scale': True} -1.572594 -2.964220 -3.831789 -2.473844 -3.831157 -2.934721 0.857555 27
25 0.010022 0.000788 0.001940 0.000114 13 False {'n_components': 13, 'scale': False} -1.449223 -2.492004 -3.318894 -1.796965 -3.627127 -2.536843 0.840556 10
26 0.010459 0.000482 0.002074 0.000032 14 True {'n_components': 14, 'scale': True} -1.595299 -2.988451 -3.941710 -2.267406 -3.885289 -2.935631 0.912124 28
27 0.010518 0.000848 0.001929 0.000117 14 False {'n_components': 14, 'scale': False} -1.438591 -2.501979 -3.376938 -1.795340 -3.581789 -2.538927 0.843190 11
28 0.010949 0.000548 0.001839 0.000035 15 True {'n_components': 15, 'scale': True} -1.564222 -3.043631 -3.818043 -2.285135 -3.944452 -2.931097 0.906845 26
29 0.010983 0.000893 0.001870 0.000068 15 False {'n_components': 15, 'scale': False} -1.422812 -2.496264 -3.394167 -1.838856 -3.630443 -2.556508 0.855445 12
30 0.012034 0.000833 0.002086 0.000207 16 True {'n_components': 16, 'scale': True} -1.533037 -3.078306 -3.825852 -2.343480 -4.016826 -2.959500 0.927027 29
31 0.012009 0.000846 0.002107 0.000080 16 False {'n_components': 16, 'scale': False} -1.415170 -2.512702 -3.378424 -1.883139 -3.618698 -2.561627 0.843998 13
32 0.012871 0.000914 0.002168 0.000008 17 True {'n_components': 17, 'scale': True} -1.524799 -3.075210 -3.900531 -2.309325 -4.049744 -2.971922 0.955812 30
33 0.012874 0.001595 0.002062 0.000148 17 False {'n_components': 17, 'scale': False} -1.413881 -2.520264 -3.391660 -1.871239 -3.620335 -2.563476 0.849172 14
34 0.012565 0.000658 0.001923 0.000071 18 True {'n_components': 18, 'scale': True} -1.557934 -3.080677 -3.920539 -2.312594 -4.031202 -2.980589 0.945249 31
35 0.012578 0.001085 0.001923 0.000040 18 False {'n_components': 18, 'scale': False} -1.413481 -2.525459 -3.406376 -1.858064 -3.635066 -2.567689 0.857932 17
36 0.013247 0.000744 0.001900 0.000033 19 True {'n_components': 19, 'scale': True} -1.548109 -3.085797 -3.952162 -2.307434 -4.029778 -2.984656 0.955079 32
37 0.013405 0.001026 0.002138 0.000442 19 False {'n_components': 19, 'scale': False} -1.419565 -2.516107 -3.398196 -1.855727 -3.647709 -2.567461 0.858362 16
38 0.013578 0.000616 0.001901 0.000020 20 True {'n_components': 20, 'scale': True} -1.531816 -3.086763 -3.959442 -2.319961 -4.051912 -2.989979 0.964582 33
39 0.013665 0.001188 0.001830 0.000047 20 False {'n_components': 20, 'scale': False} -1.415083 -2.513809 -3.400481 -1.851188 -3.654173 -2.566947 0.862411 15

A new column has been added that contains the scale parameter. We can use boolean indexing on the DataFrame to grab all rows where scaling was on or off:

fig = plt.figure()
ax = fig.add_subplot()
scale_true = gcv_results[gcv_results["param_scale"]==True]
ax.errorbar(scale_true["param_n_components"], 
            -scale_true["mean_test_score"], 
            scale_true["std_test_score"], label="Scaling on");
scale_false = gcv_results[gcv_results["param_scale"]==False]
ax.errorbar(scale_false["param_n_components"], 
            -scale_false["mean_test_score"], 
            scale_false["std_test_score"], label="Scaling off");
ax.legend()
<matplotlib.legend.Legend at 0x7f1ded954b20>
[Figure: mean test score vs. number of components, scaling on vs. scaling off]

The scaled models consistently perform worse than the unscaled ones - but we knew that already.

sklearn has excellent documentation, including a very approachable user guide that explains and compares the algorithms and how to use them. You can find it at https://scikit-learn.org/stable/user_guide.html and I can only recommend scrolling through it.

There is one more tool in sklearn that I want to specifically point out here, though: the pipeline. A pipeline chains together preprocessing and fitting in a single object. From the outside, the pipeline looks like any other sklearn model: it has the .fit and .predict methods and can be optimized using GridSearchCV. But whenever you call either of these methods, it will not only fit/predict using the final step (the actual fitting algorithm) but also fit or apply the preprocessing steps.

That is important, because preprocessing, when done incorrectly, can actually defeat the purpose of cross validation through an effect called “leakage”: whenever preprocessing uses information from multiple rows of your X matrix (e.g. to scale by the standard deviation of a column), it must only be fitted on the training data and then applied to the test data during cross validation. Otherwise, the fitting algorithm gets (indirect) information about your test set as well, and your cross validation score will look better than it should.

We can build a pipeline using the make_pipeline function:

from sklearn.pipeline import make_pipeline

Since PLS didn’t need preprocessing, we will demonstrate the use of a pipeline with another regressor, LASSO. LASSO performs a least squares fit but tries to use as few variables as possible. We will preprocess the dataset using a PCA, so that instead of fitting absorptions at individual wavenumbers, we will be fitting components. (The PCA in sklearn also mean centers the data before the decomposition, so we don’t need a separate centering step.)

We import PCA and LASSO

from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

and then make the pipeline

pca = PCA()
lasso = Lasso()

lasso_pipe = make_pipeline(pca, lasso)
lasso_pipe
Pipeline(steps=[('pca', PCA()), ('lasso', Lasso())])

The parameter we want to optimize here is the regularization parameter \(\alpha\). Higher values of \(\alpha\) mean that LASSO will use fewer variables/columns, lower values mean it will use more. With \(\alpha=0\) it uses all of them.

In the pipeline, the parameters of each step are accessible from the top level. To make each parameter unique (in case two steps have parameters with the same name), they are prefixed with the name of the step and two underscores __. The .get_params method shows all parameters of the pipe. \(\alpha\) is now called lasso__alpha.

lasso_pipe.get_params()
{'memory': None,
 'steps': [('pca', PCA()), ('lasso', Lasso())],
 'verbose': False,
 'pca': PCA(),
 'lasso': Lasso(),
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': None,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'lasso__alpha': 1.0,
 'lasso__copy_X': True,
 'lasso__fit_intercept': True,
 'lasso__max_iter': 1000,
 'lasso__normalize': False,
 'lasso__positive': False,
 'lasso__precompute': False,
 'lasso__random_state': None,
 'lasso__selection': 'cyclic',
 'lasso__tol': 0.0001,
 'lasso__warm_start': False}
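These prefixed names can also be used to change a parameter directly on the pipeline, for example (just to illustrate the naming scheme; the GridSearchCV below sets this parameter itself for every candidate anyway):

lasso_pipe.set_params(lasso__alpha=0.01)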

We create a GridSearchCV object for lasso_pipe that searches for the optimum lasso__alpha. The values of alpha will be logarithmically spaced between \(10^{-5}\) and \(10^{-2}\):

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(mean_squared_error, greater_is_better=False)

lasso_gcv = GridSearchCV(lasso_pipe,
                        param_grid={"lasso__alpha":10**np.linspace(-5,-2,100)},
                        cv=cv,
                        scoring=scorer)

And then fit it:

lasso_gcv.fit(absorption_cut, sugars)
GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('pca', PCA()), ('lasso', Lasso())]),
             param_grid={'lasso__alpha': array([1.00000000e-05, 1.07226722e-05, 1.14975700e-05, 1.23284674e-05,
       1.32194115e-05, 1.41747416e-05, 1.51991108e-05, 1.62975083e-05,
       1.74752840e-05, 1.87381742e-05, 2.00923300e-05, 2.15443469e-05,
       2.31012970e-0...
       2.65608778e-03, 2.84803587e-03, 3.05385551e-03, 3.27454916e-03,
       3.51119173e-03, 3.76493581e-03, 4.03701726e-03, 4.32876128e-03,
       4.64158883e-03, 4.97702356e-03, 5.33669923e-03, 5.72236766e-03,
       6.13590727e-03, 6.57933225e-03, 7.05480231e-03, 7.56463328e-03,
       8.11130831e-03, 8.69749003e-03, 9.32603347e-03, 1.00000000e-02])},
             scoring=make_scorer(mean_squared_error, greater_is_better=False))

and generate the results DataFrame

gcv_results = pd.DataFrame(lasso_gcv.cv_results_)

The best estimator looks as follows:

lasso_gcv.best_estimator_
Pipeline(steps=[('pca', PCA()), ('lasso', Lasso(alpha=0.00026560877829466864))])

And it is right at the minimum of the dip in cross validation scores. Looks reasonable.

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(gcv_results["param_lasso__alpha"], 
            -gcv_results["mean_test_score"]);
ax.axvline(lasso_gcv.best_estimator_.get_params()["lasso__alpha"], color="black")
ax.set_xscale("log")
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("mean square error");
[Figure: mean square error vs. alpha (log scale) with the best alpha marked]

Briefly, I want to show how leakage would affect the results in this example. We perform another grid search, but this time without PCA in the pipeline. Instead, we apply PCA to the whole dataset at once, before cross validation

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(mean_squared_error, greater_is_better=False)

pca=PCA()
lasso = Lasso()

lasso_leakage_gcv = GridSearchCV(lasso,
                        param_grid={"alpha":10**np.linspace(-5,-2,100)},
                        cv=cv,
                        scoring=scorer)
lasso_leakage_gcv.fit(pca.fit_transform(absorption_cut), sugars)

gcv_results_leakage = pd.DataFrame(lasso_leakage_gcv.cv_results_)

As the following plot shows, the incorrectly performed cross validation gives a higher \( \alpha\) and a better optimum score.

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(gcv_results["param_lasso__alpha"], 
            -gcv_results["mean_test_score"], label="pipeline");
ax.plot(gcv_results_leakage["param_alpha"], 
            -gcv_results_leakage["mean_test_score"], label="leakage");
ax.set_xscale("log")
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("mean square error");
plt.legend();
[Figure: mean square error vs. alpha for the pipeline and the leakage variant]

9.3. Unsupervised learning

Unsupervised learning encompasses methods that can find structure in the data by themselves. We give them a dataset and they find, for example, spectra that look similar or samples that have similar compositions.

The dataset I am using here is an image of different polymers taken with an IR microscope. It has 64x64 pixels, each of which has an FTIR spectrum assigned to it. (This example might not run on the jupyterlab server due to the large amount of memory it needs).

bead = np.genfromtxt("data/bead1.dpt")
wn = bead[0]
A = bead[1:]

64x64 pixels is small for an image; nevertheless, we are already dealing with more than 4000 spectra. Plotting all of them is already a challenge, and we don’t really see much in the dataset either:

fig = plt.figure()
ax_specs = fig.add_subplot()
ax_specs.plot(wn, A[::10].T, color="blue", alpha=.1, linewidth=.1);
ax_specs.invert_xaxis()
ax_specs.set_xlabel("wavenumber")
ax_specs.set_ylabel("ATR absorption");
[Figure: all spectra of the FTIR image, ATR absorption vs. wavenumber]

However, at least this plot shows us that we can remove the parts of the spectrum between 1900 and 2700 wavenumbers, above 3200 and below 950. We use element-wise boolean operators to create a mask of all wavenumbers we want to keep

mask = ~((wn>1900)&(wn<2700)|(wn>3200)|(wn<950))

and then cut down wn and A (we are overwriting the variables here to save some memory)

wn = wn[mask]
A = A[:,mask]
fig = plt.figure()
ax_specs = fig.add_subplot()
ax_specs.plot(wn, A[::10].T, color="blue", alpha=.1, linewidth=.1);
ax_specs.invert_xaxis()
ax_specs.set_xlabel("wavenumber")
ax_specs.set_ylabel("ATR absorption");
[Figure: FTIR image spectra after cutting the noisy regions]

The algorithm I will use here is called hierarchical cluster analysis (HCA) or agglomerative clustering. It takes the dataset and starts by looking for the two spectra that are closest together. The distance between spectra here means (for example) the Euclidean norm, i.e. how we learned to calculate the distance between two points in high school.

The two closest points are then put into a “cluster” and replaced in the dataset by a single average spectrum. Then the algorithm looks for the next two closest spectra in this new dataset. That might include the new cluster or not. And so on, until finally the last two remaining clusters have been linked.

The output of this algorithm is a “dendrogram” (think origin of species) that shows how spectra were linked and what their distances were.
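To make “distance” concrete: for two spectra the Euclidean norm is just the square root of the summed squared differences, which numpy computes for us:

# Euclidean distance between the first two spectra of the image
np.linalg.norm(A[0] - A[1])

# the same thing written out
np.sqrt(np.sum((A[0] - A[1])**2))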

First, let’s import the algorithm

from sklearn.cluster import AgglomerativeClustering

We create and fit the model - as we have been doing for PLS and LASSO:

model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(A)

To plot the dendrogram, I am using one of the code examples from sklearn (with slight modifications). The dendrogram below doesn’t show all clusters, only the ones furthest apart. The numbers along the x-axis show how many spectra are contained in each cluster point:

from scipy.cluster.hierarchy import dendrogram


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


fig_dendro = plt.figure()
ax = fig_dendro.add_subplot()


plot_dendrogram(model, truncate_mode='level', p=3, ax=ax, leaf_rotation=30)
ax.set_xlabel("Number of points in node (or index of point if no parenthesis).");
ax.set_ylabel("distance");
[Figure: dendrogram of the agglomerative clustering]

The dendrogram shows that there are three quite distant clusters (their horizontal bars are quite high up). We can now change the model to use only 3 clusters and call .fit_predict instead of just .fit to get the cluster number for each spectrum:

model = AgglomerativeClustering(n_clusters=3)
cluster = model.fit_predict(A)

cluster is a 1D vector that is 64x64 elements long and contains 0, 1 or 2 according to the cluster number of each spectrum. We can use this vector and boolean indexing to get the average spectrum for each cluster.

np.unique returns the unique entries in an array. We could have typed those values in by hand, but this way the script still works in case we change the number of clusters above.

fig = plt.figure()
ax = fig.add_subplot()

for c in np.unique(cluster):
    ax.plot(wn, np.mean(A[cluster==c], axis=0), label="{}".format(c))

ax.legend()
ax.set_xlabel("wavenumber")
ax.invert_xaxis()
[Figure: mean spectrum of each cluster]

Since this is an image, we can also show the distribution of clusters as an image. Here, we can use imshow again.

First, we need to reshape the cluster vector to have a rectangular shape:

img = np.reshape(cluster,(64,64))
img
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 2, 0, 0],
       ...,
       [1, 1, 1, ..., 2, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 2, 0, 2]])

Now we are ready to plot:

fig = plt.figure()
ax = fig.add_subplot()
mappable=ax.imshow(img);
fig.colorbar(mappable)
<matplotlib.colorbar.Colorbar at 0x7f1da776df70>
[Figure: cluster image with colorbar]

We can already see a pretty nice structure in this image. It would be nice to make the relationship between clusters in the plot and in the image a bit more obvious though. We will change the color map to just have three levels that correspond to the line colors in the plot. And, to make the relationship very obvious, we will also combine both plots in the same figure.

For each plotted line, we get the color that was used and put it in a list. Then we use those colors to create a color map.

from matplotlib.colors import ListedColormap


fig = plt.figure()
ax_spec = fig.add_subplot()
ax_img = ax_spec.inset_axes([0,.5, .5,.5])

# create list to hold lines
lines = []

# we reuse this code
for c in np.unique(cluster):
    lines.extend(ax_spec.plot(wn, np.mean(A[cluster==c], axis=0), 
                              label="cluster {}".format(c)))

# list comprehension to get a list of colors
colors = [l.get_color() for l in lines]


cm = ListedColormap(colors)

mappable=ax_img.imshow(img, cmap=cm, vmin=-.5, vmax=np.amax(img)+.5);


# anchor="NW" moves the image to the top left corner
ax_img.set_aspect(1, anchor="NW")
ax_img.set_axis_off()

ax_spec.legend(loc='upper center')
ax_spec.set_xlabel("wavenumber")
ax_spec.invert_xaxis()
[Figure: cluster mean spectra with the cluster image as inset]

Looking at this image more closely, it seems a bit weird that all the green areas are enclosed by orange rings. This might be a hint that we actually need four clusters instead of three. It could also mean that we need to spend some time preprocessing the data - maybe by normalizing the spectra.
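Testing the four-cluster hypothesis would only take a couple of lines (a sketch; I haven’t checked whether the result actually looks more sensible):

model4 = AgglomerativeClustering(n_clusters=4)
cluster4 = model4.fit_predict(A)
img4 = np.reshape(cluster4, (64, 64))

fig = plt.figure()
ax = fig.add_subplot()
ax.imshow(img4, interpolation="none");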

There are a million things we haven’t done. But given the right tools, they are just a few lines of code away.

9.4. Summary

9.4.1. 1. polynomial fits

polynomial.polyfit to fit, polynomial.polyval to evaluate.
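For a generic pair of x and y arrays, the whole pattern fits in two lines (a tiny synthetic example, not one of the datasets above):

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 0.5 * x

coeffs = polynomial.polyfit(x, y, deg=1)   # array([2. , 0.5]), constant term first
polynomial.polyval(1.5, coeffs)            # 2.75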

9.4.2. 2. sklearn

  1. Same procedure every time:

    1. instantiate pls = PLSRegression()

    2. fit pls.fit

    3. predict pls.predict

  2. Cross-validation to optimize parameters using GridSearchCV

  3. pipelines to include preprocessing

Never forget: Getting things to work the first time might be hard. But when you have gotten things to work once, you can repeat them at the push of a button for any similar task:

def load_image(path):
    """Loads rowwise FITR image from CSV file"""
    bead = np.genfromtxt(path)
    wn = bead[0]
    A = bead[1:]
    return wn, A
def wavelength_select(wn, A, mask=True):
    """Select useful wavelength ranges"""
    if mask:
        mask = ~((wn>1900)&(wn<2700)|(wn>3200)|(wn<950))
        wn = wn[mask]
        A = A[:,mask]
    return wn, A
def cluster_image(wn, A, n_clusters=3,plot=True,title=None):
    """Cluster hyperspectral image"""
    model = AgglomerativeClustering(n_clusters=n_clusters)
    cluster = model.fit_predict(A)
    if plot:
        fig = plt.figure()
        ax_spec = fig.add_subplot()
        ax_img = ax_spec.inset_axes([0,.5, .5,.5])
        lines = []
        for c in np.unique(cluster):
            lines.extend(ax_spec.plot(wn, np.mean(A[cluster==c], axis=0), 
                                      label="cluster {}".format(c)))
        colors = [l.get_color() for l in lines]
        cm = ListedColormap(colors)
        img = np.reshape(cluster, (64,64))
        mappable=ax_img.imshow(img, cmap=cm, vmin=-.5, vmax=np.amax(img)+.5, interpolation="none");
        ax_img.set_aspect(1, anchor="NW")
        ax_img.set_axis_off()
        ax_spec.legend(loc='upper center')
        ax_spec.set_xlabel("wavenumber")
        ax_spec.invert_xaxis()
        if title is not None:
            ax_spec.set_title(title)
        return cluster, model, fig
    return cluster, model
for image_filename in ["data/bead1.dpt", "data/bead2.dpt", "data/bead3.dpt"]:
    wnA = load_image(image_filename)
    wnA = wavelength_select(*wnA)
    cluster_image(*wnA,n_clusters=2,title=image_filename)
[Figures: cluster mean spectra and cluster images for bead1, bead2 and bead3]