You can undo the zipping operation using zip
itself.
Let’s explore that in Python.
By using the unpacking operator *, you don’t have to manually specify the number of arguments (although in the example below I do assume the result unpacks into two components).
In Python 3, zip returns an iterator instead of a list, so you need to explicitly cast it to a list if you want one.
>>> a = [1, 3]
>>> b = [2, 4]
>>> c = list(zip(a,b))
>>> c
[(1, 2), (3, 4)]
>>> a, b = list(zip(*c))
>>> a
(1, 3)
>>> b
(2, 4)
The inverse operation will always return tuples in Python, so if your original input was a list, you need to convert the results back to lists.
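For instance (continuing with c from above), one way to get lists back is a small list comprehension:
>>> a, b = [list(t) for t in zip(*c)]
>>> a
[1, 3]
>>> b
[2, 4]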
This operation is super handy.
For example, I wrote a class to recommend relevant texts for a query document based on their distance in a vector space.
The recommend() function of this class returns a list of recommended texts and a list of tuples containing relevant metadata about those texts. So the metadata will be a list of (distance, document_id, type) tuples.
We may be interested in easily retrieving all distances, document_ids, etc. as lists of their own.
We can do that by using the unpack+zip trick:
distances, document_ids, types = zip(*meta)
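To make that concrete, here’s what it looks like with some made-up metadata (the values are purely illustrative):
>>> meta = [(0.1, 'doc1', 'news'), (0.3, 'doc2', 'blog'), (0.7, 'doc3', 'news')]
>>> distances, document_ids, types = zip(*meta)
>>> distances
(0.1, 0.3, 0.7)
>>> document_ids
('doc1', 'doc2', 'doc3')
>>> types
('news', 'blog', 'news')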
Here’s my programming tip of the day. You can flatten a nested list of lists into a single flat list with a nested list comprehension. Wow, phrasing.
It’s easy to get confused. If you forget how to do it, you can first write out the whole loop:
# Flatten list
>>> flat = []
>>> nested = [[1, 2, 3], [4, 5, 6]]
>>> for sub in nested:
...     for element in sub:
...         flat.append(element)
...
>>> flat
[1, 2, 3, 4, 5, 6]
To collapse this into a one-liner, work from the outer scope inwards:
flat = [ el for sub in nested for el in sub ]
Voila. Lean and mean.
Feature selection should be done after train-test splitting to avoid leaking information from the test set into the training pipeline. This also means that feature selection should be done within each fold of cross-validation, not before. This sounds obvious, but it is something that goes wrong easily and often. Especially when the feature extraction and selection pipeline is relatively expensive, the cost of repeating it in each fold can tempt you to do it only once, before cross-validation. It may also be that feature selection was done on the data set before any other machine learning work even started, so it’s easy to overlook. In this post we discuss the do’s and don’ts when it comes to leaking information from a test set during preprocessing.
This is an example of how it should not be done (source):
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# random data:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)
selector = SelectKBest(k=25)
# first select features
X_selected = selector.fit_transform(X,y)
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.76000000000000001
In this example, we expect a performance around 0.5 because our data and target labels are randomly sampled.
Nevertheless, we find a significantly better performance even though there is no interesting signal in the data, because our feature selection is biased by information from (what will become) the test set.
You can see that the feature selector is fitted using the target signal y, which includes samples that will later be in the test set.
Instead, you should only fit the data preprocessing steps on the training data after splitting, and then at inference time apply (but not refit!) the preprocessing steps:
# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# then select features using the training set only
selector = SelectKBest(k=25)
X_train_selected = selector.fit_transform(X_train,y_train)
# fit a simple logistic regression again
lr = LogisticRegression()
lr.fit(X_train_selected, y_train)
# select the same features on the test set, predict, and get the test accuracy:
X_test_selected = selector.transform(X_test)
y_pred = lr.predict(X_test_selected)
accuracy_score(y_test, y_pred)
# 0.52800000000000002
This now gives the expected performance! Because there is no real signal relating the features to the labels, the classifier is effectively making random guesses for this binary classification problem.
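Earlier I mentioned that feature selection should also happen within each fold of cross-validation. A convenient way to get that right (a minimal sketch, not part of the original code) is to wrap the selector and the classifier in a scikit-learn Pipeline, so the selector is refitted on the training part of every fold:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
# the same kind of random data as above
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)
# the selector is refitted on the training part of every fold,
# so no information leaks from the held-out part
pipe = Pipeline([
    ("select", SelectKBest(k=25)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
# expected to hover around 0.5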
There is a single exception to the above procedure. Unsupervised feature selection procedures do not use the target signal and thus do not have the same biasing effect towards the test set. You may for example remove features that always have the same value, i.e. select based on (zero) variance.
Okay, well, let’s test that!
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=1) # Normally you'd do some form of scaling
# first select features
X_selected = selector.fit_transform(X) # y is not used here
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
print(X.shape, X_selected.shape)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.512
Which is again close to the baseline, as expected!
In this dummy example we know that the values of each feature follow the same distribution, since we generated them by sampling from it.
In practice, it may be that some features have a very different scale, which makes selection using a single variance threshold unreliable, because the variance depends on the chosen scale.
If I change a measurement in meters into centimeters, the same data will suddenly have a larger variance!
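A quick sanity check with some made-up measurements:
import numpy as np
x_m = np.array([1.0, 2.0, 3.0])  # hypothetical measurements in meters
x_cm = x_m * 100                 # the same measurements in centimeters
print(x_m.var(), x_cm.var())
# 0.6666666666666666 6666.666666666667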
This is why you would scale your data, e.g. using a MinMaxScaler, before applying a variance threshold (“standard scaling” to zero mean and unit variance is in this case useless because, well… the variance will always be 1).
If you apply this form of feature scaling before splitting the data, you’ll use global data statistics, in this case the global minimum and maximum per feature. However subtle, this is also a form of leakage from the test set into the training pipeline, which may lead you to either over- or underestimate your model performance. A preprocessing step does not leak information if it only requires information from a single sample, i.e. a “row” in the data array. Scaling instead uses the whole “column” of feature values. By scaling features using statistics from the test set as well, you essentially ignore the fact that the data distribution of your test data may be different, and you therefore cannot adequately evaluate the model’s ability to generalize to unseen data.
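For completeness, here is a minimal sketch (reusing X and y from above) of the leak-free way to scale: fit the scaler on the training data only, then apply it, unchanged, to the test data.
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# split first, as before
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# fit the scaler on the training data only...
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and only apply (not refit!) it to the test data
X_test_scaled = scaler.transform(X_test)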
In short, even though unsupervised feature selection does not strictly leak data by itself, this insight is not very useful in practice because 1) you’ll likely also need other preceding steps that do leak information, and 2) you’ll have to constantly be careful and second-guess each step, which costs effort while still leaving you at risk.
It’s better to just follow the rule of thumb: avoid leakage by always fitting your data preprocessing and feature selection only on the training data. During testing, only apply the data preprocessing steps used during the training phase.
I regularly encounter situations where I have an array that specifies which elements to keep in another array.
If you for example want to provide batched inputs to a BERT language model as a tensor, you have to pad the input text sequences so that all sequences in the tensor have the same length.
BERT uses attention masks (Boolean arrays) to indicate which elements correspond to actual input tokens and which ones are special tokens, such as the meaningless padding token [PAD]. If I use BERT to classify input tokens, I don’t care about classifications on the [PAD] token, so I want to filter them out using the attention mask.
Numpy is very convenient for this use case, because it supports using boolean arrays directly as masks. But before we dive into masking with boolean arrays, let’s briefly discuss Numpy masked arrays.
Something that throws me off sometimes is that Numpy has a masked array class, but this has a slightly different and specific use case, namely to work with “arrays that may have missing or invalid entries”. The purpose of this is to be able to use the input array as is, but exclude the invalid elements from common computations. A simple example from the documentation:
>>> import numpy as np
>>> import numpy.ma as ma
>>> x = np.array([1, 2, 3, -1, 5])
>>> mx = ma.masked_array(x, mask=[0, 0, 0, 1, 0])
>>> mx.mean()
2.75
In this case the 1 in the mask indicates that the fourth data point is invalid (which to me is slightly counter-intuitive, because in the use case above I want to keep the entries with a 1).
The benefit of the masked array module is that you don’t have to modify the shape of the input array, which is useful in cases where you do tensor computations and the shape of the input must be preserved.
In my use case, however, I actually just want to throw away the data I don’t care about.
So in our use case we have two arrays, where the second serves as a mask that indicates which elements to keep in the first array. We can use:
- numpy.nonzero()
- a Boolean array (created by casting, or with a logical operator)
- numpy.where()
>>> arr = np.array([1, 2, 3])
>>> mask = np.array([0, 1, 0])
Method 1. We essentially want to keep all elements from arr in the corresponding places where mask is non-zero:
>>> arr[np.nonzero(mask)]
array([2])
Python interprets False as 0 and True as 1.
E.g. you can do arithmetic on booleans like:
>>> arr > 1
array([False, True, True])
>>> np.sum(arr > 1)
2
This means nonzero() can be used to mask an array using arbitrary conditions:
>>> arr[np.nonzero(arr > 1)]
array([2, 3])
This could be handy if you instead want to select elements of arr based on some threshold.
Although of course, you don’t really need to bother with this because you can apply the mask directly:
>>> arr[arr > 1]
array([2, 3])
Method 2. The same can be achieved by explicitly casting the mask to a Boolean array:
>>> arr[mask.astype(bool)]
array([2])
Method 3. But the most straightforward usage to me is creating the Boolean array based on a logical operator. Numpy also nicely handles these operations by applying them to each array element:
>>> arr[mask != 0]
array([2])
Method 4. If you use numpy.where with a boolean condition, it is equivalent to using numpy.nonzero():
>>> arr[np.where(mask > 0)]
array([2])
The behavior of numpy.where() is more general, because it also allows you to pick elements from two array-likes x and y depending on a Boolean condition: an element from x is picked where the condition is True, and from y otherwise. This can simulate a bit of the behavior of the numpy.ma.MaskedArray class.
E.g. you can use the mask to keep only certain values, set the others to NaN, and then use numpy functions that ignore NaN values:
>>> arr_nan = np.where(mask > 0, arr, np.nan)
>>> arr_nan
array([nan,  2., nan])
>>> np.mean(arr_nan) # This will not give correct results
nan
>>> np.nanmean(arr_nan) # But this is a successful operation on the masked array
2.0
Note that the numpy.where function expects array-like arguments, but it will automatically broadcast the value np.nan to an array of the correct shape.
Masking also works for multidimensional arrays; let’s redefine arr and mask as 2-D arrays:
>>> arr = np.array([ [1,2,3], [4,5,6] ])
>>> arr
array([[1, 2, 3],
[4, 5, 6]])
>>> mask = np.array( [[0, 1, 0], [1, 1, 0]])
>>> mask
array([[0, 1, 0],
[1, 1, 0]])
If you apply the masking strategies 1-3 from above, it is good to know that the shape of the input array is not preserved (unlike with numpy.ma). Instead you end up with a flat array of the preserved elements.
>>> arr[np.nonzero(mask)]
array([2, 4, 5])
If you want to preserve the shape of the input array, you can use numpy.where:
>>> np.where(mask > 0, arr, np.nan)
array([[nan, 2., nan],
[ 4., 5., nan]])
You can also provide a unidimensional mask to a multidimensional input array.
For example, in the token classification task mentioned above, BERT will output activations or probabilities over classes for each input token.
Each input sequence thus gives an output with shape (n_tokens, n_classes). If you do batch processing, many of these tokens will be [PAD], i.e. we want to mask on the first “token” dimension. The attention mask in this case will have shape (n_tokens,).
Let’s say we have a short sentence with only two tokens, for which we predict fictitious activations over three output classes.
The first token is actual text, the second token is padding.
We can remove the padding tokens as follows:
>>> a = np.array( [ [2,3,4], [5,6,7] ] )
>>> a.shape
(2, 3)
>>> m = np.array( [1,0] )
>>> m.shape
(2,)
>>> a[np.nonzero(m)]
array([[2, 3, 4]])
The output of np.nonzero can be a bit hard to read, because it’s not organized by row indexes but by dimension:
>>> np.nonzero(mask)
(array([0, 1, 1], dtype=int64), array([1, 0, 1], dtype=int64))
These two arrays have length three because we have three non-zero elements. The row indexes of these three points are [0, 1, 1] and the column indexes are [1, 0, 1], which selects the following coordinates:
>>> np.transpose(np.nonzero(mask))
array([[0, 1],
[1, 0],
[1, 1]], dtype=int64)
Which is (by definition) the output of np.argwhere:
>>> np.argwhere(mask)
array([[0, 1],
[1, 0],
[1, 1]], dtype=int64)
However, you can’t use this output directly as an index (integer array indexing works dimension-wise), so it shouldn’t be used for masking purposes.
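To see why (continuing with arr and mask from above): using the argwhere output as an index just selects whole rows along the first axis, rather than individual elements.
>>> arr[np.argwhere(mask)].shape   # rows 0, 1, 1, 0, 1, 1 are selected, not single elements
(3, 2, 3)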
Update: Edo made a Spotify playlist of this digest.
I got to know Men I Trust a while back and, what luck, they just dropped a new album in August (yes, I skipped two months’ worth of digest, so buckle up). The next song, Organon, sounds as if you start playing a record, but then the room gets hot and the record starts melting. But not in a bad way. You kind of dig it and pour yourself a red wine.
The new album of Men I Trust is great, but I’ve become somewhat addicted to their old album Headroom. Take for example this song with ephemeral vocals, but a slowly emerging thick arpeggio that gives it a nice drive:
Okay, staying a bit in the same vibe, give this beautifully sad song a listen. Put this on your “music for break ups” playlist. Actually, you can just go ahead and put the whole album on that list.
In the previous digest, I included the latest BADBADNOTGOOD single. This month the full album dropped and it’s worth a listen. Here is another gem (the video is again a piece of art in its own right):
When this song was released all the way back in April I deliberately chose not to include it in my digest. Well, I was an absolute idiot! This song gets better with every listen. Practically every sound in it is unique. It’s as if the instruments go off the rails sometimes, yet it’s never off key and quite perfect. The drums are minimalistic, yet if you listen closely you’ll notice all these little shifts and breaks. The vocals go from low, warm and fuzzy to an absolutely crazy moment (e.g. from 2:30 on) where it sounds as if two people are singing at the same time.
The next one is not new, but scratches all the right itches for me. I have a huge soft spot for ghostly female vocals over darker sounding music, like techno or metal. The apotheosis reminds me of Gesaffelstein or Jon Hopkins, and the visuals of the clip are brilliant:
Okay, we’re approaching the moment where the faint of heart may call it quits. The next two are loud. Spiritbox recently dropped a new album. Holy Roller and Constance are both great songs released earlier as singles. I haven’t decided yet, but I think the opening song may actually be my favorite (unfortunately no clip):
Amenra dropped a new album with Dutch lyrics a few months back. Listening to Amenra has a spiritual quality to it; it’s condensed catharsis. Colin screams like he’s leaving his body. Don’t be afraid to turn up the volume :-)