Recap: Data preprocessing#

Basic data transformation techniques

Joaquin Vanschoren

Data transformations covered here#

  • Scaling and power transformations

  • Unsupervised feature selection

    • Feature engineering (e.g. binning, polynomial features,…)

    • Handling missing data

    • Handling imbalanced data

    • Dimensionality reduction (e.g. PCA)

    • Learned embeddings (e.g. for text)

  • Seek the best combinations of transformations and learning methods

    • Often done empirically, using cross-validation

    • Make sure that there is no data leakage during this process!
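
A minimal sketch of leak-free evaluation, assuming scikit-learn and a synthetic dataset from `make_classification` (neither the dataset nor the choice of kNN comes from the lecture). Wrapping the scaler and the model in one pipeline means cross-validation refits the scaler on the training part of every fold, so no statistics of the held-out part leak into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is fitted inside each fold, on that fold's training data only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let test-fold means and variances influence the training data, which is the leakage the bullet above warns about.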


Scaling#

  • Use when different numeric features have different scales (different range of values)

    • Features with much higher values may overpower the others

  • Goal: bring them all within the same range

  • Different methods exist

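# Imports and settings assumed by the code in this section
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml, make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import (LabelEncoder, MaxAbsScaler, MinMaxScaler, Normalizer,
                                   OneHotEncoder, RobustScaler, StandardScaler)
from sklearn.svm import SVC, LinearSVC

fig_scale = 1  # assumed global scaling factor for figure sizes

# Iris dataset with some added noise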
def noisy_iris():
    iris = fetch_openml("iris", return_X_y=True, as_frame=False)
    X, y = iris
    noise = np.random.normal(0, 0.1, 150)
    for i in range(4):
        X[:, i] = X[:, i] + noise
    X[:, 0] = X[:, 0] + 3 # shift the first feature away from the origin
    label_encoder = LabelEncoder().fit(y)
    y = label_encoder.transform(y)
    return X, y

scalers = [StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer(norm='l1'), MaxAbsScaler()]

def plot_scaling(scaler=scalers[0]):  # pass any scaler from `scalers`
    X, y = noisy_iris()
    X = X[:,:2] # Use only first 2 features
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8*fig_scale, 3*fig_scale))
    axes[0].scatter(X[:, 0], X[:, 1], c=y, s=1*fig_scale, cmap="brg")
    axes[0].set_xlim(-15, 15)
    axes[0].set_ylim(-5, 5)
    axes[0].set_title("Original Data")
    X_ = scaler.fit_transform(X)
    axes[1].scatter(X_[:, 0], X_[:, 1], c=y, s=1*fig_scale, cmap="brg")
    axes[1].set_xlim(-2, 2)
    axes[1].set_ylim(-2, 2)

    for ax in axes:
        pass  # per-axis styling not shown

Why do we need scaling?#

  • KNN: Distances depend mainly on feature with larger values

  • SVMs: (kernelized) dot products are also based on distances

  • Linear model: Feature scale affects regularization

    • Weights have similar scales, more interpretable

# Example by Andreas Mueller, with some tweaks
def plot_2d_classification(classifier, X, fill=False, ax=None, eps=None, alpha=1):
    # multiclass                                                                  
    if eps is None:                                                               
        eps = X.std(axis=0) / 2.  # per-feature margin around the data

    if ax is None:                                                                
        ax = plt.gca()                                                            

    x_min, x_max = X[:, 0].min() - eps[0], X[:, 0].max() + eps[0]
    y_min, y_max = X[:, 1].min() - eps[1], X[:, 1].max() + eps[1]
    # these should be 1000 but knn predict is unnecessarily slow
    xx = np.linspace(x_min, x_max, 100)                                          
    yy = np.linspace(y_min, y_max, 100)                                          

    X1, X2 = np.meshgrid(xx, yy)                                                  
    X_grid = np.c_[X1.ravel(), X2.ravel()]                                        
    decision_values = classifier.predict(X_grid)                                  
    ax.imshow(decision_values.reshape(X1.shape), extent=(x_min, x_max,            
                                                       y_min, y_max),             
            aspect='auto', origin='lower', alpha=alpha)

clfs = [KNeighborsClassifier(), SVC(), LinearSVC(), LogisticRegression(C=10)]

def plot_scaling_effect(classifier=clfs[0], show_test=False):  # pass any classifier from `clfs`
    X, y = make_blobs(centers=2, random_state=4, n_samples=50)
    X = X * np.array([1000, 1])
    y[7], y[27] = 0, 0 
    X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=1)
    clf2 = clone(classifier)
    clf_unscaled = classifier.fit(X_train, y_train)

    fig, axes = plt.subplots(1, 2, figsize=(7*fig_scale, 3*fig_scale))
    axes[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='bwr', label="train")
    axes[0].set_title("Without scaling. Accuracy:{:.2f}".format(clf_unscaled.score(X_test,y_test)))
    if show_test: # Hide test data for simplicity
        axes[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='^', cmap='bwr', label="test") 
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf_scaled = clf2.fit(X_train_scaled, y_train)

    axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap='bwr', label="train")
    axes[1].set_title("With scaling. Accuracy:{:.2f}".format(clf_scaled.score(X_test_scaled,y_test)))   
    if show_test: # Hide test data for simplicity
        axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], c=y_test, marker='^', cmap='bwr', label="test")

    plot_2d_classification(clf_unscaled, X, ax=axes[0], alpha=.2)
    plot_2d_classification(clf_scaled, scaler.transform(X), ax=axes[1], alpha=.3)
Standard scaling (standardization)#

  • Generally most useful, assumes data is more or less normally distributed

  • Per feature, subtract the mean value \(\mu\), scale by standard deviation \(\sigma\)

  • New feature has \(\mu=0\) and \(\sigma=1\), values can still be arbitrarily large

\[\mathbf{x}_{new} = \frac{\mathbf{x} - \mu}{\sigma}\]
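
As a small illustration of the formula, the sketch below standardizes a toy array by hand and with scikit-learn's `StandardScaler`. The array values are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [10.0, 700.0]])

# Manual standardization: subtract the per-feature mean, divide by the std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler computes the same thing (it uses the population std, ddof=0)
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for each feature
print(X_scaled.std(axis=0))   # close to 1 for each feature
```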

Min-max scaling#

  • Scales all features between a given \(min\) and \(max\) value (e.g. 0 and 1)

  • Makes sense if min/max values have meaning in your data

  • Sensitive to outliers

\[\mathbf{x}_{new} = \frac{\mathbf{x} - x_{min}}{x_{max} - x_{min}} \cdot (max - min) + min \]
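
The sketch below applies the formula by hand and with `MinMaxScaler`, on a made-up feature that contains one outlier. It is only meant to show why the method is sensitive to outliers: the outlier defines the maximum, so all other values are squeezed towards the minimum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with an outlier (100); values are illustrative only
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

lo, hi = 0.0, 1.0  # target range, called min and max in the formula
x_manual = (x - x.min()) / (x.max() - x.min()) * (hi - lo) + lo
x_scaled = MinMaxScaler(feature_range=(lo, hi)).fit_transform(x)

print(x_scaled.ravel())  # the four regular values all end up close to 0
```
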
Robust scaling#

  • Subtracts the median, divides by the interquartile range \(q_{75} - q_{25}\)

  • New feature has median 0 and an interquartile range of 1

  • Similar to standard scaler, but ignores outliers

Normalization#

  • Makes sure that feature values of each point (each row) sum up to 1 (L1 norm)

    • Useful for count data (e.g. word counts in documents)

  • Can also be used with L2 norm (sum of squares is 1)

    • Useful when computing distances in high dimensions

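    • On L2-normalized vectors, Euclidean distance and cosine similarity rank neighbors identically, since \(\|\mathbf{a}-\mathbf{b}\|^2 = 2\,(1 - \cos(\mathbf{a},\mathbf{b}))\)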
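
A short sketch of row-wise normalization with scikit-learn's `Normalizer`, using two made-up count vectors. It also checks numerically that, for unit-length rows, the squared Euclidean distance equals two times one minus the cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two rows of non-negative counts (e.g. word counts), illustrative values
X = np.array([[3.0, 1.0, 0.0], [1.0, 1.0, 2.0]])

X_l1 = Normalizer(norm="l1").fit_transform(X)  # each row sums to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)  # each row has unit length

# For unit-length rows: squared Euclidean distance = 2 * (1 - cosine similarity)
a, b = X_l2
cos_sim = a @ b
sq_dist = np.sum((a - b) ** 2)
print(sq_dist, 2 * (1 - cos_sim))
```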

Maximum Absolute scaler#

  • For sparse data (many features, but few are non-zero)

    • Maintain sparseness (efficient storage)

  • Scales all values so that maximum absolute value is 1

  • Similar to Min-Max scaling without changing 0 values
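
A minimal sketch, assuming SciPy's sparse matrices as input (the matrix values are made up). Because `MaxAbsScaler` only divides each column by its maximum absolute value, zeros stay zeros and the output remains sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Small sparse matrix: most entries are zero
X = sparse.csr_matrix(np.array([[0.0, 4.0, 0.0],
                                [-2.0, 0.0, 0.0],
                                [0.0, 1.0, 5.0]]))

X_scaled = MaxAbsScaler().fit_transform(X)

print(sparse.issparse(X_scaled))  # the result is still a sparse matrix
print(X_scaled.toarray())         # every column now has maximum absolute value 1
```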

Automatic Feature Selection#

It can be a good idea to reduce the number of features to only the most useful ones

  • Simpler models that generalize better (less overfitting)

    • Curse of dimensionality (e.g. kNN)

    • Even models such as RandomForest can benefit from this

    • Sometimes it is one of the main methods to improve models (e.g. gene expression data)

  • Faster prediction and training

    • Training time can be quadratic (or cubic) in number of features

  • Easier data collection, smaller models (less storage)

  • More interpretable models: fewer features to look at

Example: bike sharing#

  • The Bike Sharing Demand dataset shows the number of bikes rented in Washington DC

  • Some features are clearly more informative than others (e.g. temp, hour)

  • Some are correlated (e.g. temp and feel_temp)

  • We add two random features at the end

# Get bike sharing data from OpenML
bikes = fetch_openml(data_id=42713, as_frame=True)
X_bike_cat, y_bike = bikes.data, bikes.target

# Optional: take half of the data to speed up processing
X_bike_cat = X_bike_cat.sample(frac=0.5, random_state=1)
y_bike = y_bike.sample(frac=0.5, random_state=1)

# One-hot encode the categorical features
encoder = OneHotEncoder(dtype=int)
preprocessor = ColumnTransformer(transformers=[('cat', encoder, [0,7])], remainder='passthrough')
X_bike = preprocessor.fit_transform(X_bike_cat,y_bike)

# Add 2 random features at the end
random_features = np.random.rand(len(X_bike),2)
X_bike = np.append(X_bike,random_features, axis=1)

# Create feature names
bike_names = ['summer','winter', 'spring', 'fall', 'clear', 'misty', 'rain', 'heavy_rain']
Unsupervised feature selection#

  • Variance-based

    • Remove (near) constant features: choose a small variance threshold

    • Scale features before computing variance!

    • Infrequent values may still be important

  • Covariance-based

    • Remove correlated features

    • The small differences may actually be important

      • You don’t know because you don’t consider the target
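
A sketch of variance-based selection on synthetic data, assuming min-max scaling as the way to make variances comparable (standard scaling would set every variance to 1). The threshold of 0.01 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 100),      # varying feature on a small scale
    np.full(100, 5.0),          # constant feature
    rng.uniform(0, 1000, 100),  # varying feature on a large scale
])

# Bring all features to [0, 1] so that one threshold fits all of them
X_scaled = MinMaxScaler().fit_transform(X)

selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(X_scaled)
print(selector.get_support())  # boolean mask of the features that are kept
```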

Covariance-based feature selection#

  • Remove features \(X_i\) (= \(\mathbf{X_{:,i}}\)) that are highly correlated (have high correlation coefficient \(\rho\))

\[\rho (X_1,X_2)={\frac {{\mathrm {cov}}(X_1,X_2)}{\sigma (X_1)\sigma (X_2)}} = {\frac { \frac{1}{N-1} \sum_i (X_{i,1} - \overline{X_1})(X_{i,2} - \overline{X_2}) }{\sigma (X_1)\sigma (X_2)}}\]

  • Should we remove feel_temp? Or temp? Maybe one correlates more with the target?
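
One possible way to act on the correlation matrix is a greedy filter, sketched below on synthetic stand-ins for `temp`, `feel_temp`, and an unrelated feature. The threshold of 0.9 and the keep-the-first rule are assumptions for illustration. The filter ignores the target, so it cannot answer which of the two correlated features is the better one to keep.

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, 200)
feel_temp = temp + rng.normal(0, 0.5, 200)  # nearly a copy of temp
humidity = rng.uniform(0, 1, 200)           # unrelated to temp
X = np.column_stack([temp, feel_temp, humidity])

# Pairwise Pearson correlation between the columns
corr = np.corrcoef(X, rowvar=False)

# Greedy filter: keep a feature only if it is not highly
# correlated with a feature that was already kept
threshold = 0.9
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < threshold for k in keep):
        keep.append(j)

print(keep)  # indices of the retained columns
```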

Univariate statistics (F-test)#

  • Consider each feature individually (univariate), independent of the model that you aim to apply

  • Use a statistical test: is there a statistically significant linear relationship with the target?

  • Use F-statistic (or corresponding p value) to rank all features, then select features using a threshold

    • Best \(k\), best \(k\) %, false positive rate (FPR, the probability of keeping an uninformative feature),…

  • Cannot detect correlations (e.g. temp and feel_temp) or interactions (e.g. binary features)

  • For regression: does feature \(X_i\) correlate (positively or negatively) with the target \(y\)? With \(N\) samples, the test has \(N-2\) degrees of freedom:

\[\text{F-statistic} = \frac{\rho(X_i,y)^2}{1-\rho(X_i,y)^2} \cdot (N-2)\]

  • For classification: uses ANOVA: does \(X_i\) explain the between-class variance?

    • Alternatively, use the \(\chi^2\) test (only for categorical features)

\[\text{F-statistic} = \frac{\text{between-class variance}}{\text{within-class variance}} =\frac{var(\overline{X_i})}{\overline{var(X_i)}}\]
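
A sketch of univariate selection for regression on synthetic data, using scikit-learn's `f_regression` and `SelectKBest`. It also recomputes the F-statistic from the correlation coefficient. Note that `f_regression` centers the data by default and therefore uses \(N-2\) degrees of freedom.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=N)  # only feature 0 matters

# F-statistic and p-value per feature
F, p = f_regression(X, y)

# Same statistic, computed from the Pearson correlation with the target
rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
F_manual = rho**2 / (1 - rho**2) * (N - 2)

# Keep the single best-ranked feature
selector = SelectKBest(f_regression, k=1).fit(X, y)
print(selector.get_support())
```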

Mutual information#

  • Measures how much information \(X_i\) gives about the target \(Y\). In terms of entropy \(H\):

\[MI(X,Y) = H(X) + H(Y) - H(X,Y)\]

  • Idea: estimate H(X) as the average distance between a data point and its \(k\) Nearest Neighbors

    • You need to choose \(k\) and say which features are categorical

  • Captures complex dependencies (e.g. hour, month), but requires more samples to be accurate
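
To illustrate the difference with the F-test, the sketch below builds a synthetic target that depends on a feature quadratically, so the linear correlation is close to zero. The data and the choice of `n_neighbors=3` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
noise_feature = rng.normal(size=500)
X = np.column_stack([x, noise_feature])
y = x**2 + rng.normal(scale=0.1, size=500)  # nonlinear, symmetric dependency

# k-nearest-neighbor based estimate of the mutual information with y
mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)

# The F-test only looks at the linear relationship
F, _ = f_regression(X, y)

print(mi)  # the first feature gets a clearly higher score than the noise feature
print(F)
```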

Further techniques#

  • Many more powerful techniques exist

    • Model-based: Random Forests, Linear models, kNN

    • Wrapping techniques (black-box search)

    • Permutation importance

  • See the Data Preprocessing lecture.
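
As a rough sketch of two of these techniques, the code below uses a random forest for model-based selection (`SelectFromModel`) and for permutation importance on held-out data. The synthetic dataset, the forest size, and the median threshold are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Model-based selection: keep features whose importance reaches the median
selector = SelectFromModel(forest, prefit=True, threshold="median")
X_train_sel = selector.transform(X_train)

# Permutation importance: drop in test score when one feature is shuffled
result = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)
print(result.importances_mean)
```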

Feature Engineering#

  • Create new features based on existing ones

    • Polynomial features

    • Interaction features

    • Binning

  • Mainly useful for simple models (e.g. linear models)

    • Other models can learn interactions themselves

    • But may be slower, less robust than linear models


Polynomials#

  • Add all polynomials up to degree \(d\) and all products

    • Equivalent to polynomial basis expansions

\[[1, x_1, ..., x_p] \xrightarrow{} [1, x_1, ..., x_p, x_1^2, ..., x_p^2, ..., x_p^d, x_1 x_2, ..., x_{p-1} x_p]\]

Binning#

  • Partition numeric feature values into \(n\) intervals (bins)

  • Create \(n\) new one-hot features, 1 if original value falls in corresponding bin

  • Models different intervals differently (e.g. different age groups)
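
Both transformations are available in scikit-learn. The sketch below expands a single made-up sample with `PolynomialFeatures` and one-hot encodes a made-up age column into three equal-width bins with `KBinsDiscretizer`.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Polynomial expansion of one sample with two features (x1=2, x2=3)
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)  # columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)

# Equal-width binning of a numeric feature into 3 one-hot columns
ages = np.array([[5.0], [17.0], [30.0], [45.0], [80.0]])
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
ages_binned = binner.fit_transform(ages)
print(ages_binned)  # one row per sample, a single 1 in the column of its bin
```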

Binning + interaction features#

  • Add interaction features (or product features)

    • Product of the bin encoding and the original feature value

    • Learn different weights per bin
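
A sketch of this idea on a synthetic sine curve, assuming six equal-width bins: multiplying the one-hot bin encoding with the original value gives a linear model one intercept and one slope per bin.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=200)

# One-hot bin encoding of the single numeric feature
bins = KBinsDiscretizer(n_bins=6, encode="onehot-dense", strategy="uniform").fit_transform(x)

# Interaction features: bin indicator times the original value
X_inter = np.hstack([bins, bins * x])

plain = LinearRegression().fit(x, y)         # one global line
binned = LinearRegression().fit(X_inter, y)  # one line per bin
print(plain.score(x, y), binned.score(X_inter, y))  # R^2 on the training data
```

The scores are training-set fits and only show the added flexibility; a proper comparison would use held-out data.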

Categorical feature interactions#

  • One-hot-encode categorical feature

  • Multiply every one-hot-encoded column with every numeric feature

  • Allows building different submodels for different categories

Summary#

  • Data preprocessing is a crucial part of machine learning

    • Scaling is important for many methods (e.g. distance-based ones such as kNN and SVMs, and neural nets)

    • Selecting features can speed up models and reduce overfitting

    • Feature engineering is often useful for linear models

    • Many more techniques (e.g. missing value imputation, handling data imbalance,…) will be discussed in the data preprocessing lecture

  • Pipelines allow us to encapsulate multiple steps in a convenient way

    • Avoids data leakage, crucial for proper evaluation

  • Choose the right preprocessing steps and models in your pipeline

    • Cross-validation helps, but the search space is huge

    • Smarter techniques exist to automate this process (i.e. AutoML)
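
A minimal sketch of searching over preprocessing and model settings together, assuming scikit-learn's `GridSearchCV` on a synthetic dataset. The step names `scaler` and `clf` and the candidate values are arbitrary choices for this example. Because the search runs on a pipeline, each candidate scaler is refitted within every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Search over the scaling step itself and over the SVM's C parameter
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```

Even this small grid already trains 9 candidates times 5 folds, which hints at why the full search space quickly becomes too large for exhaustive search.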