Vaex-ml - Machine Learning

The vaex.ml package brings some machine learning algorithms to vaex. Install it by running pip install vaex-ml.

Vaex.ml stays close to the de facto standard ML package: scikit-learn. We first show two examples, KMeans and PCA, to see how the APIs compare, where they differ, and what the gain in performance is.

In [1]:
import vaex.ml.cluster
import numpy as np
%matplotlib inline

We use the well-known iris flower dataset, a classic in machine learning.

In [2]:
ds = vaex.ml.iris()
ds.scatter(ds.petal_width, ds.petal_length, c_expr=ds.class_)
Out[2]:
<matplotlib.collections.PathCollection at 0x1154b6b70>
[figure ml_4_1: scatter plot of petal_width vs petal_length, coloured by class_]
In [3]:
ds
Out[3]:
  #  sepal_width  petal_length  sepal_length  petal_width  class_  random_index
  0          3.0           4.2           5.9          1.5       1           114
  1          3.0           4.6           6.1          1.4       1            74
  2          2.9           4.6           6.6          1.3       1            37
  3          3.3           5.7           6.7          2.1       2           116
  4          4.2           1.4           5.5          0.2       0            61
...          ...           ...           ...          ...     ...           ...
145          3.4           1.4           5.2          0.2       0           119
146          3.8           1.6           5.1          0.2       0            15
147          2.6           4.0           5.8          1.2       1            22
148          3.8           1.7           5.7          0.3       0           144
149          2.9           4.3           6.2          1.3       1           102

KMeans

We use two features for the KMeans, and roughly put them on the same scale with a simple division. We then construct a KMeans object, quite similar to how you would do it in sklearn, and fit it.

In [4]:
features = ['petal_width/2', 'petal_length/5']
init = [[0, 1/5], [1.2/2, 4/5], [2.5/2, 6/5]]  # initial cluster centres, in the scaled feature space
kmeans = vaex.ml.cluster.KMeans(features=features, init=init, verbose=True)
kmeans.fit(ds)
Iteration    0, inertia  6.2609999999999975
Iteration    1, inertia  2.5062184444444435
Iteration    2, inertia  2.443455900151798
Iteration    3, inertia  2.418136327962199
Iteration    4, inertia  2.4161501474358995
Iteration    5, inertia  2.4161501474358995

We now transform the original dataset, similar to sklearn. However, we end up with a new dataset that contains an extra column (prediction_kmeans).

In [5]:
ds_predict = kmeans.transform(ds)
ds_predict
Out[5]:
  #  sepal_width  petal_length  sepal_length  petal_width  class_  random_index  prediction_kmeans
  0          3.0           4.2           5.9          1.5       1           114                  1
  1          3.0           4.6           6.1          1.4       1            74                  1
  2          2.9           4.6           6.6          1.3       1            37                  1
  3          3.3           5.7           6.7          2.1       2           116                  2
  4          4.2           1.4           5.5          0.2       0            61                  0
...          ...           ...           ...          ...     ...           ...                ...
145          3.4           1.4           5.2          0.2       0           119                  0
146          3.8           1.6           5.1          0.2       0            15                  0
147          2.6           4.0           5.8          1.2       1            22                  1
148          3.8           1.7           5.7          0.3       0           144                  0
149          2.9           4.3           6.2          1.3       1           102                  1

Although this column looks like any other, it is actually a virtual column: it does not use up any memory and is computed on the fly when needed, saving us precious RAM. Note that the other columns reference the original data as well, so this new dataset (ds_predict) takes up almost no memory at all, which is ideal for very large datasets, and quite different from what sklearn does.

In [6]:
ds_predict.virtual_columns['prediction_kmeans']
Out[6]:
<vaex.expression.Expression(expressions='kmean_predict_function(petal_width/2, petal_length/5)')> instance at 0x1154c8da0 [1, 1, 1, 2, 0 ... (total 150 values) ... 0, 0, 1, 0, 1]
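To illustrate the point, you can add a virtual column yourself as well. A minimal sketch (the column name petal_ratio is just an example), using the same ds as above:

# only the expression is stored; the values are computed lazily on access,
# just like prediction_kmeans above
ds.add_virtual_column('petal_ratio', 'petal_length / petal_width')
ds.virtual_columns['petal_ratio']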

By making a simple scatter plot we can see that the KMeans does a pretty good job.

In [7]:
import matplotlib.pylab as plt
fig, ax = plt.subplots(1, 2, figsize=(12,5))

plt.sca(ax[0])
plt.title('original classes')
ds.scatter(ds.petal_width, ds.petal_length, c_expr=ds.class_)

plt.sca(ax[1])
plt.title('predicted classes')
ds_predict.scatter(ds_predict.petal_width, ds_predict.petal_length, c_expr=ds_predict.prediction_kmeans)
Out[7]:
<matplotlib.collections.PathCollection at 0x1169e9e10>
[figure ml_13_1: petal_width vs petal_length, original classes (left) and predicted classes (right)]

KMeans benchmark

To demonstrate the performance and scaling of vaex, we continue with a special version of the iris dataset that has \(\sim10^7\) rows, created by repeating the original rows many times.

In [8]:
ds = vaex.ml.iris_1e7()

We now use random initial conditions, execute 10 runs in parallel (n_init=10), each for a maximum of 5 iterations, and benchmark it.

In [9]:
features = ['petal_width/2', 'petal_length/5']
kmeans = vaex.ml.cluster.KMeans(features=features, n_clusters=3, init='random', random_state=1,
                                max_iter=5, verbose=True, n_init=10)
In [10]:
%%timeit -n1 -r1 -o
kmeans.fit(ds)
Iteration    0, inertia  1784973.799998645 |  1548329.7999990159 |  354711.39999875583 |  434173.3999988521 |  1005871.0000026901 |  1312114.6000003854 |  1989377.3999927903 |  577104.4999989534 |  2747388.6000027955 |  628486.799997179
Iteration    1, inertia  481645.0225601919 |  233311.807648651 |  214794.26525253724 |  175205.9965848818 |  490218.54137152765 |  816598.0811733825 |  285786.25668654573 |  456305.06015295343 |  1205488.9851008556 |  262443.28449456714
Iteration    2, inertia  458443.87392026593 |  162015.13397359703 |  173081.69460305249 |  162580.0667193532 |  488402.9744732218 |  436698.8939923954 |  162626.54988994548 |  394680.5108569789 |  850103.6561417002 |  198213.0961053151
Iteration    3, inertia  394680.5108569789 |  161882.05987810466 |  162580.0667193532 |  161882.05987810466 |  487435.98983613244 |  214098.28159484005 |  161882.05987810466 |  275282.3731570135 |  594451.8937940609 |  169525.1971933692
Iteration    4, inertia  275282.37315701344 |  161882.05987810466 |  161882.05987810466 |  161882.05987810466 |  486000.8312405078 |  169097.27135654766 |  161882.05987810466 |  201144.2611065195 |  512055.18086238694 |  162023.37977993552
8.63 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Out[10]:
<TimeitResult : 8.63 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
In [11]:
time_vaex = _

We now do the same using sklearn.

In [12]:
from sklearn.cluster import KMeans
kmeans_sk = KMeans(n_clusters=3, init='random', max_iter=5, verbose=True, algorithm='full', n_jobs=-1,
                   precompute_distances=False, n_init=10)
# Doing an unfortunate memory copy
X = np.array(ds[features])
In [13]:
%%timeit -n1 -r1 -o
kmeans_sk.fit(X)
Initialization complete
Initialization complete
Initialization complete
Iteration  0, inertia 606591.200
Initialization complete
Iteration  0, inertia 2859921.800
Iteration  1, inertia 204568.645
Iteration  0, inertia 552321.200
Iteration  1, inertia 743308.405
Iteration  2, inertia 169097.271
Iteration  1, inertia 486281.350
Iteration  0, inertia 2764835.400
Iteration  2, inertia 252036.950
Iteration  3, inertia 163711.545
Iteration  2, inertia 481104.476
Iteration  1, inertia 497577.339
Iteration  3, inertia 167916.636
Iteration  4, inertia 162015.134
Iteration  3, inertia 458443.874
Iteration  2, inertia 207535.785
Iteration  4, inertia 163711.545
Iteration  4, inertia 394680.511
Iteration  3, inertia 171177.750
Initialization complete
Iteration  4, inertia 162580.067
Iteration  0, inertia 4727901.900
Initialization complete
Iteration  1, inertia 1183859.906
Initialization complete
Iteration  0, inertia 2338440.700
Iteration  2, inertia 596440.052
Iteration  0, inertia 223873.800
Iteration  1, inertia 751931.288
Iteration  3, inertia 265965.361
Initialization complete
Iteration  1, inertia 161882.060
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.341649e-05
Iteration  2, inertia 293381.958
Iteration  4, inertia 176624.304
Iteration  0, inertia 951708.200
Iteration  3, inertia 188528.277
Iteration  1, inertia 491323.101
Iteration  4, inertia 167916.636
Iteration  2, inertia 490998.649
Initialization complete
Iteration  3, inertia 489551.146
Iteration  4, inertia 488402.974
Initialization complete
Iteration  0, inertia 1220612.700
Iteration  0, inertia 1469993.400
Iteration  1, inertia 514343.770
Iteration  1, inertia 498914.693
Iteration  2, inertia 214309.979
Iteration  2, inertia 211156.559
Iteration  3, inertia 167916.636
Iteration  3, inertia 171177.750
Iteration  4, inertia 163711.545
Iteration  4, inertia 162580.067
47.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Out[13]:
<TimeitResult : 47.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
In [14]:
time_sklearn = _

We see that vaex is quite fast:

In [15]:
print('vaex is approx', time_sklearn.best / time_vaex.best, 'times faster for KMeans')
vaex is approx 5.523496296321969 times faster for KMeans

Moreover, sklearn needs to copy the data, while vaex is careful not to make unnecessary copies and to do a minimal number of passes over the data (out-of-core). Therefore vaex will happily scale to massive datasets, while with sklearn you are limited by the size of your RAM.
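As a minimal sketch of that out-of-core workflow (the file and column names here are hypothetical), a dataset exported to HDF5 can be opened memory-mapped and fitted without loading it into RAM:

import vaex
import vaex.ml.cluster

# vaex.open memory-maps the file instead of reading it into RAM
ds_large = vaex.open('my_huge_dataset.hdf5')
kmeans = vaex.ml.cluster.KMeans(features=['x/2', 'y/5'], n_clusters=3, max_iter=5)
kmeans.fit(ds_large)  # streams over the data in passes, without copying it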

PCA Benchmark

We now continue with benchmarking a PCA on 4 features:

In [16]:
features = [k.expression for k in [ds.col.petal_width, ds.col.petal_length, ds.col.sepal_width, ds.col.sepal_length]]
pca = ds.ml.pca(features=features)
In [17]:
%%timeit -n1 -r3 -o
pca = ds.ml.pca(features=features)
478 ms ± 13.9 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Out[17]:
<TimeitResult : 478 ms ± 13.9 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>
In [18]:
time_vaex = _

Since sklearn takes too much memory with the full dataset, we only use 10% of it for sklearn, and correct the timing for this factor afterwards.

In [19]:
# on my laptop this takes too much memory with sklearn, use only a subset
factor = 0.1
ds.set_active_fraction(factor)
len(ds)
Out[19]:
1005000
In [20]:
from sklearn.decomposition import PCA
pca_sk = PCA(n_components=2, random_state=33, svd_solver='full', whiten=False)
X = np.array(ds.trim()[features])
In [21]:
%%timeit -n1 -r3 -o
pca_sk.fit(X)
232 ms ± 37.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Out[21]:
<TimeitResult : 232 ms ± 37.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>
In [22]:
time_sklearn = _
In [23]:
print('vaex is approx', time_sklearn.best / time_vaex.best / factor, 'times faster for a PCA')
vaex is approx 4.449269957142027 times faster for a PCA

Again we see that vaex not only outperforms sklearn, but, more importantly, scales to much larger datasets.

In [24]:
ds_big = vaex.ml.iris_1e8()
In [25]:
%%timeit -n1 -r2 -o
pca = ds_big.ml.pca(features=features)
10.4 s ± 4.94 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
Out[25]:
<TimeitResult : 10.4 s ± 4.94 s per loop (mean ± std. dev. of 2 runs, 1 loop each)>

Note that although this dataset is \(10\times\) larger, it takes more than \(10\times\) longer to execute. This is because the dataset did not fit into memory this time, so we are limited by hard drive speed. But note that it is possible to actually run it, instead of getting a MemoryError!

XGBoost

This example shows the integration with xgboost; it is still work in progress.

In [26]:
import vaex.ml.xgboost
/Users/maartenbreddels/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [27]:
ds = vaex.ml.iris()
In [28]:
features = [k.expression for k in [ds.col.petal_width, ds.col.petal_length, ds.col.sepal_width, ds.col.sepal_length]]
In [29]:
ds_train, ds_test = ds.ml.train_test_split()
In [30]:
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softmax',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this datset
xgmodel = vaex.ml.xgboost.XGBModel(features=features, num_round=10, param=param)
In [31]:
xgmodel.fit(ds_train, ds_train.class_, copy=True)
In [32]:
ds_predict = xgmodel.transform(ds_test)
ds_predict
Out[32]:
 #  sepal_width  petal_length  sepal_length  petal_width  class_  random_index  xgboost_prediction
 0          3.0           4.2           5.9          1.5       1           114                 1.0
 1          3.0           4.6           6.1          1.4       1            74                 1.0
 2          2.9           4.6           6.6          1.3       1            37                 1.0
 3          3.3           5.7           6.7          2.1       2           116                 2.0
 4          4.2           1.4           5.5          0.2       0            61                 0.0
..         ...           ...           ...          ...     ...           ...                 ...
25          2.5           4.0           5.5          1.3       1            83                 1.0
26          2.7           3.9           5.8          1.2       1            94                 1.0
27          2.9           1.4           4.4          0.2       0            54                 0.0
28          2.3           1.3           4.5          0.3       0           145                 0.0
29          3.2           5.7           6.9          2.3       2            84                 2.0
In [33]:
import matplotlib.pylab as plt
fig, ax = plt.subplots(1, 2, figsize=(12,5))

plt.sca(ax[0])
plt.title('original classes')
ds_predict.scatter(ds_predict.petal_width, ds_predict.petal_length, c_expr=ds_predict.class_)

plt.sca(ax[1])
plt.title('predicted classes')
ds_predict.scatter(ds_predict.petal_width, ds_predict.petal_length, c_expr=ds_predict.xgboost_prediction)
Out[33]:
<matplotlib.collections.PathCollection at 0x14bd7ccc0>
[figure ml_49_1: petal_width vs petal_length, original classes (left) and xgboost-predicted classes (right)]
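As a rough check of the result, here is a minimal sketch (assuming ds.evaluate returns a numpy array for an expression) comparing the predicted and true classes on the test set:

import numpy as np

# boolean array: True where the prediction matches the true class
correct = ds_predict.evaluate('xgboost_prediction == class_')
print('accuracy on the test set:', np.mean(correct))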

One hot encoding

A brief demonstration of one-hot encoding.

In [34]:
ds.ml_one_hot_encoding(ds.col.class_.expression)
In [35]:
ds
Out[35]:
  #  sepal_width  petal_length  sepal_length  petal_width  class_  random_index  class__0  class__1  class__2
  0          3.0           4.2           5.9          1.5       1           114         0         1         0
  1          3.0           4.6           6.1          1.4       1            74         0         1         0
  2          2.9           4.6           6.6          1.3       1            37         0         1         0
  3          3.3           5.7           6.7          2.1       2           116         0         0         1
  4          4.2           1.4           5.5          0.2       0            61         1         0         0
...          ...           ...           ...          ...     ...           ...       ...       ...       ...
145          3.4           1.4           5.2          0.2       0           119         1         0         0
146          3.8           1.6           5.1          0.2       0            15         1         0         0
147          2.6           4.0           5.8          1.2       1            22         0         1         0
148          3.8           1.7           5.7          0.3       0           144         1         0         0
149          2.9           4.3           6.2          1.3       1           102         0         1         0
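As a quick sanity check, a minimal sketch (again assuming ds.evaluate) verifying that each class__N column is 1 exactly where class_ equals N:

import numpy as np

class_values = ds.evaluate('class_')
for n in range(3):
    # the one-hot column should be 1 for rows of class n and 0 elsewhere
    assert np.all(ds.evaluate('class__%d' % n) == (class_values == n))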