Performance notes

In most cases, minimizing memory usage is Vaex’ first priority, and performance comes second. This allows you to work with very large datasets in Vaex without shooting yourself in the foot.

However, this sometimes comes at the cost of performance.

Virtual columns

When we add a new column to a dataframe based on existing columns, Vaex will create a virtual column, e.g.:

[18]:
import vaex
import numpy as np
x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

In this dataframe, x uses memory, while y does not; it will be evaluated in chunks when needed. To demonstrate the performance implications, let us compute with the column to force the evaluation.

[21]:
%%time
df.x.mean()
CPU times: user 2.74 s, sys: 12.3 ms, total: 2.75 s
Wall time: 71.2 ms
[21]:
array(49999999.5)
[22]:
%%time
df.y.mean()
CPU times: user 3.88 s, sys: 635 ms, total: 4.52 s
Wall time: 304 ms
[22]:
array(-17.42068049)

From this, we can see that a similar computation (the mean) on a virtual column can be much slower (here about 300 ms versus about 70 ms of wall time), a penalty we pay for saving memory.
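
The trade-off can be sketched in plain Python. This is only an illustration of the idea, not how Vaex is implemented internally: the names virtual_y and chunked_mean are hypothetical, and the array is much smaller than in the cells above. The "virtual" version recomputes the expression chunk by chunk on every pass and never holds the full result, while the "materialized" version stores every value once and can reuse it.

```python
import math


def virtual_y(x):
    # Same expression as the y column above, for a single value.
    return math.log(x + 1) - math.log(abs(x**2 + 1))


def chunked_mean(n, expr, chunk_size=10_000):
    # Evaluate expr over 0..n-1 one chunk at a time.
    # Only one chunk of results is alive at any moment.
    total = 0.0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        chunk = [expr(float(i)) for i in range(start, stop)]
        total += math.fsum(chunk)
    return total / n


n = 100_000

# "Virtual" column: low memory, but the expression is re-evaluated on every pass.
lazy_mean = chunked_mean(n, virtual_y)

# "Materialized" column: all n results are stored, later passes only read them.
materialized = [virtual_y(float(i)) for i in range(n)]
eager_mean = math.fsum(materialized) / n
```

Both approaches give the same mean (up to floating-point rounding); they differ only in where the cost is paid, in CPU time on each pass or in memory once.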

Materializing the columns

We can ask Vaex to materialize a column, or all virtual columns, using df.materialize:

[23]:
df_mat = df.materialize()
[24]:
%%time
df_mat.x.mean()
CPU times: user 2.54 s, sys: 14 ms, total: 2.56 s
Wall time: 68.1 ms
[24]:
array(49999999.5)
[25]:
%%time
df_mat.y.mean()
CPU times: user 2.64 s, sys: 18.7 ms, total: 2.66 s
Wall time: 68.1 ms
[25]:
array(-17.42068049)

We now get equal performance for both columns, because y is now stored in memory just like x.

Considerations in backends with multiple workers

As is often the case with web frameworks in Python, we use multiple workers, e.g. using gunicorn. If every worker materialized the dataframe, it would waste a lot of memory. There are two solutions to this issue:

Save to disk

Export the dataframe to disk in HDF5 or Arrow format as a pre-processing step, and let all workers access the same file. Due to memory mapping, all workers will share the same memory.

e.g.

df.export('materialized-data.hdf5', progress=True)
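
Why memory mapping lets workers share memory can be sketched with Python's standard mmap module. This is a minimal illustration, not Vaex code: a raw binary file of float64 values stands in for the exported HDF5 or Arrow file, and the file name and the worker_sum function are hypothetical. Each call maps the same file read-only, so the operating system can back every mapping with the same cached pages instead of giving each worker its own copy.

```python
import array
import mmap
import os
import tempfile

# Pre-processing step (done once): write 1000 float64 values to a file.
values = array.array("d", (float(i) for i in range(1000)))
path = os.path.join(tempfile.mkdtemp(), "column-x.bin")
with open(path, "wb") as f:
    values.tofile(f)


def worker_sum(path):
    # What each worker would do: map the file read-only and compute on it.
    # No worker reads the whole file into its own private buffer.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # View the mapped bytes as float64 values without copying them.
            with memoryview(mapped) as raw, raw.cast("d") as view:
                return sum(view)


# Two "workers" reading the same file see the same data.
first = worker_sum(path)
second = worker_sum(path)
```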

Materialize a single time

Gunicorn has the following command line flag:

--preload             Load application code before the worker processes are forked. [False]

This will let gunicorn first run your app (a single time), allowing you to do the materialize step. After your script has run, gunicorn will fork, and all workers will share the same memory.
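
A minimal sketch of an app laid out for --preload is shown below. The module name app.py, the load_data function, and the plain list it returns are all hypothetical stand-ins; in a real app the module-level step would be something like df.materialize(). The point is that the expensive work sits at module level, so with --preload it runs once in the gunicorn master process before the workers are forked.

```python
# app.py (hypothetical module name), started with for example:
#   gunicorn --preload --workers 4 app:application
import os


def load_data():
    # Stand-in for the expensive step, e.g. materializing a dataframe.
    return [float(i) for i in range(1000)]


# Module level: with --preload this runs a single time, before forking.
DATA = load_data()
LOADED_IN_PID = os.getpid()


def application(environ, start_response):
    # Plain WSGI callable; each forked worker only reads the preloaded DATA.
    mean = sum(DATA) / len(DATA)
    body = (
        f"mean={mean} loaded_in_pid={LOADED_IN_PID} "
        f"worker_pid={os.getpid()}\n"
    ).encode()
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Under --preload, the response should report a loaded_in_pid that differs from worker_pid, which is a quick way to check that the data was loaded before the fork. The sharing relies on copy-on-write after forking, so it holds as long as the workers only read the data.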

Tip:

A good idea could be to mix the two, and use Vaex’ df.fingerprint method to cache the file to disk.

E.g.

import vaex
import numpy as np
import os

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

filename = "vaex-cache-" + df.fingerprint() + ".hdf5"
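if not os.path.exists(filename):
    # First run: evaluate the virtual columns once and write them to disk.
    df.export(filename, progress=True)
# Always continue with the exported file, so that y is read from disk
# (memory mapped) instead of being evaluated on the fly.
df = vaex.open(filename)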

In case the virtual columns change, rerunning will create a new cache file, and changing back will use the previously generated cache file. This is especially useful during development.

In this case, it is still important to let gunicorn run a single process first (using the --preload flag), to avoid multiple workers doing the same work.
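
The cache-file naming described above can be illustrated without Vaex. The sketch below is not how df.fingerprint is implemented; cache_filename is a hypothetical helper that hashes a description of the source data and of the virtual column expressions. It shows why changing an expression leads to a new file name, and why changing it back finds the earlier file again.

```python
import hashlib


def cache_filename(source_id, virtual_columns):
    # Hash everything that determines the content of the dataframe:
    # a description of the source data and each virtual column expression.
    h = hashlib.sha256()
    h.update(source_id.encode())
    for name in sorted(virtual_columns):
        h.update(f"\n{name}={virtual_columns[name]}".encode())
    return "vaex-cache-" + h.hexdigest()[:16] + ".hdf5"


source = "x=arange(100_000_000, float64)"

original = cache_filename(source, {"y": "log(x + 1) - log(abs(x**2 + 1))"})
changed = cache_filename(source, {"y": "log(x + 2)"})
reverted = cache_filename(source, {"y": "log(x + 1) - log(abs(x**2 + 1))"})

# A changed expression gives a new name; reverting gives the old name back.
print(original != changed, original == reverted)
```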
