In most cases, minimizing memory usage is Vaex’ first priority, and performance comes second. This allows Vaex to work with very large datasets without you shooting yourself in the foot.
However, this sometimes comes at the cost of performance.
When we add a new column to a dataframe based on existing columns, Vaex will create a virtual column, e.g.:
import vaex
import numpy as np

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()
In this dataframe, x uses memory, while y does not; it will be evaluated in chunks when needed. To demonstrate the performance implications, let us compute the mean of each column, which forces the evaluation.
CPU times: user 2.74 s, sys: 12.3 ms, total: 2.75 s Wall time: 71.2 ms
CPU times: user 3.88 s, sys: 635 ms, total: 4.52 s Wall time: 304 ms
From this we can see that the same computation (the mean) can be much slower on a virtual column (304 ms versus 71.2 ms wall time here), a penalty we pay for saving memory.
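To make the trade-off concrete, here is a minimal pure-Python sketch of the idea. It is not how Vaex is implemented; it only illustrates that a virtual column stores an expression and recomputes it chunk by chunk on every pass, while a materialized column stores the results once. The names `y_expr`, `virtual_mean` and `CHUNK` are hypothetical, and the chunk size is an arbitrary assumption.

```python
import math

CHUNK = 1_000  # hypothetical chunk size, chosen only for illustration


def y_expr(x):
    # The same expression as the virtual column y, applied to a single value.
    return math.log(x + 1) - math.log(abs(x**2 + 1))


def virtual_mean(xs, expr, chunk=CHUNK):
    """Mean of expr(xs), holding only one chunk of results at a time."""
    total = 0.0
    for start in range(0, len(xs), chunk):
        # Results for this chunk are computed, summed, and then discarded.
        block = [expr(v) for v in xs[start:start + chunk]]
        total += sum(block)
    return total / len(xs)


xs = [float(i) for i in range(10_000)]

# "Virtual": recomputed on every call, no extra storage for y.
mean_virtual = virtual_mean(xs, y_expr)

# "Materialized": y is computed once and stored, so later passes only read it.
ys = [y_expr(v) for v in xs]
mean_materialized = sum(ys) / len(ys)
```

Both approaches give the same mean; the difference is whether the cost of evaluating the expression is paid on every pass or only once, at the price of storing `ys`.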
Materializing the columns
We can ask Vaex to materialize a single column, or all virtual columns, using df.materialize:
df_mat = df.materialize()
CPU times: user 2.54 s, sys: 14 ms, total: 2.56 s Wall time: 68.1 ms
CPU times: user 2.64 s, sys: 18.7 ms, total: 2.66 s Wall time: 68.1 ms
We now get equal performance for both columns.
Consideration in backends with multiple workers
As is often the case with Python web frameworks, we use multiple workers, e.g. using gunicorn. If every worker materialized the dataframe, it would waste a lot of memory. There are two solutions to this issue:
Save to disk
Export the dataframe to disk in HDF5 or Arrow format as a pre-processing step, and let all workers access the same file. Thanks to memory mapping, all workers will share the same memory.
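The memory-mapping behaviour can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not Vaex code: it writes a small binary file (the hypothetical `column.bin`, playing the role of an exported column) and maps it read-only twice, the way two workers might. Read-only mappings of the same file are backed by the operating system's page cache, so the data does not need to be duplicated for each mapping.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 float64 values to a file (a stand-in for an exported column).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000d", *[float(i) for i in range(1000)]))


def open_column(path):
    """Map the file read-only; pages are loaded lazily when first accessed."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Two "workers" map the same file instead of each holding a private copy.
f1, m1 = open_column(path)
f2, m2 = open_column(path)

# Read element 10 (8 bytes per float64) through each mapping.
first = struct.unpack_from("<d", m1, 8 * 10)[0]
second = struct.unpack_from("<d", m2, 8 * 10)[0]

for handle in (m1, f1, m2, f2):
    handle.close()
```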
Materialize a single time
Gunicorn has the following command line flag:
--preload Load application code before the worker processes are forked. [False]
This lets gunicorn first run your app (a single time), allowing you to do the materialize step. After your script has run, gunicorn forks, and all workers share the same memory.
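The flag can also be set in a gunicorn configuration file, which is itself a Python module. The following is a minimal sketch, assuming a configuration file named `gunicorn.conf.py`; the worker count is an arbitrary example value.

```python
# gunicorn.conf.py (assumed file name)

# Equivalent of the --preload command line flag: load the application
# once in the master process, before the worker processes are forked.
preload_app = True

# Number of worker processes to fork (example value, adjust as needed).
workers = 4
```

It could then be used as `gunicorn -c gunicorn.conf.py myapp:app`, where `myapp:app` is a hypothetical application module and object.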
A good idea could be to mix the two, and use Vaex’ df.fingerprint method to cache the file to disk.
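import os

import numpy as np
import vaex

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

# The fingerprint changes when the dataframe (including virtual columns) changes.
filename = "vaex-cache-" + df.fingerprint() + ".hdf5"
if not os.path.exists(filename):
    # First run: write the evaluated columns to the cache file.
    df.export(filename, progress=True)
else:
    # Later runs: open the memory-mapped cache file instead.
    df = vaex.open(filename)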
In case the virtual columns change, rerunning will create a new cache file, and changing back will use the previously generated cache file. This is especially useful during development.
In this case, it is still important to let gunicorn run a single process first (using the --preload flag), to avoid multiple workers doing the same work.
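The cache-naming behaviour described above (a new file when the virtual columns change, the old file again when they change back) follows from deriving the file name from a hash of the definitions. Below is a minimal standard-library sketch of that idea. The helper `cache_filename` is hypothetical and only stands in for df.fingerprint; how Vaex actually computes its fingerprint is not shown here.

```python
import hashlib
import json


def cache_filename(virtual_columns, prefix="vaex-cache-"):
    """Derive a cache file name from the column definitions (hypothetical helper)."""
    # Serialize deterministically so equal definitions give equal hashes.
    payload = json.dumps(virtual_columns, sort_keys=True).encode()
    return prefix + hashlib.sha256(payload).hexdigest()[:16] + ".hdf5"


original = {"y": "log(x + 1) - log(abs(x**2 + 1))"}
changed = {"y": "log(x + 2)"}

name_a = cache_filename(original)        # first definition
name_b = cache_filename(changed)         # edited definition: different file
name_c = cache_filename(dict(original))  # changed back: same file as name_a
```

Because the name depends only on the definitions, old cache files stay valid and are picked up again without recomputation.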