Caching

Vaex can cache task results, such as aggregations or the internal hashmaps used for groupby operations, to make recurring calculations much faster, at the cost of computing cache keys and storing/retrieving the cached values.

Internally, Vaex calculates fingerprints (e.g. hashes of the data, or file paths and mtimes) to create cache keys that are stable across processes, so that restarting a process will most likely produce the same cache keys.
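
To see what such a fingerprint looks like, the short sketch below prints one for a DataFrame. This assumes the DataFrame.fingerprint() method is available in your Vaex version; the exact value will differ per dataset and DataFrame state.

import vaex
df = vaex.datasets.titanic()
# The fingerprint identifies the data and the DataFrame state; it feeds into the
# cache keys and should stay stable across process restarts for the same file.
print(df.fingerprint())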

See configuration of the cache.

Caches can be turned on globally like this:

[1]:
import vaex
df = vaex.datasets.titanic()
vaex.cache.memory();  # cache on globally

One can verify that the cache is turned on via:

[2]:
vaex.cache.is_on()
[2]:
True

The cache can be globally turned off again:

[3]:
vaex.cache.off()
vaex.cache.is_on()
[3]:
False

The cache can also be turned on with a context manager, so that it is turned off again when the block exits. Here we use the disk cache. The disk cache is shared among processes, which makes it ideal for processes that restart, or for using Vaex in a web service with multiple workers. Consider the following example:

[4]:
with vaex.cache.disk(clear=True):
    print(df.age.mean())  # The very first time the mean is computed
29.8811345124283
[5]:
# outside of the context manager, the cache is turned off again
vaex.cache.is_on()
[5]:
False
[6]:
with vaex.cache.disk():
    print(df.age.mean())  # The second time, the result is read from the disk cache
29.8811345124283
[7]:
vaex.cache.is_on()
[7]:
False
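
The cache applies not only to aggregation results but also to the internal hashmaps built for groupby operations, as mentioned at the start. A minimal sketch of a cached groupby, assuming the pclass column of the Titanic dataset and the same disk cache context manager used above:

with vaex.cache.disk():
    df.groupby('pclass', agg='count')  # builds the groupby hashmap and caches the result
    df.groupby('pclass', agg='count')  # the repeated call is served from the cache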