Progress Bars

Basic progress bars

Progress bars are an excellent way to get an idea of how long a certain computation might take. Most of the methods responsible for computations or aggregations in Vaex support the display of progress bars. Displaying one is as easy as passing progress=True:

[1]:
import vaex

df = vaex.datasets.taxi()
df.total_amount.mean(progress=True)
mean [########################################] 100.00% elapsed time  :     0.09s =  0.0m =  0.0h

[1]:
array(11.6269824)

If you are working in a Jupyter notebook, you can pass progress='widget' to get a nicer-looking progress bar, provided by ipywidgets:

[2]:
df.payment_type.unique(progress='widget')
[2]:
['CRD', 'CSH']

Rich-based progress bars

Using Rich-based progress bars we can take this idea to the next level. With Rich, one gets to see a tree structure of progress bars that gives the user an idea of what Vaex does internally, and how long each step takes. Each leaf in this tree is a Task, while the nodes are used to group tasks logically. For instance, in the following example the last node, named ‘mean’, uses the mean aggregation, which creates two tasks: a sum and a count aggregation.

[3]:
with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count')
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'))
    result_3 = df.tip_amount.mean()

In the last column (between brackets) we also see how many passes over the data Vaex had to do to compute all results. The last two tasks are done together in the 5th pass.
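To illustrate why a ‘mean’ node has two child tasks, here is a minimal pure-Python sketch that derives a mean from a running sum and a running count over chunks of data. It is a conceptual illustration only, not how Vaex implements its aggregations; the function name chunked_mean and the chunked layout of the data are hypothetical.

```python
def chunked_mean(chunks):
    """Compute a mean from two running aggregations: a sum and a count."""
    total = 0.0   # plays the role of the "sum" task
    count = 0     # plays the role of the "count" task
    for chunk in chunks:          # each chunk is visited exactly once
        total += sum(chunk)
        count += len(chunk)
    # An empty input has no defined mean, so return NaN in that case.
    return total / count if count else float("nan")


# Hypothetical data, split into chunks as an out-of-core engine might do.
tips = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
mean_tip = chunked_mean(tips)
print(mean_tip)
```

Because both running values are updated in the same loop, the two tasks can share a single pass over the data.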

If we want to do all computations in a single pass over the data for performance reasons, we can use the asynchronous mode of Vaex: pass delay=True to each call and then trigger the computation with df.execute() (see Async programming with Vaex for more details).

[4]:
with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    result_3 = df.tip_amount.mean(delay=True)
    df.execute()
result_1 = result_1.get()
result_2 = result_2.get()
result_3 = result_3.get()

We see that all computations are now done in a single pass over the data. The speedup is only slight in this case because we are not I/O bound. On slower disks, or with slower file formats (e.g. Parquet), the difference will be larger.
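The difference between running each computation immediately and scheduling them together can be sketched in plain Python. This is a simplified model under stated assumptions, not the Vaex scheduler: the names run_eagerly and run_delayed, and the representation of a task as an (initial state, update function) pair, are hypothetical.

```python
def run_eagerly(chunks, tasks):
    """Run each task immediately: one pass over the data per task."""
    passes = 0
    results = {}
    for name, (init, update) in tasks.items():
        state = init
        for chunk in chunks:
            state = update(state, chunk)
        results[name] = state
        passes += 1
    return results, passes


def run_delayed(chunks, tasks):
    """Collect all tasks first, then feed every chunk to all of them once."""
    states = {name: init for name, (init, _) in tasks.items()}
    for chunk in chunks:
        for name, (_, update) in tasks.items():
            states[name] = update(states[name], chunk)
    return states, 1


# Each task is an (initial state, update function) pair.
tasks = {
    "sum": (0.0, lambda s, c: s + sum(c)),
    "count": (0, lambda s, c: s + len(c)),
    "max": (float("-inf"), lambda s, c: max(s, max(c))),
}
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0]]

eager_results, eager_passes = run_eagerly(chunks, tasks)
delayed_results, delayed_passes = run_delayed(chunks, tasks)
print(eager_passes, delayed_passes)
```

Both strategies produce the same results, but the delayed variant reads every chunk only once. When reading a chunk is the expensive part, this is where the time is saved.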

Combining this with the caching feature, we can clearly see the effect on later calculations, and the efficiency of Vaex:

[5]:
vaex.cache.disk(clear=True)  # turn on cache, and delete all cache entries

with vaex.progress.tree('rich', title="Warm up cache"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    df.execute()


with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    result_3 = df.tip_amount.mean(delay=True)
    df.execute()
vaex.cache.off()  # turn the cache off again
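The effect of a warm cache on later calculations can be pictured with a small pure-Python sketch. It assumes a dictionary standing in for the on-disk cache and simple string keys; how Vaex actually derives its cache keys is not modeled here, and the helper cached_task is hypothetical.

```python
cache = {}          # stands in for an on-disk cache
computed = []       # records which tasks actually had to run


def cached_task(key, compute):
    """Return the cached result for `key`, computing it only on a miss."""
    if key not in cache:
        computed.append(key)
        cache[key] = compute()
    return cache[key]


data = [1.0, 2.0, 3.0, 4.0]

# "Warm up" run: both tasks are computed and stored.
cached_task("sum", lambda: sum(data))
cached_task("count", lambda: len(data))

# Later run: the first two tasks are cache hits; only "max" is new work.
cached_task("sum", lambda: sum(data))
cached_task("count", lambda: len(data))
cached_task("max", lambda: max(data))

print(computed)
```

In this sketch only the task that was not part of the warm-up run does any work the second time, which mirrors the role of the third computation in the example above.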