What is Vaex?

Vaex is a Python library for Out-of-Core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc., on an N-dimensional grid at up to a billion (\(10^9\)) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory is wasted).

Why vaex

  • Performance: works with huge tabular data, processes \(\gt 10^9\) rows/second
  • Virtual columns: compute on the fly, without wasting RAM
  • Memory efficient: no memory copies are made when doing filtering/selections/subsets.
  • Visualization: directly supported, a one-liner is often enough.
  • User friendly API: you will only need to deal with a Dataset object, and tab completion plus docstrings will help you out (ds.mean<tab>); it feels very similar to Pandas.
  • Lean: separated into multiple packages
    • vaex-core: Dataset and core algorithms, takes numpy arrays as input columns.
    • vaex-hdf5: Provides memory mapped numpy arrays to a Dataset.
    • vaex-viz: Visualization based on matplotlib.
    • vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.
    • vaex-astro: Astronomy related transformations and FITS file support.
    • vaex-server: Provides a server to access a dataset remotely.
    • vaex-distributed: (proof of concept) combines multiple servers / a cluster into a single dataset for distributed computations.
    • vaex: meta package that installs all of the above.
    • vaex-qt: Program written using Qt GUI.
  • Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab.

Installation

Using conda:

  • conda install -c conda-forge vaex

Using pip:

  • pip install --upgrade vaex

Or read the detailed instructions
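
If you only need part of the functionality, the individual subpackages listed above can presumably be installed on their own, for example:

  • pip install vaex-core vaex-hdf5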

Getting started

We assume you have installed vaex and are running a Jupyter notebook server. We start by importing vaex and asking it for the sample example dataset.

In [36]:
import vaex
ds = vaex.example()  # open the example dataset provided with vaex

Instead, you can download some larger datasets, or read in your CSV file.
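
For example, a minimal sketch of opening your own data (the file names here are hypothetical; vaex.open memory-maps supported formats such as hdf5, while vaex.from_csv reads a csv into memory):

import vaex
ds = vaex.open('myfile.hdf5')     # memory mapped: opens instantly, data is read lazily
ds = vaex.from_csv('myfile.csv')  # csv files are read into memory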

In [49]:
ds  # will pretty print a table
Out[49]:
#  x  y  z  vx  vy  vz  E  Lz  FeH
0  -0.77747076699999995  2.1062629199999998  1.93743467  53.276721999999999  288.38604700000002  -95.264907800000003  -121238.171875  -336.426513671875  -2.3092276091645179
1  3.7742731599999999  2.2338719399999998  3.76209331  252.81079099999999  -69.949844400000003  -56.312103299999997  -100819.9140625  -828.75677490234375  -1.788735491591229
2  1.3757626999999999  -6.3283844  2.6325001700000001  96.276473999999993  226.440201  -34.752716100000001  -100559.9609375  920.802490234375  -0.76181090224787984
3  -7.0673780400000004  1.31737781  -6.1054353700000004  204.968842  -205.67901599999999  -58.977703099999999  -70174.8515625  1183.5899658203125  -1.5208778422936413
4  0.243441463  -0.82278168200000001  -0.20659387100000001  -311.74237099999999  -238.412171  86.824127  -144138.75  -314.53530883789062  -2.6553413584273611
...
329,995  3.7688379300000001  4.6625165900000001  -4.4290413900000001  107.432999  -2.13771296  17.5130272  -119687.3203125  -508.96484375  -1.6499842518381402
329,996  9.1740932500000003  -8.8709135099999994  -8.61707687  32.0  108.089264  179.06063800000001  -68933.8046875  1275.490234375  -1.4336036247720836
329,997  -1.1404100699999999  -8.4957694999999998  2.2574982600000002  8.4671134899999991  -38.276523599999997  -127.541473  -112580.359375  115.58557891845703  -1.9306227597361942
329,998  -14.298593500000001  -5.5175042200000002  -8.6547231700000005  110.221558  -31.3925591  86.272682200000006  -74862.90625  1057.017333984375  -1.2250198188385679
329,999  10.5450506  -8.86106777  -4.6583542800000002  -2.1054141500000001  -27.6108856  3.80799961  -95361.765625  -309.81439208984375  -2.5689636894079477

Using square brackets ([], see vaex.dataset.Dataset.__getitem__ in the API documentation), we can easily filter or get different views on the dataset.

In [20]:
ds_negative = ds[ds.x < 0]  # easily filter your dataset, without making a copy
ds_negative[:5][['x', 'y']]  # take the first five rows, and only the 'x' and 'y' column (no memory copy!)
Out[20]:
#  x  y
0  -0.77747076699999995  2.1062629199999998
1  -7.0673780400000004  1.31737781
2  -5.17174435  7.8291530600000003
3  -15.953885100000001  5.7712588299999998
4  -12.3994961  13.9181805
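
Filters can also be combined with the bitwise operators; a small sketch (this & syntax is an assumption, extrapolated from the bracket filtering above):

# parentheses around each condition are required
ds_subset = ds[(ds.x < 0) & (ds.y > 2)]
ds_subset[:5][['x', 'y']]  # again, no memory copy is made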

When dealing with huge datasets, say a billion rows (\(10^9\)), computations with the data can waste memory, up to 8 GB for a single new column. Instead, vaex uses lazy computation: only a representation of the computation is stored, and the computations are done on the fly when needed. Even so, you can use many of the numpy functions as if it were a normal array.

In [21]:
import numpy as np
# creates an expression (nothing is computed)
r = np.sqrt(ds.x**2 + ds.y**2 + ds.z**2)
r  # for convenience, we print out some values
Out[21]:
<vaex.expression.Expression(expressions='sqrt((((x) ** (2)) + ((y) ** (2))) + ((z) ** (2)))')> instance at 0x110e6ec88 [2.96554503966, 5.77829281049, 6.9907960395, 9.43184275271, 0.882561312135 ... (total 330000 values) ... 7.45383176151, 15.3984124911, 8.86425027393, 17.601047186, 14.540181525]
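
When the actual values are needed, an expression can be materialized into a numpy array; a minimal sketch (assuming the evaluate method):

values = ds.evaluate(r)  # forces the computation, returns a numpy array
values[:5]               # the first five computed values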

These expressions can be added to the dataset, creating what we call a virtual column. These virtual columns are similar to normal columns, except they do not waste memory.

In [22]:
ds['r'] = r  # add a (virtual) column that will be computed on the fly
ds.mean(ds.x), ds.mean(ds.r)  # calculate statistics on normal and virtual columns
Out[22]:
(-0.067131491264005971, 9.407082338299773)
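
Virtual columns can also be created from an expression string; a minimal sketch (add_virtual_column is assumed to be the explicit API for this, and the column name 'r2' is hypothetical):

ds.add_virtual_column('r2', 'x**2 + y**2')  # stored as an expression, computed on the fly
ds.mean(ds.r2)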

One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the binby argument (analogous to SQL's groupby), while the shape and limits arguments control the binning.

In [15]:
ds.mean(ds.r, binby=ds.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)
Out[15]:
array([ 15.01058183,  14.43693006,  13.72923338,  12.90294499,
        11.86615103,  11.03563695,  10.12162553,   9.2969267 ,
         8.58250973,   7.86602644,   7.19568442,   6.55738773,
         6.01942499,   5.51462457,   5.15798991,   4.8274218 ,
         4.7346551 ,   5.1343761 ,   5.46017944,   6.02199777,
         6.54132124,   7.27025256,   7.99780777,   8.55188217,
         9.30286584,   9.97067561,  10.81633293,  11.60615795,
        12.33813552,  13.10488982,  13.86868565,  14.60577266])
In [23]:
ds.mean(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d
ds.count(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram
Out[23]:
array([[ 22.,  33.,  37., ...,  58.,  38.,  45.],
       [ 37.,  36.,  47., ...,  52.,  36.,  53.],
       [ 34.,  42.,  47., ...,  59.,  44.,  56.],
       ...,
       [ 73.,  73.,  84., ...,  41.,  40.,  37.],
       [ 53.,  58.,  63., ...,  34.,  35.,  28.],
       [ 51.,  32.,  46., ...,  47.,  33.,  36.]])
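
The shape and limits can also be specified per binby dimension; a small sketch (the exact accepted forms are an assumption based on the calls above):

# a 32 x 64 grid with different limits per axis
ds.count(binby=[ds.x, ds.y], shape=[32, 64], limits=[[-10, 10], [-20, 20]])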

These one and two dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience we can use plot1d, plot, or see the list of plotting commands.

In [17]:
ds.plot(ds.x, ds.y, show=True);  # make a plot quickly
[figure: output of ds.plot, a 2d density plot of x versus y]
Out[17]:
<matplotlib.image.AxesImage at 0x1193c8dd8>
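
Similarly, a one dimensional histogram is a one-liner; a minimal sketch using the plot1d command mentioned above:

ds.plot1d(ds.x, limits=[-20, 20]);  # histogram of the x column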