{ "cells": [ { "cell_type": "markdown", "metadata": { "nbsphinx-toctree": { "hidden": true } }, "source": [ "[Installation](installing.rst)\n", "[Tutorials](tutorials.rst)\n", "[Guides](guides.rst)\n", "[Configuration](conf.md)\n", "[API](api.rst)\n", "[Datasets](datasets.rst)\n", "[FAQ](faq.rst)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Vaex?\n", "\n", "Vaex is a Python library for lazy **Out-of-Core DataFrames** (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate *statistics* such as mean, sum, count, and standard deviation on an *N-dimensional grid* at up to **a billion** ($10^9$) objects/rows **per second**. Visualization is done using **histograms**, **density plots**, and **3D volume rendering**, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Why vaex\n", " \n", " * **Performance:** works with huge tabular data, processes $\\gt 10^9$ rows/second\n", " * **Lazy / Virtual columns:** compute on the fly, without wasting RAM\n", " * **Memory efficient:** no memory copies when doing filtering/selections/subsets.\n", " * **Visualization:** directly supported, a one-liner is often enough.\n", " * **User friendly API:** you will only need to deal with the DataFrame object, and tab completion plus docstrings will help you out (e.g. `df.mean`); it feels very similar to Pandas.\n", " * **Lean:** separated into multiple packages\n", "    * `vaex-core`: DataFrame and core algorithms, takes NumPy arrays as input columns.\n", "    * `vaex-hdf5`: Provides memory-mapped NumPy arrays to a DataFrame.\n", "    * `vaex-arrow`: [Arrow](https://arrow.apache.org/) support for cross-language data sharing.\n", "    * `vaex-viz`: Visualization based on matplotlib.\n", "    * `vaex-jupyter`: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.\n", "    * `vaex-astro`: Astronomy-related transformations and FITS file support.\n", "    * `vaex-server`: Provides a server to access a DataFrame remotely.\n", "    * `vaex-distributed`: (Deprecated) Now part of vaex-enterprise.\n", "    * `vaex-qt`: Program written using Qt GUI.\n", "    * `vaex`: Meta package that installs all of the above.\n", "    * `vaex-ml`: [Machine learning](tutorial_ml.ipynb)\n", "\n", " * **Jupyter integration**: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and JupyterLab." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "Using conda: \n", "\n", " * `conda install -c conda-forge vaex`\n", "\n", "Using pip:\n", "\n", " * `pip install --upgrade vaex`\n", " \n", "Or read the [detailed instructions](installing.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started\n", "\n", "We assume that you have installed vaex, and are running a [Jupyter notebook server](https://jupyter.readthedocs.io/en/latest/running.html). We start by importing vaex and asking it to give us an example dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import vaex\n", "df = vaex.example() # open the example dataset provided with vaex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, you can [download some larger datasets](datasets.rst), or [read in your csv file](api.rst#vaex.from_csv)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x y z vx vy vz E L Lz FeH
0 -0.777470767 2.10626292 1.93743467 53.276722 288.386047 -95.2649078 -121238.171875 831.0799560546875 -336.426513671875 -2.309227609164518
1 3.77427316 2.23387194 3.76209331 252.810791 -69.9498444 -56.3121033 -100819.9140625 1435.1839599609375 -828.7567749023438 -1.788735491591229
2 1.3757627 -6.3283844 2.63250017 96.276474 226.440201 -34.7527161 -100559.9609375 1039.2989501953125 920.802490234375 -0.7618109022478798
3 -7.06737804 1.31737781 -6.10543537 204.968842 -205.679016 -58.9777031 -70174.8515625 2441.724853515625 1183.5899658203125 -1.5208778422936413
4 0.243441463 -0.822781682 -0.206593871 -311.742371 -238.41217 186.824127 -144138.75 374.8164367675781 -314.5353088378906 -2.655341358427361
... ... ... ... ... ... ... ... ... ... ...
329,995 3.76883793 4.66251659 -4.42904139 107.432999 -2.13771296 17.5130272 -119687.3203125 746.8833618164062 -508.96484375 -1.6499842518381402
329,996 9.17409325 -8.87091351 -8.61707687 32.0 108.089264 179.060638 -68933.8046875 2395.633056640625 1275.490234375 -1.4336036247720836
329,997 -1.14041007 -8.4957695 2.25749826 8.46711349 -38.2765236 -127.541473 -112580.359375 1182.436279296875 115.58557891845703 -1.9306227597361942
329,998 -14.2985935 -5.51750422 -8.65472317 110.221558 -31.3925591 86.2726822 -74862.90625 1324.5926513671875 1057.017333984375 -1.225019818838568
329,999 10.5450506 -8.86106777 -4.65835428 -2.10541415 -27.6108856 3.80799961 -95361.765625 351.0955505371094 -309.81439208984375 -2.5689636894079477
" ], "text/plain": [ "# x y z vx vy vz E L Lz FeH\n", "0 -0.777470767 2.10626292 1.93743467 53.276722 288.386047 -95.2649078 -121238.171875 831.0799560546875 -336.426513671875 -2.309227609164518\n", "1 3.77427316 2.23387194 3.76209331 252.810791 -69.9498444 -56.3121033 -100819.9140625 1435.1839599609375 -828.7567749023438 -1.788735491591229\n", "2 1.3757627 -6.3283844 2.63250017 96.276474 226.440201 -34.7527161 -100559.9609375 1039.2989501953125 920.802490234375 -0.7618109022478798\n", "3 -7.06737804 1.31737781 -6.10543537 204.968842 -205.679016 -58.9777031 -70174.8515625 2441.724853515625 1183.5899658203125 -1.5208778422936413\n", "4 0.243441463 -0.822781682 -0.206593871 -311.742371 -238.41217 186.824127 -144138.75 374.8164367675781 -314.5353088378906 -2.655341358427361\n", "... ... ... ... ... ... ... ... ... ... ...\n", "329,995 3.76883793 4.66251659 -4.42904139 107.432999 -2.13771296 17.5130272 -119687.3203125 746.8833618164062 -508.96484375 -1.6499842518381402\n", "329,996 9.17409325 -8.87091351 -8.61707687 32.0 108.089264 179.060638 -68933.8046875 2395.633056640625 1275.490234375 -1.4336036247720836\n", "329,997 -1.14041007 -8.4957695 2.25749826 8.46711349 -38.2765236 -127.541473 -112580.359375 1182.436279296875 115.58557891845703 -1.9306227597361942\n", "329,998 -14.2985935 -5.51750422 -8.65472317 110.221558 -31.3925591 86.2726822 -74862.90625 1324.5926513671875 1057.017333984375 -1.225019818838568\n", "329,999 10.5450506 -8.86106777 -4.65835428 -2.10541415 -27.6108856 3.80799961 -95361.765625 351.0955505371094 -309.81439208984375 -2.5689636894079477" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df # will pretty print the DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using square brackets[], we can easily filter or get different views on the DataFrame." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x y
0 -0.777471 2.10626
1 -7.06738 1.31738
2 -5.17174 7.82915
3 -15.9539 5.77126
4 -12.3995 13.9182
" ], "text/plain": [ " # x y\n", " 0 -0.777471 2.10626\n", " 1 -7.06738 1.31738\n", " 2 -5.17174 7.82915\n", " 3 -15.9539 5.77126\n", " 4 -12.3995 13.9182" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_negative = df[df.x < 0] # easily filter your DataFrame, without making a copy\n", "df_negative[:5][['x', 'y']] # take the first five rows, and only the 'x' and 'y' column (no memory copy!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When dealing with huge datasets, say a billion rows ($10^9$), computations with the data can waste memory, up to 8 GB for a new column. Instead, vaex uses lazy computation, storing only a representation of the computation, and computations are done on the fly when needed. You can just use many of the numpy functions, as if it was a normal array." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " instance at 0x118f71550 values=[1.159963903, 7.53636647, 4.00826287, -13.17281341, 0.036847591999999985 ... (total 330000 values) ... -0.66020346, 0.5570163800000003, 1.1170881900000003, -22.95331667, 5.8866963199999995] " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "# creates an expression (nothing is computed)\n", "some_expression = df.x + df.z\n", "some_expression # for convenience, we print out some values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These expressions can be added to a DataFrame, creating what we call a *virtual column*. These virtual columns are similar to normal columns, except they do not waste memory." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-0.06713149126400597, -0.0501732470530304)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['r'] = some_expression # add a (virtual) column that will be computed on the fly\n", "df.mean(df.x), df.mean(df.r) # calculate statistics on normal and virtual columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the `binby` argument (analogous to SQL's `GROUP BY`), together with the `shape` and `limits` arguments." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-9.67777315, -8.99466731, -8.17042477, -7.57122871, -6.98273954,\n", " -6.28362848, -5.70005784, -5.14022306, -4.52820368, -3.96953423,\n", " -3.3362477 , -2.7801045 , -2.20162243, -1.57910621, -0.92856689,\n", " -0.35964342, 0.30367721, 0.85684123, 1.53564551, 2.1274488 ,\n", " 2.69235585, 3.37746363, 4.04648274, 4.59580105, 5.20540601,\n", " 5.73475069, 6.28384101, 6.67880226, 7.46059303, 8.13480148,\n", " 8.90738265, 9.6117928 ])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(df.r, binby=df.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[22., 33., 37., ..., 58., 38., 45.],\n", " [37., 36., 47., ..., 52., 36., 53.],\n", " [34., 42., 47., ..., 59., 44., 56.],\n", " ...,\n", " [73., 73., 84., ..., 41., 40., 37.],\n", " [53., 58., 63., ..., 34., 35., 28.],\n", " [51., 32., 46., ..., 47., 33., 36.]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d\n", 
"df.count(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These one and two dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience we can use [heatmap](api.rst#vaex.viz.DataFrameAccessorViz.heatmap), or see the [other visualization commands](api.rst#vaex-viz)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.viz.heatmap(df.x, df.y, show=True); # make a plot quickly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continue\n", "[Continue the tutorial here](tutorial.ipynb) or check the [guides](guides.rst)." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }