{ "cells": [ { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-04-24T14:13:48.731159Z", "start_time": "2020-04-24T14:13:48.725142Z" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# I/O Kung-Fu: get your data in and out of [Vaex](https://github.com/vaexio/vaex)\n", "\n", "If you want to try out this notebook with a live Python kernel, use mybinder:\n", "\n", "\"https://mybinder.org/badge_logo.svg\"\n", "\n", "\n", "## Data input\n", "\n", "Every project starts with reading in some data. Vaex supports several data sources:\n", "\n", "- Binary file formats:\n", " \n", " - [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format#HDF5)\n", " - [Apache Arrow](https://arrow.apache.org/)\n", " - [Apache Parquet](https://parquet.apache.org/)\n", " - [FITS](https://en.wikipedia.org/wiki/FITS)\n", " \n", " - Text based file formats:\n", " \n", " - [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)\n", " - [ASCII](https://en.wikipedia.org/wiki/Text_file)\n", " - [JSON](https://www.json.org/json-en.html)\n", " \n", " - In-memory data representations:\n", " \n", " - [pandas](https://pandas.pydata.org/) DataFrames and everything that pandas can read\n", " - [Apache Arrow](https://arrow.apache.org/) Tables\n", " - [numpy](https://numpy.org/) arrays\n", " - Python dictionaries\n", " - Single row DataFrames\n", " \n", "The following examples show the best practices of getting your data in Vaex.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binary file formats\n", "\n", "If your data is already in one of the supported binary file formats (HDF5, Apache Arrow, Apache Parquet, FITS), opening it with Vaex rather simple:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.478955Z", "start_time": "2020-05-01T17:02:06.193973Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# name agecity
0John 17Edinburgh
1Sally 33Groningen
" ], "text/plain": [ " # name age city\n", " 0 John 17 Edinburgh\n", " 1 Sally 33 Groningen" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import vaex\n", "\n", "# Reading a HDF5 file\n", "df_names = vaex.open('./data/io/sample_names_1.hdf5')\n", "df_names" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.538413Z", "start_time": "2020-05-01T17:02:07.481811Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# fruit amountorigin
0mango 5Malaya
1banana 10Ecuador
2orange 7Spain
" ], "text/plain": [ " # fruit amount origin\n", " 0 mango 5 Malaya\n", " 1 banana 10 Ecuador\n", " 2 orange 7 Spain" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reading an arrow file\n", "df_fruits = vaex.open('./data/io/sample_fruits.arrow')\n", "df_fruits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Opening such data is instantenous regardless of the file size on disk: Vaex will just memory-map the data instead of reading it in memory. This is the optimal way of working with large datasets that are larger than available RAM.\n", "\n", "If your data is contained within multiple files, one can open them all simultaneously like this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.602563Z", "start_time": "2020-05-01T17:02:07.542303Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# name agecity
0John 17Edinburgh
1Sally 33Groningen
2Maria 23Caracas
3Monica 55New York
" ], "text/plain": [ " # name age city\n", " 0 John 17 Edinburgh\n", " 1 Sally 33 Groningen\n", " 2 Maria 23 Caracas\n", " 3 Monica 55 New York" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_names_all = vaex.open('./data/io/sample_names_*.hdf5')\n", "df_names_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, one can use the `open_many` method to pass a list of files to open:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.654127Z", "start_time": "2020-05-01T17:02:07.606145Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# name agecity
0John 17Edinburgh
1Sally 33Groningen
2Maria 23Caracas
3Monica 55New York
" ], "text/plain": [ " # name age city\n", " 0 John 17 Edinburgh\n", " 1 Sally 33 Groningen\n", " 2 Maria 23 Caracas\n", " 3 Monica 55 New York" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_names_all = vaex.open_many(['./data/io/sample_names_1.hdf5', \n", " './data/io/sample_names_2.hdf5'])\n", "df_names_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result will be a single DataFrame object containing all of the data coming from all files.\n", "\n", "The data does not necessarily have to be local. With Vaex you can open a HDF5 file straight from Amazon's S3:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.800021Z", "start_time": "2020-05-01T17:02:07.660798Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x y s
0 1 3 5
1 2 4 6
" ], "text/plain": [ " # x y s\n", " 0 1 3 5\n", " 1 2 4 6" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_s3 = vaex.open('s3://vaex/testing/xys.hdf5?anon=true')\n", "df_from_s3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case the data will be lazily downloaded and cached to the local machine. \"Lazily downloaded\" means that Vaex will only download the portions of the data you really need. For example: imagine that we have a file hosted on S3 that has 100 columns and 1 billion rows. Getting a preview of the DataFrame via `print(df)` for instance will download only the first and last 5 rows. If we than proceed to make calculations or plots with only 5 columns, only the data from those columns will be downloaded and cached to the local machine. \n", "\n", "By default, data that is streamed from S3 is cached at ` $HOME/.vaex/file-cache/s3`, and thus successive access is as fast as native disk access. One can also use the `profile_name` argument to use a specific S3 profile, which will than be passed to `s3fs.core.S3FileSystem`.\n", "\n", "With Vaex one can also read-in parquet files:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:07.820196Z", "start_time": "2020-05-01T17:02:07.802259Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# car color year
0renaultred 1996
1audi black 2005
2toyota blue 2000
" ], "text/plain": [ " # car color year\n", " 0 renault red 1996\n", " 1 audi black 2005\n", " 2 toyota blue 2000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reading a parquet file\n", "df_cars = vaex.open('./data/io/sample_cars.parquet')\n", "df_cars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text based file formats\n", "\n", "Datasets are still commonly stored in text-based file formats such as CSV. Since text-based file formats are not memory-mappable, they have to be read in memory. If the contents of a CSV file fits into the available RAM, one can simply do:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.100638Z", "start_time": "2020-05-01T17:02:07.824271Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
" ], "text/plain": [ " # city team player\n", " 0 Indianopolis Pacers Reggie Miller\n", " 1 Chicago Bulls Michael Jordan\n", " 2 Boston Celtics Larry Bird" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_nba = vaex.from_csv('./data/io/sample_nba_1.csv', copy_index=False)\n", "df_nba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or alternatively:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.121097Z", "start_time": "2020-05-01T17:02:08.103438Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
" ], "text/plain": [ " # city team player\n", " 0 Indianopolis Pacers Reggie Miller\n", " 1 Chicago Bulls Michael Jordan\n", " 2 Boston Celtics Larry Bird" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_nba = vaex.read_csv('./data/io/sample_nba_1.csv', copy_index=False)\n", "df_nba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vaex is using pandas for reading CSV files in the background, so one can pass any arguments to the `vaex.from_csv` or `vaex.read_csv` as one would pass to `pandas.read_csv` and specify for example separators, column names and column types. The `copy_index` parameter specifies if the index column of the pandas DataFrame should be read as a regular column, or left out to save memory. In addition to this, if you specify the `convert=True` argument, the data will be automatically converted to an HDF5 file behind the scenes, thus freeing RAM and allowing you to work with your data in a memory-efficient, out-of-core manner.\n", "\n", "If the CSV file is so large that it can not fit into RAM all at one time, one can convert the data to HDF5 simply by:\n", "\n", "```\n", "df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)\n", "```\n", "\n", "When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files deleted. The size of the individual chunks to be read can be specified via the `chunk_size` argument. Note that this automatic conversion requires free disk space of twice the final HDF5 file size.\n", "\n", "It often happens that the data we need to analyse is spread over multiple CSV files. One can convert them to the HDF5 file format like this:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.190644Z", "start_time": "2020-05-01T17:02:08.127331Z" } }, "outputs": [], "source": [ "list_of_files = ['./data/io/sample_nba_1.csv',\n", " './data/io/sample_nba_2.csv',\n", " './data/io/sample_nba_3.csv',]\n", "\n", "# Convert each CSV file to HDF5\n", "for file in list_of_files:\n", " df_tmp = vaex.from_csv(file, convert=True, copy_index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code block converts in turn each CSV file to the HDF5 format. Note that the conversion will work regardless of the file size of each individual CSV file, provided there is sufficient storage space. \n", "\n", "Working with all of the data is now easy: just open all of the relevant HDF5 files as described above:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.257771Z", "start_time": "2020-05-01T17:02:08.195072Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
3Los Angeles Lakers Kobe Bryant
4Toronto RaptorsVince Carter
5Philadelphia76ers Allen Iverson
6San Antonio Spurs Tim Duncan
" ], "text/plain": [ " # city team player\n", " 0 Indianopolis Pacers Reggie Miller\n", " 1 Chicago Bulls Michael Jordan\n", " 2 Boston Celtics Larry Bird\n", " 3 Los Angeles Lakers Kobe Bryant\n", " 4 Toronto Raptors Vince Carter\n", " 5 Philadelphia 76ers Allen Iverson\n", " 6 San Antonio Spurs Tim Duncan" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = vaex.open('./data/io/sample_nba_*.csv.hdf5')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can than additionally export this combined DataFrame to a single HDF5 file. This should lead to minor performance improvements. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.325184Z", "start_time": "2020-05-01T17:02:08.259960Z" } }, "outputs": [], "source": [ "df.export('./data/io/sample_nba_combined.hdf5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also common the data to be stored in JSON files. To read such data in Vaex one can do:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.344497Z", "start_time": "2020-05-01T17:02:08.327339Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# isle size_sqkm
0Easter Island 163.6
1Fiji 18.333
2Tortuga 178.7
" ], "text/plain": [ " # isle size_sqkm\n", " 0 Easter Island 163.6\n", " 1 Fiji 18.333\n", " 2 Tortuga 178.7" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_isles = vaex.from_json('./data/io/sample_isles.json', orient='table', copy_index=False)\n", "df_isles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a convenience method which simply wraps `pandas.read_json`, so the same arguments and file reading strategy applies. If the data is distributed amongs multiple JSON files, one can apply a similar strategy as in the case of multiple CSV files: read each JSON file with the `vaex.from_json` method, convert it to a HDF5 or Arrow file format. Than use `vaex.open` or `vaex.open_many` methods to open all the converted files as a single DataFrame. \n", "\n", "To learn more about different options of exporting data with Vaex, please read the next section below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In-memory data representations\n", "\n", "One can construct a Vaex DataFrame from a variety of in-memory data representations. Such a common operation is converting a pandas into a Vaex DataFrame. Let us read in a CSV file with pandas and than convert it to a Vaex DataFrame:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.376404Z", "start_time": "2020-05-01T17:02:08.346801Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityteamplayer
0IndianopolisPacersReggie Miller
1ChicagoBullsMichael Jordan
2BostonCelticsLarry Bird
\n", "
" ], "text/plain": [ " city team player\n", "0 Indianopolis Pacers Reggie Miller\n", "1 Chicago Bulls Michael Jordan\n", "2 Boston Celtics Larry Bird" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "pandas_df = pd.read_csv('./data/io/sample_nba_1.csv')\n", "pandas_df" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.394986Z", "start_time": "2020-05-01T17:02:08.379763Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# city team player index
0IndianopolisPacers Reggie Miller 0
1Chicago Bulls Michael Jordan 1
2Boston CelticsLarry Bird 2
" ], "text/plain": [ " # city team player index\n", " 0 Indianopolis Pacers Reggie Miller 0\n", " 1 Chicago Bulls Michael Jordan 1\n", " 2 Boston Celtics Larry Bird 2" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = vaex.from_pandas(df=pandas_df, copy_index=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `copy_index` argument specifies whether the index column of a pandas DataFrame should be imported into the Vaex DataFrame. Converting a pandas into a Vaex DataFrame is particularly useful since pandas can read data from a large variety of file formats. For instance, we can use pandas to read data from a database, and then pass it to Vaex like so:\n", "\n", "```\n", "import vaex\n", "import pandas as pd\n", "import sqlalchemy\n", "\n", "connection_string = 'postgresql://readonly:' + 'my_password' + '@server.company.com:1234/database_name'\n", "engine = sqlalchemy.create_engine(connection_string)\n", "\n", "pandas_df = pd.read_sql_query('SELECT * FROM MYTABLE', con=engine)\n", "df = vaex.from_pandas(pandas_df, copy_index=False)\n", "```\n", "\n", "Another example is using pandas to read in [SAS](https://www.sas.com/en_us/home.html) files:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.439517Z", "start_time": "2020-05-01T17:02:08.400874Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# YEAR Y W R L K
0 1948.01.21399998664855960.243000000715255740.1454000025987625 1.41499996185302730.6119999885559082
1 1949.01.35399997234344480.259999990463256840.218099996447563171.38399994373321530.5590000152587891
2 1950.01.569000005722046 0.277999997138977050.3156999945640564 1.38800001144409180.5730000138282776
3 1951.01.94799995422363280.296999990940094 0.393999993801116941.54999995231628420.5640000104904175
4 1952.02.265000104904175 0.3100000023841858 0.355899989604949951.80200004577636720.5740000009536743
... ... ... ... ... ... ...
271975.018.72100067138672 1.246999979019165 0.230100005865097055.72200012207031259.062000274658203
281976.019.25 1.375 0.3452000021934509 5.76200008392334 8.26200008392334
291977.020.64699935913086 1.5440000295639038 0.450800001621246345.876999855041504 7.473999977111816
301978.022.72599983215332 1.7029999494552612 0.5877000093460083 6.107999801635742 7.104000091552734
311979.023.6189994812011721.7790000438690186 0.534600019454956 6.85200023651123056.874000072479248
" ], "text/plain": [ "# YEAR Y W R L K\n", "0 1948.0 1.2139999866485596 0.24300000071525574 0.1454000025987625 1.4149999618530273 0.6119999885559082\n", "1 1949.0 1.3539999723434448 0.25999999046325684 0.21809999644756317 1.3839999437332153 0.5590000152587891\n", "2 1950.0 1.569000005722046 0.27799999713897705 0.3156999945640564 1.3880000114440918 0.5730000138282776\n", "3 1951.0 1.9479999542236328 0.296999990940094 0.39399999380111694 1.5499999523162842 0.5640000104904175\n", "4 1952.0 2.265000104904175 0.3100000023841858 0.35589998960494995 1.8020000457763672 0.5740000009536743\n", "... ... ... ... ... ... ...\n", "27 1975.0 18.72100067138672 1.246999979019165 0.23010000586509705 5.7220001220703125 9.062000274658203\n", "28 1976.0 19.25 1.375 0.3452000021934509 5.76200008392334 8.26200008392334\n", "29 1977.0 20.64699935913086 1.5440000295639038 0.45080000162124634 5.876999855041504 7.473999977111816\n", "30 1978.0 22.72599983215332 1.7029999494552612 0.5877000093460083 6.107999801635742 7.104000091552734\n", "31 1979.0 23.618999481201172 1.7790000438690186 0.534600019454956 6.8520002365112305 6.874000072479248" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pandas_df = pd.read_sas('./data/io/sample_airline.sas7bdat')\n", "df = vaex.from_pandas(pandas_df, copy_index=False)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can read in an arrow table as a Vaex DataFrame in a similar manner. Let us first use pyarrow to read in a CSV file as an arrow table." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.458982Z", "start_time": "2020-05-01T17:02:08.443294Z" } }, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "city: string\n", "team: string\n", "player: string" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyarrow.csv\n", "\n", "arrow_table = pyarrow.csv.read_csv('./data/io/sample_nba_1.csv')\n", "arrow_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the arrow table, converting it to a DataFrame is simple:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.479598Z", "start_time": "2020-05-01T17:02:08.461126Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
" ], "text/plain": [ " # city team player\n", " 0 Indianopolis Pacers Reggie Miller\n", " 1 Chicago Bulls Michael Jordan\n", " 2 Boston Celtics Larry Bird" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = vaex.from_arrow_table(arrow_table)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also common to construct a Vaex DataFrame from numpy arrays. That can be done like this:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.511635Z", "start_time": "2020-05-01T17:02:08.482173Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x yz
0 0 10dog
1 1 20cat
" ], "text/plain": [ " # x y z\n", " 0 0 10 dog\n", " 1 1 20 cat" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "x = np.arange(2)\n", "y = np.array([10, 20])\n", "z = np.array(['dog', 'cat'])\n", "\n", "\n", "df_numpy = vaex.from_arrays(x=x, y=y, z=z)\n", "df_numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Constructing a DataFrame from a Python dict is also straight-forward:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.532062Z", "start_time": "2020-05-01T17:02:08.514469Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x yz
0 2 30cow
1 3 40horse
" ], "text/plain": [ " # x y z\n", " 0 2 30 cow\n", " 1 3 40 horse" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Construct a DataFrame from Python dictionary\n", "data_dict = dict(x=[2, 3], y=[30, 40], z=['cow', 'horse'])\n", "\n", "df_dict = vaex.from_dict(data_dict)\n", "df_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At times, one may need to create a single row DataFrame. Vaex has a convenience method which takes individual elements (scalars) and creates the DataFrame:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.554837Z", "start_time": "2020-05-01T17:02:08.536162Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x yz
0 4 50mouse
" ], "text/plain": [ " # x y z\n", " 0 4 50 mouse" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_single_row = vaex.from_scalars(x=4, y=50, z='mouse')\n", "df_single_row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can choose to concatenate different DataFrames, without any memory penalties like so:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.590367Z", "start_time": "2020-05-01T17:02:08.557882Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x yz
0 0 10dog
1 1 20cat
2 2 30cow
3 3 40horse
4 4 50mouse
" ], "text/plain": [ " # x y z\n", " 0 0 10 dog\n", " 1 1 20 cat\n", " 2 2 30 cow\n", " 3 3 40 horse\n", " 4 4 50 mouse" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = vaex.concat([df_numpy, df_dict, df_single_row])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data export\n", "\n", "One can export Vaex DataFrames to multiple file or in-memory data representations:\n", "\n", " - Binary file formats:\n", " \n", " - [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format#HDF5)\n", " - [Apache Arrow](https://arrow.apache.org/)\n", " - [Apache Parquet](https://parquet.apache.org/)\n", " - [FITS](https://en.wikipedia.org/wiki/FITS)\n", " \n", " - Text based file formats:\n", " \n", " - [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)\n", " - [ASCII](https://en.wikipedia.org/wiki/Text_file)\n", " \n", " - In-memory data representations:\n", "\n", " - DataFrames:\n", " \n", " - [panads](https://pandas.pydata.org/) DataFrame\n", " - [Apache Arrow](https://arrow.apache.org/) Table\n", " - [numpy](https://numpy.org/) arrays\n", " - [Dask](https://dask.org/) arrays\n", " - Python dictionaries\n", " - Python items list ( a list of ('column_name', data) tuples)\n", "\n", " - Expressions:\n", " \n", " - [panads](https://pandas.pydata.org/) Series\n", " - [numpy](https://numpy.org/) array\n", " - [Dask](https://dask.org/) array\n", " - Python list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binary file formats\n", "\n", "The most efficient way to store data on disk when you work with Vaex is to use binary file formats. Vaex can export a DataFrame to HDF5, Apache Arrow, Apache Parquet and FITS:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.676302Z", "start_time": "2020-05-01T17:02:08.592347Z" } }, "outputs": [], "source": [ "df.export_hdf5('./data/io/output_data.hdf5')\n", "df.export_arrow('./data/io/output_data.arrow')\n", "df.export_parquet('./data/io/output_data.parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, one can simply use:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.726306Z", "start_time": "2020-05-01T17:02:08.682753Z" } }, "outputs": [], "source": [ "df.export('./data/io/output_data.hdf5')\n", "df.export('./data/io/output_data.arrow')\n", "df.export('./data/io/output_data.parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where Vaex will determine the file format of the based on the specified extension of the file name. If the extension is not recognized, an exception will be raised. \n", "\n", "If your data is large, i.e. larger than the available RAM, we recomment exporting to HDF5. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text based file format\n", "\n", "At times, it may be useful to export the data to disk in a text based file format such as CSV. 
In that case one can simply do:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.738484Z", "start_time": "2020-05-01T17:02:08.728709Z" } }, "outputs": [], "source": [ "df.export_csv('./data/io/output_data.csv') # `chunk_size` has a default value of 1_000_000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `df.export_csv` method uses `pandas_df.to_csv` behind the scenes, and thus one can pass any argument to `df.export_csv` as one would to `pandas_df.to_csv`. The data is exported in chunks, and the size of those chunks can be specified by the `chunk_size` argument in `df.export_csv`. In this way, data that is too large to fit in RAM can be saved to disk." ] },
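{ "cell_type": "markdown", "metadata": {}, "source": [ "For instance, a minimal sketch of exporting in smaller chunks (the chunk size here is arbitrary):\n", "\n", "```\n", "df.export_csv('./data/io/output_data.csv', chunk_size=100_000)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In-memory data representations\n", "\n", "Python has a rich ecosystem comprising various libraries for data manipulation, each offering different functionality. Thus, it is often useful to be able to pass data from one library to another. Vaex is able to pass on its data to other libraries via a number of in-memory representations.\n", "\n", "#### DataFrame representations\n", "\n", "A Vaex DataFrame can be converted to a pandas DataFrame like so:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.752230Z", "start_time": "2020-05-01T17:02:08.740444Z" } }, "outputs": [ { "data": { "text/html": [ "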
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
0010dog
1120cat
2230cow
3340horse
4450mouse
\n", "
" ], "text/plain": [ " x y z\n", "0 0 10 dog\n", "1 1 20 cat\n", "2 2 30 cow\n", "3 3 40 horse\n", "4 4 50 mouse" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pandas_df = df.to_pandas_df()\n", "pandas_df # looks the same doesn't it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For DataFrames that are too large to fit in memory, one can specify the `chunk_size` argument, in which case the `to_pandas_df`method returns a generator yileding a pandas DataFrame with as many rows as indicated by the `chunk_size` argument:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.800970Z", "start_time": "2020-05-01T17:02:08.754447Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 3\n", " x y z\n", "0 0 10 dog\n", "1 1 20 cat\n", "2 2 30 cow\n", "\n", "3 5\n", " x y z\n", "0 3 40 horse\n", "1 4 50 mouse\n", "\n" ] } ], "source": [ "gen = df.to_pandas_df(chunk_size=3)\n", "\n", "for i1, i2, chunk in gen:\n", " print(i1, i2)\n", " print(chunk)\n", " print()" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-04-29T10:27:22.706846Z", "start_time": "2020-04-29T10:27:22.702412Z" } }, "source": [ "The generator also yields the row number of the first and the last element of that chunk, so we know exactly where in the parent DataFrame we are. The following DataFrame methods also support the `chunk_size` argument with the same behaviour.\n", "\n", "Converting a Vaex DataFrame into an arrow table is similar:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.811967Z", "start_time": "2020-05-01T17:02:08.804106Z" } }, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "x: int64\n", "y: int64\n", "z: string" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrow_table = df.to_arrow_table()\n", "arrow_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can simply convert the DataFrame to a list of arrays. By default, the data is exposed as a list of numpy arrays:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:08.824342Z", "start_time": "2020-05-01T17:02:08.814367Z" } }, "outputs": [ { "data": { "text/plain": [ "[array([0, 1, 2, 3, 4]),\n", " array([10, 20, 30, 40, 50]),\n", " array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = df.to_arrays()\n", "arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By specifying the `array_type` argument, one can choose whether the data will be represented by numpy arrays, xarrays, or Python lists." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.455293Z", "start_time": "2020-05-01T17:02:08.826695Z" } }, "outputs": [ { "data": { "text/plain": [ "[\n", " array([0, 1, 2, 3, 4])\n", " Dimensions without coordinates: dim_0, \n", " array([10, 20, 30, 40, 50])\n", " Dimensions without coordinates: dim_0, \n", " array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)\n", " Dimensions without coordinates: dim_0]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = df.to_arrays(array_type='xarray')\n", "arrays # list of xarrays" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.466523Z", "start_time": "2020-05-01T17:02:09.457513Z" } }, "outputs": [ { "data": { "text/plain": [ "[[0, 1, 2, 3, 4],\n", " [10, 20, 30, 40, 50],\n", " ['dog', 'cat', 'cow', 'horse', 'mouse']]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = df.to_arrays(array_type='list')\n", "arrays # list of lists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keeping it close to pure Python, one can export a Vaex DataFrame as a dictionary. The same `array_type` keyword argument applies here as well:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.476517Z", "start_time": "2020-05-01T17:02:09.468670Z" } }, "outputs": [ { "data": { "text/plain": [ "{'x': array([0, 1, 2, 3, 4]),\n", " 'y': array([10, 20, 30, 40, 50]),\n", " 'z': array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d_dict = df.to_dict(array_type='numpy')\n", "d_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, one can also convert a DataFrame to a list of tuples, were the first element of the tuple is the column name, while the second element is the array representation of the data." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.495699Z", "start_time": "2020-05-01T17:02:09.482143Z" } }, "outputs": [ { "data": { "text/plain": [ "[('x', [0, 1, 2, 3, 4]),\n", " ('y', [10, 20, 30, 40, 50]),\n", " ('z', ['dog', 'cat', 'cow', 'horse', 'mouse'])]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get a single item list\n", "items = df.to_items(array_type='list')\n", "items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned earlier, with all of the above example, one can use the `chunk_size` argument which creates a generator, yielding a portion of the DataFrame in the specified format. 
In the case of the `.to_dict` method:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.539436Z", "start_time": "2020-05-01T17:02:09.503487Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2 {'x': [0, 1], 'y': [10, 20], 'z': ['dog', 'cat']}\n", "2 4 {'x': [2, 3], 'y': [30, 40], 'z': ['cow', 'horse']}\n", "4 5 {'x': [4], 'y': [50], 'z': ['mouse']}\n" ] } ], "source": [ "gen = df.to_dict(array_type='list', chunk_size=2)\n", "\n", "for i1, i2, chunk in gen:\n", " print(i1, i2, chunk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last but not least, a Vaex DataFrame can be lazily exposed as a Dask array:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.552868Z", "start_time": "2020-05-01T17:02:09.542825Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Array Chunk
Bytes 80 B 80 B
Shape (5, 2) (5, 2)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
\n", "
\n", "\n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " 2\n", " 5\n", "\n", "
" ], "text/plain": [ "dask.array" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_arrays = df[['x', 'y']].to_dask_array() # String support coming soon\n", "dask_arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Expression representations\n", "\n", "A single Vaex Expression can be also converted to a variety of in-memory representations:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.569938Z", "start_time": "2020-05-01T17:02:09.557138Z" } }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pandas Series\n", "x_series = df.x.to_pandas_series()\n", "x_series" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.586785Z", "start_time": "2020-05-01T17:02:09.575392Z" } }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# numpy array\n", "x_numpy = df.x.to_numpy()\n", "x_numpy" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.600977Z", "start_time": "2020-05-01T17:02:09.591062Z" } }, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Python list\n", "x_list = df.x.tolist()\n", "x_list" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2020-05-01T17:02:09.616893Z", "start_time": "2020-05-01T17:02:09.603377Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Array Chunk
Bytes 40 B 40 B
Shape (5,) (5,)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
\n", "
\n", "\n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " 5\n", " 1\n", "\n", "
" ], "text/plain": [ "dask.array" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dask array\n", "x_dask_array = df.x.to_dask_array()\n", "x_dask_array" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }