{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Types\n", "\n", "Vaex is a hybrid DataFrame - it supports both [numpy](https://numpy.org/) and [arrow](https://arrow.apache.org/) data types. This page outlines exactly which data types are supported in Vaex, and which we hope to support in the future. We also provide some tips on how to approach common data structures.\n", "\n", "For some additional insight, you are welcome to \n", "[look at this post](https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-with-vaex-version-4) \n", "as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supported Data Types in Vaex\n", "\n", "In the table below we define:\n", "\n", " - Supported: a column or expression of that type can exist and can be stored in at least one file format;\n", " - Unsupported: a column or expression of that type can currently not live within a Vaex DataFrame, but can supported be added in the future;\n", " - Will not support: This datatype will not be supported in Vaex going forward. \n", " \n", "\n", "\n", "| Framework | Dtype | Supported | Remarks |\n", "|----------- |-------------- |----------- |--------------------------------------------- |\n", "| Python | `int` | yes | Will be converted to a `numpy` array |\n", "| Python | `float` | yes | Will be converted to a `numpy` array |\n", "| Python | `datetime` | not yet | |\n", "| Python | `timedelta` | not yet | |\n", "| Python | `str` | yes | Will be converted to Arrow array |\n", "| `numpy` | `int8` | yes | |\n", "| `numpy` | `int16` | yes | |\n", "| `numpy` | `int32` | yes | |\n", "| `numpy` | `int64` | yes | |\n", "| `numpy` | `float16` | yes | Operations should be upcast to `float64` |\n", "| `numpy` | `float32` | yes | |\n", "| `numpy` | `float64` | yes | |\n", "| `numpy` | `datetime64` | yes | |\n", "| `numpy` | `timedelta64` | yes | |\n", "| `numpy` | `object ('O')` | no | |\n", "| `arrow` | `int8` | yes | |\n", "| `arrow` | `int16` | yes | |\n", "| `arrow` | `int32` | yes | |\n", "| `arrow` | `int64` | yes | |\n", "| `arrow` | `float16` | yes | Operations should be upcast to `float64` |\n", "| `arrow` | `float32` | yes | |\n", "| `arrow` | `float64` | yes | |\n", "| `arrow` | `date32` | yes | |\n", "| `arrow` | `time64` | yes | |\n", "| `arrow` | `time32` | yes | |\n", "| `arrow` | `duration` | yes | |\n", "| `arrow` | `struct` | yes | Can't be exported to HDF5 yet, but possible |\n", "| `arrow` | `dictionary` | yes | |\n", "| `arrow` | `union` | not yet | |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### General advice on data types in Vaex\n", "\n", "Vaex requires that each column or expression be of a single data type, as in the case of databases. \n", "Having a column of different data type can result in a data type `object`, which is not supported, and can also give raise to various problems.\n", "\n", "The general advice is to prepare your data to have a uniform data type per column prior to using Vaex with it. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import vaex\n", "import numpy as np\n", "import pyarrow as pa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Higher dimensional arrays\n", "\n", "Vaex support high dimensional numpy arrays. The one requirement the arrays must have the same shape. Currently DataFrames that contain higher dimensional `numpy` arrays can only be exported to HDF5. We hope that `arrow` will add support for this soon, so we can export such columns to the arrow and parquet formats also.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x
0 'array([[ 1.83097431e+00, -9.90736404e-01, -8.85...
1 'array([[ 1.99466370e+00, 8.92569841e-01, 2.11...
2 'array([[-0.69977757, 0.61319317, 0.01313954, ...
3 'array([[ 0.25304255, -0.84425097, -1.18806199, ...
4 'array([[ 0.31611316, -1.36148251, 1.67342284, ...
... ...
95'array([[-0.60892972, -0.52389881, -0.92493729, ...
96'array([[ 1.10435809, 1.06875633, 1.45812865, ...
97'array([[-0.59311765, 0.10650056, -0.14413671, ...
98'array([[-0.24467361, -0.40743024, 0.6914302 , ...
99'array([[-1.0646038 , 0.53975242, -1.70715565, ...
" ], "text/plain": [ "# x\n", "0 'array([[ 1.83097431e+00, -9.90736404e-01, -8.85...\n", "1 'array([[ 1.99466370e+00, 8.92569841e-01, 2.11...\n", "2 'array([[-0.69977757, 0.61319317, 0.01313954, ...\n", "3 'array([[ 0.25304255, -0.84425097, -1.18806199, ...\n", "4 'array([[ 0.31611316, -1.36148251, 1.67342284, ...\n", "... ...\n", "95 'array([[-0.60892972, -0.52389881, -0.92493729, ...\n", "96 'array([[ 1.10435809, 1.06875633, 1.45812865, ...\n", "97 'array([[-0.59311765, 0.10650056, -0.14413671, ...\n", "98 'array([[-0.24467361, -0.40743024, 0.6914302 , ...\n", "99 'array([[-1.0646038 , 0.53975242, -1.70715565, ..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.random.randn(100, 10, 10)\n", "df = vaex.from_arrays(x=x)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pass a nested list of lists structure to Vaex. This will be converted on the fly to a `numpy` ndarray. As before, the condition is that the resulting ndarray must be regular. \n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x
0array([1, 2])
1array([3, 4])
" ], "text/plain": [ " # x\n", " 0 array([1, 2])\n", " 1 array([3, 4])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = [[1, 2], [3, 4]]\n", "df = vaex.from_arrays(x=x)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we happen to have a non-regular list of lists, i.e. a list of lists where the inner lists are of different lengths, we first need to convert it to an `arrow` array before passing it to vaex:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x
0[1, 2, 3, 4, 5]
1[6, 7]
2[8, 9, 10]
" ], "text/plain": [ " # x\n", " 0 [1, 2, 3, 4, 5]\n", " 1 [6, 7]\n", " 2 [8, 9, 10]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]\n", "x = pa.array(x)\n", "df = vaex.from_arrays(x=x)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the `arrow` structs and lists can not be exported to HDF5 for the time being." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String support in Vaex\n", "\n", "Vaex uses `arrow` under the hood to work with strings. Any strings passed to a Vaex DataFrame will be converted to an `arrow` type. \n", "\n", "For example: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " # x y\n", " 0 This This\n", " 1 is is\n", " 2 a one\n", " 3 string also\n", " 4 column --\n" ] }, { "data": { "text/plain": [ "\n", "[\n", " \"This\",\n", " \"is\",\n", " \"a\",\n", " \"string\",\n", " \"column\"\n", "]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "[\n", " \"This\",\n", " \"is\",\n", " \"one\",\n", " \"also\",\n", " null\n", "]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = ['This', 'is', 'a', 'string', 'column']\n", "y = np.array(['This', 'is', 'one', 'also', None])\n", "\n", "df = vaex.from_arrays(x=x, y=y)\n", "print(df)\n", "\n", "display(df.x.values)\n", "display(df.y.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is useful to know that string operations in Vaex also work on lists of lists of strings (and also on lists of lists of lists of strings, and so on)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Expression = str_lower(x)\n", "Length: 2 dtype: list (expression)\n", "------------------------------------------------\n", "0 ['reggie', 'miller']\n", "1 ['michael', 'jordan']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pa.array([['Reggie', 'Miller'], ['Michael', 'Jordan']])\n", "df = vaex.from_arrays(x=x)\n", "df.x.str.lower()" ] } ], "metadata": { "interpreter": { "hash": "2b337e1aa502f5cea9a92c761ad75d3ab5045107ee3446fdbe7f873d4f1936e7" }, "kernelspec": { "display_name": "Python 3.8.5 64-bit ('base': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }