I/O Kung-Fu: get your data in and out of Vaex#
If you want to try out this notebook with a live Python kernel, use mybinder:
Data input#
Every project starts with reading in some data. Vaex supports several data sources:
Binary file formats:
HDF5
Apache Arrow
Apache Parquet
FITS
Text based file formats:
CSV
JSON
In-memory data representations:
pandas DataFrames and everything that pandas can read
Apache Arrow Tables
numpy arrays
Python dictionaries
Single row DataFrames
Cloud support:
Amazon Web Services S3
Google Cloud Storage
Other cloud storage options
Extras:
Aliases
The following examples show the best practices for getting your data into Vaex.
Opening binary file formats#
If your data is already in one of the supported binary file formats (HDF5, Apache Arrow, Apache Parquet, FITS), opening it with Vaex is rather simple:
[1]:
import vaex
# Reading a HDF5 file
df_names = vaex.open('../data/io/sample_names_1.hdf5')
df_names
[1]:
# | name | age | city |
---|---|---|---|
0 | John | 17 | Edinburgh |
1 | Sally | 33 | Groningen |
When opening an HDF5 file, one can specify which group to read:
df_group = vaex.open('my_file_with_groups.hdf5', group='/path/to/my/table')
For a worked example please see the Exporting binary file formats section.
Opening an arrow or a parquet file is just as simple:
[2]:
# Reading an arrow file
df_fruits = vaex.open('../data/io/sample_fruits.arrow')
df_fruits
[2]:
# | fruit | amount | origin |
---|---|---|---|
0 | mango | 5 | Malaya |
1 | banana | 10 | Ecuador |
2 | orange | 7 | Spain |
Opening such data is instantaneous regardless of the file size on disk: Vaex simply memory-maps the data instead of reading it into memory. This is the optimal way of working with datasets that are larger than the available RAM.
If your data is contained within multiple files, one can open them all simultaneously like this:
[3]:
df_names_all = vaex.open('../data/io/sample_names_*.hdf5')
df_names_all
[3]:
# | name | age | city |
---|---|---|---|
0 | John | 17 | Edinburgh |
1 | Sally | 33 | Groningen |
2 | Maria | 23 | Caracas |
3 | Monica | 55 | New York |
Alternatively, one can use the open_many
method to pass a list of files to open:
[4]:
df_names_all = vaex.open_many(['../data/io/sample_names_1.hdf5',
'../data/io/sample_names_2.hdf5'])
df_names_all
[4]:
# | name | age | city |
---|---|---|---|
0 | John | 17 | Edinburgh |
1 | Sally | 33 | Groningen |
2 | Maria | 23 | Caracas |
3 | Monica | 55 | New York |
The result will be a single DataFrame object containing all of the data coming from all files.
[5]:
# Reading a parquet file
df_cars = vaex.open('../data/io/sample_cars.parquet')
df_cars
[5]:
# | car | color | year |
---|---|---|---|
0 | renault | red | 1996 |
1 | audi | black | 2005 |
2 | toyota | blue | 2000 |
Text based file formats#
Datasets are still commonly stored in text-based file formats such as CSV and JSON. Vaex supports various methods for reading such datasets.
New in 4.14:
The vaex.open method can also be used to read a CSV file. With this method Vaex will read the CSV file lazily, i.e. the data from the CSV file will be streamed as computations need to be executed. This is powered by Apache Arrow under the hood. In this way you can work with arbitrarily large CSV files without worrying about RAM!
Note: When opening a CSV file in this way, Vaex will first quickly scan the file to determine some basic metadata such as the number of rows, column names and their data types. The duration of this step depends on the number of rows and columns, your disk read speed, and the infer_schema_fraction argument, which tells Vaex what fraction of the file to read in order to determine the metadata.
One can use the convert argument, for example vaex.open('my_file.csv', convert='my_file.hdf5'), to easily convert a CSV file to HDF5 for faster access, or to a Parquet file for storage considerations.
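As a minimal sketch (the file names here are hypothetical placeholders):
import vaex

# Lazily open the CSV and convert it to HDF5 on the fly;
# subsequent runs can pick up the HDF5 file directly.
df = vaex.open('my_file.csv', convert='my_file.hdf5')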
[6]:
df_nba_lazy = vaex.open('../data/io/sample_nba_1.csv') # Read lazily, not kept in RAM
df_nba_lazy
[6]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
It can be more practical to simply read smaller datasets into memory. This is easily done with:
[7]:
df_nba = vaex.from_csv('../data/io/sample_nba_1.csv', copy_index=False)
df_nba
[7]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
or alternatively:
[8]:
df_nba = vaex.read_csv('../data/io/sample_nba_1.csv', copy_index=False)
df_nba
[8]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
Vaex uses pandas for reading CSV files behind the scenes, so one can pass any argument to vaex.from_csv or vaex.read_csv that one would pass to pandas.read_csv, and specify for example separators, column names and column types. The copy_index parameter specifies whether the index column of the pandas DataFrame should be read as a regular column, or left out to save memory. In addition, if you specify the convert=True argument, the data will be automatically converted to an HDF5 file behind the scenes, thus freeing RAM and allowing you to work with your data in a memory-efficient, out-of-core manner.
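For example, a sketch of passing pandas.read_csv arguments through vaex.from_csv (the separator and column selection below are hypothetical choices):
# Extra keyword arguments are forwarded to pandas.read_csv
df_nba_subset = vaex.from_csv('../data/io/sample_nba_1.csv',
                              sep=',',
                              usecols=['city', 'player'],
                              copy_index=False)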
If the CSV file is so large that it cannot fit into RAM all at once, one can convert the data to HDF5 simply by:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files deleted. The size of the individual chunks to be read can be specified via the chunk_size
argument. Note that this automatic conversion requires free disk space of twice the final HDF5 file size.
It often happens that the data we need to analyse is spread over multiple CSV files. One can convert them to the HDF5 file format like this:
[9]:
list_of_files = ['../data/io/sample_nba_1.csv',
'../data/io/sample_nba_2.csv',
'../data/io/sample_nba_3.csv',]
# Convert each CSV file to HDF5
for file in list_of_files:
    df_tmp = vaex.from_csv(file, convert=True, copy_index=False)
The above code block converts each CSV file in turn to the HDF5 format. Note that the conversion will work regardless of the size of each individual CSV file, provided there is sufficient storage space.
Working with all of the data is now easy: just open all of the relevant HDF5 files as described above:
[10]:
df = vaex.open('../data/io/sample_nba_*.csv.hdf5')
df
[10]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
3 | Los Angeles | Lakers | Kobe Bryant |
4 | Toronto | Raptors | Vince Carter |
5 | Philadelphia | 76ers | Allen Iverson |
6 | San Antonio | Spurs | Tim Duncan |
One can then additionally export this combined DataFrame to a single HDF5 file, which should give a minor performance improvement.
[11]:
df.export('../data/io/sample_nba_combined.hdf5')
Reading larger CSV files via pandas can be slow. Apache Arrow provides a considerably faster way of reading such files, and Vaex conveniently exposes this functionality:
[12]:
df_nba_arrow = vaex.from_csv_arrow('../data/io/sample_nba_1.csv')
df_nba_arrow
[12]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
In fact, Apache Arrow parses CSV files so fast that it can be used to stream the data and effectively enable lazy reading. By passing lazy=True to the method above, one can work with CSV files that are much larger than the available RAM. This is what vaex.open uses under the hood to provide lazy reading of CSV files.
[13]:
df_nba_arrow_lazy = vaex.from_csv_arrow('../data/io/sample_nba_1.csv', lazy=True)
df_nba_arrow_lazy
[13]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
It is also common for data to be stored in JSON files. To read such data into Vaex one can do:
[14]:
df_isles = vaex.from_json('../data/io/sample_isles.json', orient='table', copy_index=False)
df_isles
[14]:
# | isle | size_sqkm |
---|---|---|
0 | Easter Island | 163.6 |
1 | Fiji | 18.333 |
2 | Tortuga | 178.7 |
This is a convenience method which simply wraps pandas.read_json, so the same arguments and file reading strategy apply. If the data is distributed among multiple JSON files, one can apply the same strategy as in the case of multiple CSV files: read each JSON file with the vaex.from_json method and convert it to the HDF5 or Arrow file format, as sketched below. Then use the vaex.open or vaex.open_many methods to open all the converted files as a single DataFrame.
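A minimal sketch of that workflow, assuming a few hypothetical JSON files in the same directory:
import vaex

json_files = ['./my_data/part_1.json', './my_data/part_2.json']  # hypothetical paths

# Convert each JSON file to HDF5
for path in json_files:
    df_tmp = vaex.from_json(path, orient='table', copy_index=False)
    df_tmp.export_hdf5(path.replace('.json', '.hdf5'))

# Open all converted files as a single DataFrame
df_all = vaex.open('./my_data/part_*.hdf5')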
To learn more about different options of exporting data with Vaex, please read the next section below.
Cloud Support#
Vaex supports streaming of HDF5, Apache Arrow, Apache Parquet, and CSV files from Amazon’s S3 and Google Cloud Storage. Here is an example of streaming an HDF5 file directly from S3:
[15]:
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
df.head_and_tail_print(3)
# | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | payment_type | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VTS | 2015-02-27 22:11:38.000000000 | 2015-02-27 22:22:51.000000000 | 5 | 1 | 2.26 | -74.006645 | 40.707497 | 1.0 | 0.0 | -74.0096 | 40.73462 | 10.0 | 0.5 | 0.5 | 2.0 | 0.0 | 13.3 |
1 | VTS | 2015-08-04 00:36:01.000000000 | 2015-08-04 00:47:11.000000000 | 1 | 1 | 5.13 | -74.00747 | 40.705235 | 1.0 | 0.0 | -73.96727 | 40.755196 | 16.0 | 0.5 | 0.5 | 3.46 | 0.0 | 20.76 |
2 | VTS | 2015-01-28 19:56:52.000000000 | 2015-01-28 20:03:27.000000000 | 1 | 2 | 1.89 | -73.97189 | 40.76286 | 1.0 | 0.0 | -73.95513 | 40.78596 | 7.5 | 1.0 | 0.5 | 0.0 | 0.0 | 9.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
299,997 | CMT | 2015-06-18 09:05:52.000000000 | 2015-06-18 09:28:19.000000000 | 1 | 1 | 2.7 | -73.95231 | 40.78091 | 1.0 | 0.0 | -73.97917 | 40.75542 | 15.0 | 0.0 | 0.5 | 1.25 | 0.0 | 17.05 |
299,998 | VTS | 2015-04-17 11:13:46.000000000 | 2015-04-17 11:33:19.000000000 | 1 | 2 | 1.75 | -73.951935 | 40.77804 | 1.0 | 0.0 | -73.9692 | 40.763924 | 13.0 | 0.0 | 0.5 | 0.0 | 0.0 | 13.8 |
299,999 | VTS | 2015-05-29 07:00:45.000000000 | 2015-05-29 07:17:47.000000000 | 5 | 2 | 8.94 | -73.95345 | 40.77932 | 1.0 | 0.0 | -73.86702 | 40.77094 | 26.0 | 0.0 | 0.5 | 0.0 | 5.54 | 32.34 |
One can also use the fs_options argument to specify any options that need to be passed to the underlying file system:
When using Amazon’s S3:
pyarrow.fs.S3FileSystem - If supported by Arrow.
s3fs.core.S3FileSystem - Used for globbing and fallbacks.
When using Google Cloud Storage:
For example:
[16]:
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5', fs_options={'anon': True})
df.head(3)
[16]:
# | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | payment_type | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VTS | 2015-02-27 22:11:38.000000000 | 2015-02-27 22:22:51.000000000 | 5 | 1 | 2.26 | -74.0066 | 40.7075 | 1 | 0 | -74.0096 | 40.7346 | 10 | 0.5 | 0.5 | 2 | 0 | 13.3 |
1 | VTS | 2015-08-04 00:36:01.000000000 | 2015-08-04 00:47:11.000000000 | 1 | 1 | 5.13 | -74.0075 | 40.7052 | 1 | 0 | -73.9673 | 40.7552 | 16 | 0.5 | 0.5 | 3.46 | 0 | 20.76 |
2 | VTS | 2015-01-28 19:56:52.000000000 | 2015-01-28 20:03:27.000000000 | 1 | 2 | 1.89 | -73.9719 | 40.7629 | 1 | 0 | -73.9551 | 40.786 | 7.5 | 1 | 0.5 | 0 | 0 | 9.3 |
When streaming HDF5 files, fs_options also accepts a “cache” option. When it is True, as is the default, Vaex will lazily download and cache the data to the local machine. “Lazily download” means that Vaex will only download the portions of the data you actually need.
For example: imagine that we have a file hosted on S3 that has 100 columns and 1 billion rows. Getting a preview of the DataFrame via print(df)
for instance will download only the first and last 5 rows. If we then proceed to make calculations or plots with only 5 columns, only the data from those columns will be downloaded and cached to the local machine.
By default, the data streamed from S3 and GCS is cached at $HOME/.vaex/file-cache/s3
and $HOME/.vaex/file-cache/gs
respectively, and thus successive access is as fast as native disk access.
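For example, a sketch of working with the cached stream, and of turning the cache off (this assumes the cache option is passed as a boolean inside fs_options):
# Stream the file, caching the downloaded portions locally (the default behaviour)
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5', fs_options={'anon': True, 'cache': True})

# Only the columns touched by a computation are downloaded and cached
df.total_amount.mean()

# Stream the file without caching it on the local machine
df_nocache = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5', fs_options={'anon': True, 'cache': False})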
Streaming Apache Arrow and Apache Parquet files is just as simple. Caching is also available for these file formats, although opening an Apache Arrow file will currently read all of the data, which makes caching less useful in that case. For maximum performance, we always advise using a compute instance in the same region as the bucket.
Here is an example of reading an Apache Arrow file straight from Google Cloud Storage:
df = vaex.open('gs://vaex-data/airlines/us_airline_2019_mini.arrow', fs_options={'anon': True})
df
Apache Parquet files are typically compressed, and are therefore often a better choice for cloud environments, since they tend to keep storage and transfer costs lower. Here is an example of opening a Parquet file from Google Cloud Storage:
df = vaex.open('gs://vaex-data/airlines/us_airline_2019_mini.parquet', fs_options={'anon': True})
df
The following table summarizes the current capabilities of Vaex to read, cache and write different file formats to Amazon S3 and Google Cloud Storage.
Format | Read | Cache | Write |
---|---|---|---|
HDF5 | Yes | Yes | No* |
Arrow | Yes | No* | Yes |
Parquet | Yes | No* | Yes |
FITS | Yes | No* | Yes |
CSV | Yes | ??? | Yes |
No* - this is not available now, but should be possible in the future. Please contact vaex.io for more information.
Other cloud storage options - Minio example#
Minio is an S3-compatible object storage server which can be used instead of AWS’s S3 service. Assuming a Minio setup like this:
$ export DATA_ROOT=/data/tmp
$ mkdir $DATA_ROOT/taxi
$ wget https://github.com/vaexio/vaex-datasets/releases/download/1.1/yellow_taxi_2012_zones.parquet --directory-prefix $DATA_ROOT/taxi/
$ docker run -it --rm -p 9000:9000 --name minio1 -v $DATA_ROOT:/data -e "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" -e "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" minio/minio server /data
This starts a Minio server available at localhost:9000, hosting a bucket called ‘taxi’ that contains one Parquet file. We can now connect to it using Vaex. From the web interface we can get a URL, in this case of the form:
http://localhost:9000/taxi/yellow_taxi_2012_zones.parquet?Content-Disposition=attachment%3B%20filename%3D%22yellow_taxi_2012_zones.parquet%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20210707%2F%2Fs3%2Faws4_request&X-Amz-Date=20210707T085053Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=03e0b6718a95be0fd0d679c4fc52bc26f9ce9f7845877866d5caa709e9b0e12c
This is not the S3 URL you should provide to Vaex (or to Apache Arrow for that matter, which Vaex uses). Instead, the correct URL is of the form s3://bucket/path/to/file.ext. We also need to tell Vaex how to connect to the server by passing the appropriate fs_options:
df = vaex.open('s3://taxi/yellow_taxi_2012_zones.parquet', fs_options=dict(
endpoint_override='localhost:9000',
access_key='AKIAIOSFODNN7EXAMPLE',
secret_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
scheme='http')
)
In-memory data representations#
One can construct a Vaex DataFrame from a variety of in-memory data representations. A common such operation is converting a pandas DataFrame into a Vaex DataFrame. Let us read in a CSV file with pandas and then convert it to a Vaex DataFrame:
[17]:
import pandas as pd
pandas_df = pd.read_csv('../data/io/sample_nba_1.csv')
pandas_df
[17]:
city | team | player | |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
[18]:
df = vaex.from_pandas(df=pandas_df, copy_index=True)
df
[18]:
# | city | team | player | index |
---|---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller | 0 |
1 | Chicago | Bulls | Michael Jordan | 1 |
2 | Boston | Celtics | Larry Bird | 2 |
The copy_index argument specifies whether the index column of the pandas DataFrame should be imported into the Vaex DataFrame. Converting a pandas DataFrame into a Vaex DataFrame is particularly useful, since pandas can read data from a large variety of file formats. For instance, we can use pandas to read data from a database and then pass it to Vaex like so:
import vaex
import pandas as pd
import sqlalchemy
connection_string = 'postgresql://readonly:' + 'my_password' + '@server.company.com:1234/database_name'
engine = sqlalchemy.create_engine(connection_string)
pandas_df = pd.read_sql_query('SELECT * FROM MYTABLE', con=engine)
df = vaex.from_pandas(pandas_df, copy_index=False)
Another example is using pandas to read in SAS files:
[19]:
pandas_df = pd.read_sas('../data/io/sample_airline.sas7bdat')
df = vaex.from_pandas(pandas_df, copy_index=False)
df
[19]:
# | YEAR | Y | W | R | L | K |
---|---|---|---|---|---|---|
0 | 1948.0 | 1.2139999866485596 | 0.24300000071525574 | 0.1454000025987625 | 1.4149999618530273 | 0.6119999885559082 |
1 | 1949.0 | 1.3539999723434448 | 0.25999999046325684 | 0.21809999644756317 | 1.3839999437332153 | 0.5590000152587891 |
2 | 1950.0 | 1.569000005722046 | 0.27799999713897705 | 0.3156999945640564 | 1.3880000114440918 | 0.5730000138282776 |
3 | 1951.0 | 1.9479999542236328 | 0.296999990940094 | 0.39399999380111694 | 1.5499999523162842 | 0.5640000104904175 |
4 | 1952.0 | 2.265000104904175 | 0.3100000023841858 | 0.35589998960494995 | 1.8020000457763672 | 0.5740000009536743 |
... | ... | ... | ... | ... | ... | ... |
27 | 1975.0 | 18.72100067138672 | 1.246999979019165 | 0.23010000586509705 | 5.7220001220703125 | 9.062000274658203 |
28 | 1976.0 | 19.25 | 1.375 | 0.3452000021934509 | 5.76200008392334 | 8.26200008392334 |
29 | 1977.0 | 20.64699935913086 | 1.5440000295639038 | 0.45080000162124634 | 5.876999855041504 | 7.473999977111816 |
30 | 1978.0 | 22.72599983215332 | 1.7029999494552612 | 0.5877000093460083 | 6.107999801635742 | 7.104000091552734 |
31 | 1979.0 | 23.618999481201172 | 1.7790000438690186 | 0.534600019454956 | 6.8520002365112305 | 6.874000072479248 |
One can read in an arrow table as a Vaex DataFrame in a similar manner. Let us first use pyarrow to read a CSV file into an arrow table.
[20]:
import pyarrow.csv
arrow_table = pyarrow.csv.read_csv('../data/io/sample_nba_1.csv')
arrow_table
[20]:
pyarrow.Table
city: string
team: string
player: string
----
city: [["Indianopolis","Chicago","Boston"]]
team: [["Pacers","Bulls","Celtics"]]
player: [["Reggie Miller","Michael Jordan","Larry Bird"]]
Once we have the arrow table, converting it to a DataFrame is simple:
[21]:
df = vaex.from_arrow_table(arrow_table)
df
[21]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
It is also common to construct a Vaex DataFrame from numpy arrays. That can be done like this:
[22]:
import numpy as np
x = np.arange(2)
y = np.array([10, 20])
z = np.array(['dog', 'cat'])
df_numpy = vaex.from_arrays(x=x, y=y, z=z)
df_numpy
[22]:
# | x | y | z |
---|---|---|---|
0 | 0 | 10 | dog |
1 | 1 | 20 | cat |
Constructing a DataFrame from a Python dict is also straightforward:
[23]:
# Construct a DataFrame from Python dictionary
data_dict = dict(x=[2, 3], y=[30, 40], z=['cow', 'horse'])
df_dict = vaex.from_dict(data_dict)
df_dict
[23]:
# | x | y | z |
---|---|---|---|
0 | 2 | 30 | cow |
1 | 3 | 40 | horse |
At times, one may need to create a single-row DataFrame. Vaex has a convenience method which takes individual elements (scalars) and creates the DataFrame:
[24]:
df_single_row = vaex.from_scalars(x=4, y=50, z='mouse')
df_single_row
[24]:
# | x | y | z |
---|---|---|---|
0 | 4 | 50 | mouse |
Finally, we can concatenate different DataFrames without any memory penalty, like so:
[25]:
df = vaex.concat([df_numpy, df_dict, df_single_row])
df
[25]:
# | x | y | z |
---|---|---|---|
0 | 0 | 10 | dog |
1 | 1 | 20 | cat |
2 | 2 | 30 | cow |
3 | 3 | 40 | horse |
4 | 4 | 50 | mouse |
Extras#
Vaex allows you to create aliases for the locations of your most frequently used datasets. They can be local or in the cloud:
[26]:
vaex.aliases['nba'] = '../data/io/sample_nba_1.csv'
vaex.aliases['nyc_taxi_aws'] = 's3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true'
[27]:
df = vaex.open('nba')
df
[27]:
# | city | team | player |
---|---|---|---|
0 | Indianopolis | Pacers | Reggie Miller |
1 | Chicago | Bulls | Michael Jordan |
2 | Boston | Celtics | Larry Bird |
[28]:
df = vaex.open('nyc_taxi_aws')
df.head_and_tail_print(3)
# | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | payment_type | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VTS | 2015-02-27 22:11:38.000000000 | 2015-02-27 22:22:51.000000000 | 5 | 1 | 2.26 | -74.006645 | 40.707497 | 1.0 | 0.0 | -74.0096 | 40.73462 | 10.0 | 0.5 | 0.5 | 2.0 | 0.0 | 13.3 |
1 | VTS | 2015-08-04 00:36:01.000000000 | 2015-08-04 00:47:11.000000000 | 1 | 1 | 5.13 | -74.00747 | 40.705235 | 1.0 | 0.0 | -73.96727 | 40.755196 | 16.0 | 0.5 | 0.5 | 3.46 | 0.0 | 20.76 |
2 | VTS | 2015-01-28 19:56:52.000000000 | 2015-01-28 20:03:27.000000000 | 1 | 2 | 1.89 | -73.97189 | 40.76286 | 1.0 | 0.0 | -73.95513 | 40.78596 | 7.5 | 1.0 | 0.5 | 0.0 | 0.0 | 9.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
299,997 | CMT | 2015-06-18 09:05:52.000000000 | 2015-06-18 09:28:19.000000000 | 1 | 1 | 2.7 | -73.95231 | 40.78091 | 1.0 | 0.0 | -73.97917 | 40.75542 | 15.0 | 0.0 | 0.5 | 1.25 | 0.0 | 17.05 |
299,998 | VTS | 2015-04-17 11:13:46.000000000 | 2015-04-17 11:33:19.000000000 | 1 | 2 | 1.75 | -73.951935 | 40.77804 | 1.0 | 0.0 | -73.9692 | 40.763924 | 13.0 | 0.0 | 0.5 | 0.0 | 0.0 | 13.8 |
299,999 | VTS | 2015-05-29 07:00:45.000000000 | 2015-05-29 07:17:47.000000000 | 5 | 2 | 8.94 | -73.95345 | 40.77932 | 1.0 | 0.0 | -73.86702 | 40.77094 | 26.0 | 0.0 | 0.5 | 0.0 | 5.54 | 32.34 |
Data export#
One can export a Vaex DataFrame to a number of file formats or in-memory data representations:
Binary file formats:
Text based file formats:
In-memory data representations:
Exporting binary file formats#
The most efficient way to store data on disk when you work with Vaex is to use binary file formats. Vaex can export a DataFrame to HDF5, Apache Arrow, Apache Parquet and FITS:
[29]:
df.export_hdf5('../data/io/output_data.hdf5')
df.export_arrow('../data/io/output_data.arrow')
df.export_parquet('../data/io/output_data.parquet')
Alternatively, one can simply use:
[30]:
df.export('../data/io/output_data.hdf5')
df.export('../data/io/output_data.arrow')
df.export('../data/io/output_data.parquet')
where Vaex will determine the output file format based on the extension of the file name. If the extension is not recognized, an exception will be raised.
When exporting to HDF5, you can specify a particular group. Another dataset can also be appended to an existing HDF5 file, but it needs to go into a different group. For example:
[31]:
df1 = vaex.from_arrays(x=[1, 2, 3], y=[-0.5, 0, 0.5])
df2 = vaex.from_arrays(s1=['Apple', 'Orange', 'Peach'], s2=['potato', 'carrot', 'cucumber'])
df1.export_hdf5('../data/io/output_hdf5_file_with_multiple_groups.hdf5', mode='w', group='/numbers')
df2.export_hdf5('../data/io/output_hdf5_file_with_multiple_groups.hdf5', mode='a', group='/food')
As explained in the Opening binary file formats section, this newly created file can be opened by passing the group argument to the vaex.open method:
[32]:
df_food = vaex.open('../data/io/output_hdf5_file_with_multiple_groups.hdf5', group='food')
df_food
[32]:
# | s1 | s2 |
---|---|---|
0 | Apple | potato |
1 | Orange | carrot |
2 | Peach | cucumber |
[33]:
df_nums = vaex.open('../data/io/output_hdf5_file_with_multiple_groups.hdf5', group='numbers')
df_nums
[33]:
# | x | y |
---|---|---|
0 | 1 | -0.5 |
1 | 2 | 0 |
2 | 3 | 0.5 |
When exporting to the Apache Arrow and Apache Parquet file formats, the data is written in chunks, thus enabling the export of data that does not fit in RAM all at once. A custom chunk size can be specified via the chunk_size argument, the default value of which is 1048576. For example:
[34]:
df.export('../data/io/output_data.parquet', chunk_size=10_000)
Vaex supports direct writing to Amazon’s S3 and Google Cloud Storage buckets when exporting the data to Apache Arrow and Apache Parquet file formats. Much like when opening a file, the fs_options
dictionary can be specified to pass arguments to the underlying file system, for example authentication credentials. Here are two examples:
# Export to Google Cloud Storage
df.export_arrow(to='gs://my-gs-bucket/my_data.arrow', fs_options={'token': my_token})
# Export to Amazon's S3
df.export_parquet(to='s3://my-s3-bucket/my_data.parquet', fs_options={'access_key': my_key, 'secret_key': my_secret_key})
Text based file format#
At times, it may be useful to export the data to disk in a text based file format such as CSV. In that case one can simply do:
[35]:
df.export_csv('../data/io/output_data.csv') # `chunk_size` has a default value of 1_000_000
The df.export_csv method uses pandas_df.to_csv behind the scenes, and thus one can pass any argument to df.export_csv that one would pass to pandas_df.to_csv. The data is exported in chunks, the size of which can be specified via the chunk_size argument of df.export_csv. In this way, data that is too large to fit in RAM can be saved to disk.
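As a sketch, passing a pandas.DataFrame.to_csv argument through df.export_csv (the separator and chunk size below are hypothetical choices):
# Keyword arguments not used by Vaex itself are forwarded to pandas.DataFrame.to_csv
df.export_csv('../data/io/output_data.csv', sep=';', chunk_size=500_000)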
If one needs to export a larger DataFrame to CSV, the Apache Arrow backend provides better performance:
[36]:
df.export_csv_arrow('../data/io/output_data.csv')
Export to multiple files in parallel#
With the export_many method one can export a DataFrame to multiple files of the same type in parallel. This is likely to be more performant when exporting very large DataFrames to the cloud compared to writing a single large Arrow or Parquet file, where each chunk is written in succession. The method also accepts the fs_options dictionary, which makes it particularly convenient when exporting to cloud storage (see the sketch at the end of this section).
[37]:
df.export_many('../data/io/output_chunk-{i:02}.parquet', chunk_size=100_000)
[38]:
!ls ./data/io/output_chunk*.parquet
./data/io/output_chunk-00.parquet ./data/io/output_chunk-02.parquet
./data/io/output_chunk-01.parquet
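For instance, a sketch of exporting the chunks directly to a hypothetical S3 bucket (the bucket name and credentials are placeholders):
df.export_many('s3://my-s3-bucket/output_chunk-{i:02}.parquet',
               chunk_size=100_000,
               fs_options={'access_key': my_key, 'secret_key': my_secret_key})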
In memory data representation#
Python has a rich ecosystem of data manipulation libraries, each offering different functionality, so it is often useful to be able to pass data from one library to another. Vaex can pass its data on to other libraries via a number of in-memory representations.
DataFrame representations#
A Vaex DataFrame can be converted to a pandas DataFrame like so:
[39]:
df = vaex.open('../data/io/sample_simple.hdf5')
pandas_df = df.to_pandas_df()
pandas_df # looks the same doesn't it?
[39]:
x | y | z | |
---|---|---|---|
0 | 0 | 10 | dog |
1 | 1 | 20 | cat |
2 | 2 | 30 | cow |
3 | 3 | 40 | horse |
4 | 4 | 50 | mouse |
For DataFrames that are too large to fit in memory, one can specify the chunk_size argument, in which case the to_pandas_df method returns a generator yielding pandas DataFrames with as many rows as indicated by the chunk_size argument:
[40]:
gen = df.to_pandas_df(chunk_size=3)
for i1, i2, chunk in gen:
    print(i1, i2)
    print(chunk)
    print()
0 3
x y z
0 0 10 dog
1 1 20 cat
2 2 30 cow
3 5
x y z
0 3 40 horse
1 4 50 mouse
The generator also yields the row numbers of the first and last elements of each chunk, so we know exactly where in the parent DataFrame we are. The DataFrame conversion methods shown below also support the chunk_size argument, with the same behaviour.
Converting a Vaex DataFrame into an arrow table is similar:
[41]:
arrow_table = df.to_arrow_table()
arrow_table
[41]:
pyarrow.Table
x: int64
y: int64
z: string
----
x: [[0,1,2,3,4]]
y: [[10,20,30,40,50]]
z: [["dog","cat","cow","horse","mouse"]]
One can simply convert the DataFrame to a list of arrays. By default, the data is exposed as a list of numpy or arrow arrays:
[42]:
arrays = df.to_arrays()
arrays
[42]:
[array([0, 1, 2, 3, 4]),
array([10, 20, 30, 40, 50]),
<pyarrow.lib.StringArray object at 0x7ff265f91c40>
[
"dog",
"cat",
"cow",
"horse",
"mouse"
]]
By specifying the array_type
argument, one can choose whether the data will be represented by numpy arrays, xarrays, or Python lists.
[43]:
arrays = df.to_arrays(array_type='xarray')
arrays # list of xarrays
[43]:
[<xarray.DataArray (dim_0: 5)>
array([0, 1, 2, 3, 4])
Dimensions without coordinates: dim_0,
<xarray.DataArray (dim_0: 5)>
array([10, 20, 30, 40, 50])
Dimensions without coordinates: dim_0,
<xarray.DataArray (dim_0: 5)>
array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)
Dimensions without coordinates: dim_0]
[44]:
arrays = df.to_arrays(array_type='list')
arrays # list of lists
[44]:
[[0, 1, 2, 3, 4],
[10, 20, 30, 40, 50],
['dog', 'cat', 'cow', 'horse', 'mouse']]
Keeping it close to pure Python, one can export a Vaex DataFrame as a dictionary. The same array_type
keyword argument applies here as well:
[45]:
d_dict = df.to_dict(array_type='numpy')
d_dict
[45]:
{'x': array([0, 1, 2, 3, 4]),
'y': array([10, 20, 30, 40, 50]),
'z': array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)}
Alternatively, one can also convert a DataFrame to a list of tuples, where the first element of each tuple is the column name, and the second element is the array representation of the data.
[46]:
# Get a single item list
items = df.to_items(array_type='list')
items
[46]:
[('x', [0, 1, 2, 3, 4]),
('y', [10, 20, 30, 40, 50]),
('z', ['dog', 'cat', 'cow', 'horse', 'mouse'])]
When interacting with various types of APIs, it is common to pass a list of “records”, where a record is a dictionary describing a single row of the DataFrame:
[47]:
records = df.to_records()
records
[47]:
[{'x': 0, 'y': 10, 'z': 'dog'},
{'x': 1, 'y': 20, 'z': 'cat'},
{'x': 2, 'y': 30, 'z': 'cow'},
{'x': 3, 'y': 40, 'z': 'horse'},
{'x': 4, 'y': 50, 'z': 'mouse'}]
As mentioned earlier, with all of the above examples one can use the chunk_size argument, which creates a generator yielding a portion of the DataFrame in the specified format. In the case of the .to_dict method:
[48]:
gen = df.to_dict(array_type='list', chunk_size=2)
for i1, i2, chunk in gen:
    print(i1, i2, chunk)
0 2 {'x': [0, 1], 'y': [10, 20], 'z': ['dog', 'cat']}
2 4 {'x': [2, 3], 'y': [30, 40], 'z': ['cow', 'horse']}
4 5 {'x': [4], 'y': [50], 'z': ['mouse']}
Last but not least, a Vaex DataFrame can be lazily exposed as a Dask array:
[49]:
dask_arrays = df[['x', 'y']].to_dask_array() # String support coming soon
dask_arrays
Expression representations#
A single Vaex Expression can also be converted to a variety of in-memory representations:
[50]:
# pandas Series
x_series = df.x.to_pandas_series()
x_series
[50]:
0 0
1 1
2 2
3 3
4 4
dtype: int64
[51]:
# numpy array
x_numpy = df.x.to_numpy()
x_numpy
[51]:
array([0, 1, 2, 3, 4])
[52]:
# Python list
x_list = df.x.tolist()
x_list
[52]:
[0, 1, 2, 3, 4]
[53]:
# Dask array
x_dask_array = df.x.to_dask_array()
x_dask_array