Frequently Asked Questions¶
I have a massive CSV file that does not fit into memory all at once. How do I convert it to HDF5?¶
Such an operation is a one-liner in Vaex:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
When the above line is executed, Vaex reads the CSV in chunks and converts each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files are deleted. The size of the individual chunks to be read can be specified via the chunk_size argument.
For more information on importing and exporting data with Vaex, please refer to the I/O example page.
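The chunk, convert and concatenate pattern described above can be pictured with a small pure-Python sketch. This is only a toy illustration of the idea and not Vaex's implementation: it writes plain CSV parts instead of HDF5, and the helper names iter_chunks and convert_in_chunks are hypothetical.

```python
import csv
import itertools
import os
import tempfile


def iter_chunks(path, chunk_size):
    """Yield lists of at most chunk_size data rows from a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def convert_in_chunks(src, dst, chunk_size):
    """Write each chunk to a temporary file, concatenate them, then clean up.

    Hypothetical helper: only one chunk of rows is held in memory at a time.
    Returns the number of temporary part files that were created.
    """
    with open(src, newline="") as f:
        header = next(csv.reader(f))

    temp_paths = []
    for chunk in iter_chunks(src, chunk_size):
        fd, tmp_path = tempfile.mkstemp(suffix=".part")
        with os.fdopen(fd, "w", newline="") as tmp:
            csv.writer(tmp).writerows(chunk)
        temp_paths.append(tmp_path)

    with open(dst, "w", newline="") as out:
        csv.writer(out).writerow(header)
        for tmp_path in temp_paths:
            with open(tmp_path, newline="") as tmp:
                out.write(tmp.read())
            os.remove(tmp_path)  # temporary files are deleted after merging
    return len(temp_paths)
```

A smaller chunk_size lowers peak memory use at the cost of creating more temporary files.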
Why can’t I open an HDF5 file that was exported from a pandas DataFrame using .to_hdf?¶
When one uses the pandas .to_hdf method, the output HDF5 file has a row-based format. Vaex, on the other hand, expects column-based HDF5 files. This allows for more efficient reading of individual data columns, which is much more commonly required in data science applications.
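The difference between the two layouts can be illustrated with plain Python containers. This is a conceptual sketch only: it does not reproduce the actual HDF5 structures written by pandas or Vaex, and the sample values are made up.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [(1, "Sally", 27), (2, "Tom", 31), (3, "Maria", 29)]
names = ("id", "name", "age")

# Reading one column from a row-based layout has to visit every record.
ages_from_rows = [record[2] for record in rows]

# Column-based layout: each column is stored as its own sequence,
# so a single column can be read without touching the others.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}
ages_from_columns = columns["age"]
```

Both approaches return the same values; the column-based layout simply avoids reading the id and name fields when only age is needed.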
One can easily export a pandas DataFrame to a vaex-friendly HDF5 file:
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
vaex_df.export_hdf5('my_data.hdf5')
Why can’t I add a new column after filtering a vaex DataFrame?¶
Unlike other libraries, vaex does not copy or modify the data. After a filtering operation, for example:
df2 = df[df.x > 5]
df2 still contains all of the data present in df. The difference is that the columns of df2 are lazily indexed, and only the rows for which the filtering condition is satisfied are displayed or used. This means that in principle one can turn filters on/off as needed.
To be able to manually add a new column to the filtered df2 DataFrame, one needs to use the df2.extract() method first. This will drop the lazy indexing, making the length of df2 equal to its filtered length.
Here is a short example:
[1]:
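import vaex
import numpy as np

# Build a small in-memory DataFrame with four rows.
df = vaex.from_dict({'id': np.array([1, 2, 3, 4]),
                     'name': np.array(['Sally', 'Tom', 'Maria', 'John'])})
# Filtering is lazy: df2 still refers to all of the data in df.
df2 = df[df.id > 2]
# extract() drops the lazy indexing, so df2 now holds only the filtered rows.
df2 = df2.extract()
# The new column needs one value per remaining row (two here).
df2['age'] = np.array([27, 29])
df2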
[1]:
# | id | name | age |
---|---|---|---|
0 | 3 | Maria | 27 |
1 | 4 | John | 29 |
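To picture why a new column cannot simply be attached to a filtered DataFrame, the following toy model may help. It is a rough sketch under stated assumptions, not how vaex is implemented: the class LazyFiltered and its methods are hypothetical, and the filter is modelled as a boolean mask over shared column data.

```python
class LazyFiltered:
    """Toy stand-in for a filtered DataFrame: shared columns plus a row mask."""

    def __init__(self, columns, mask=None):
        self.columns = columns  # shared between instances, never copied by filter()
        n_rows = len(next(iter(columns.values())))
        self.mask = mask if mask is not None else [True] * n_rows

    def filter(self, column, predicate):
        """Return a new view that hides rows failing the predicate."""
        new_mask = [keep and predicate(value)
                    for keep, value in zip(self.mask, self.columns[column])]
        return LazyFiltered(self.columns, new_mask)

    def __len__(self):
        """Number of rows that pass the filter."""
        return sum(self.mask)

    def raw_length(self):
        """Number of rows actually stored, ignoring the filter."""
        return len(self.mask)

    def extract(self):
        """Materialise only the visible rows, dropping the mask."""
        kept = {name: [v for v, keep in zip(values, self.mask) if keep]
                for name, values in self.columns.items()}
        return LazyFiltered(kept)


df = LazyFiltered({"id": [1, 2, 3, 4],
                   "name": ["Sally", "Tom", "Maria", "John"]})
df2 = df.filter("id", lambda value: value > 2)
filtered_len, stored_len = len(df2), df2.raw_length()

df3 = df2.extract()
extracted_stored_len = df3.raw_length()
```

In this model the filtered view shows two rows but still stores four, so a two-element column would not line up with the underlying data. After extract() the stored length matches the filtered length, and a two-element column fits.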