API documentation for vaex library¶
Quick lists¶
Opening/reading in your data.¶
vaex.open (path[, convert, shuffle, copy_index]) |
Open a DataFrame from file given by path. |
vaex.from_arrow_table (table) |
Creates a vaex DataFrame from an arrow Table. |
vaex.from_arrays (**arrays) |
Create an in memory DataFrame from numpy arrays. |
vaex.from_dict (data) |
Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values |
vaex.from_csv (filename_or_buffer[, copy_index]) |
Shortcut to read a csv file using pandas and convert to a DataFrame directly. |
vaex.from_ascii (path[, seperator, names, …]) |
Create an in memory DataFrame from an ascii file (whitespace separated by default). |
vaex.from_pandas (df[, name, copy_index, …]) |
Create an in memory DataFrame from a pandas DataFrame. |
vaex.from_astropy_table (table) |
Create a vaex DataFrame from an Astropy Table. |
Visualization.¶
vaex.dataframe.DataFrame.plot ([x, y, z, …]) |
Viz data in a 2d histogram/heatmap. |
vaex.dataframe.DataFrame.plot1d ([x, what, …]) |
Viz data in 1d (histograms, running means etc) |
vaex.dataframe.DataFrame.scatter (x, y[, …]) |
Viz (small amounts) of data in 2d using a scatter plot |
vaex.dataframe.DataFrame.plot_widget (x, y[, …]) |
Viz 1d, 2d or 3d in a Jupyter notebook |
vaex.dataframe.DataFrame.healpix_plot ([…]) |
Viz data in 2d using a healpix column. |
Statistics.¶
vaex.dataframe.DataFrame.count ([expression, …]) |
Count the number of non-NaN values (or all, if expression is None or “*”). |
vaex.dataframe.DataFrame.mean (expression[, …]) |
Calculate the mean for expression, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.std (expression[, …]) |
Calculate the standard deviation for the given expression, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.var (expression[, …]) |
Calculate the sample variance for the given expression, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.cov (x[, y, binby, …]) |
Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.correlation (x[, y, …]) |
Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.median_approx (…) |
Calculate the median, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.mode (expression[, …]) |
Calculate/estimate the mode. |
vaex.dataframe.DataFrame.min (expression[, …]) |
Calculate the minimum for given expressions, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.max (expression[, …]) |
Calculate the maximum for given expressions, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.minmax (expression) |
Calculate the minimum and maximum for expressions, possibly on a grid defined by binby. |
vaex.dataframe.DataFrame.mutual_information (x) |
Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby. |
vaex-core¶
Vaex is a library for dealing with larger than memory DataFrames (out of core).
The most important class (datastructure) in vaex is the DataFrame
. A DataFrame is obtained by either opening
the example dataset:
>>> import vaex
>>> df = vaex.example()
Or using open()
to open a file.
>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df3 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")
Or connecting to a remote server:
>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")
A few strong features of vaex are:
- Performance: works with huge tabular data, processes over a billion (> 10⁹) rows/second.
- Expression system / Virtual columns: compute on the fly, without wasting ram.
- Memory efficient: no memory copies when doing filtering/selections/subsets.
- Visualization: directly supported, a one-liner is often enough.
- User friendly API: you will only need to deal with a DataFrame object, and tab completion plus docstrings will help you out (e.g. ds.mean<tab>); it feels very similar to Pandas.
- Very fast statistics on N dimensional grids such as histograms, running mean, heatmaps.
Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.
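The expression system / virtual-column idea mentioned above can be illustrated with a minimal sketch. This is a hypothetical toy class, not vaex's actual implementation: the point is that an expression records the computation instead of materializing a new array, so no RAM is spent until evaluation.

```python
# Toy sketch of a lazy expression (hypothetical class, NOT vaex's
# implementation): operations build up a deferred computation.
import numpy as np

class Expr:
    def __init__(self, fn):
        self.fn = fn  # deferred computation, nothing evaluated yet

    def __add__(self, other):
        return Expr(lambda: self.fn() + other.fn())

    def __pow__(self, p):
        return Expr(lambda: self.fn() ** p)

    def evaluate(self):
        return self.fn()  # only now is a result array allocated

x = np.arange(5)
col_x = Expr(lambda: x)
r = col_x ** 2 + col_x   # no computation happens here
result = r.evaluate()    # materializes the virtual column
```

In vaex itself, `df['r'] = df.x**2 + df.x` behaves analogously: the column is virtual until evaluated.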
-
vaex.
open
(path, convert=False, shuffle=False, copy_index=True, *args, **kwargs)[source]¶ Open a DataFrame from file given by path.
Example:
>>> df = vaex.open('sometable.hdf5')
>>> df = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters: - path (str or list) – local or absolute path to file, or glob string, or list of paths
- convert – convert files to an hdf5 file for optimization, can also be a path
- shuffle (bool) – shuffle converted DataFrame or not
- args – extra arguments for file readers that need it
- kwargs – extra keyword arguments
- copy_index (bool) – copy index when source is read via pandas
Returns: return a DataFrame on success, otherwise None
Return type: DataFrame
S3 support:
Vaex supports streaming in hdf5 files from Amazon AWS object storage S3. Files are by default cached in $HOME/.vaex/file-cache/s3 such that successive access is as fast as native disk access. The following url parameters control S3 options:
- anon: Use anonymous access or not (false by default). (Allowed values are: true,True,1,false,False,0)
- use_cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0)
- profile_name and other arguments are passed to
s3fs.core.S3FileSystem
All arguments can also be passed as kwargs, but then arguments such as anon can only be a boolean, not a string.
Examples:
>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5', anon=True)  # Note that anon is a boolean, not the string 'true'
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5?profile_name=myprofile')
-
vaex.
from_arrays
(**arrays)[source]¶ Create an in memory DataFrame from numpy arrays.
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters: arrays – keyword arguments with arrays Return type: DataFrame
-
vaex.
from_dict
(data)[source]¶ Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values
Example
>>> data = {'A':[1,2,3],'B':['a','b','c']}
>>> vaex.from_dict(data)
  #    A    B
  0    1    'a'
  1    2    'b'
  2    3    'c'
Parameters: data – A dict of {column:[value, value,…]} Return type: DataFrame
-
vaex.
from_items
(*items)[source]¶ Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters: items – list of [(name, numpy array), …] Return type: DataFrame
-
vaex.
from_arrow_table
(table)[source]¶ Creates a vaex DataFrame from an arrow Table.
Return type: DataFrame
-
vaex.
from_csv
(filename_or_buffer, copy_index=True, **kwargs)[source]¶ Shortcut to read a csv file using pandas and convert to a DataFrame directly.
Return type: DataFrame
-
vaex.
from_ascii
(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]¶ Create an in memory DataFrame from an ascii file (whitespace separated by default).
>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters: - path – file path
- seperator – value separator, by default whitespace; use "," for comma separated values.
- names – If True, the first line is used for the column names, otherwise provide a list of strings with names
- skip_lines – skip lines at the start of the file
- skip_after – skip lines at the end of the file
- kwargs –
Return type: DataFrame
-
vaex.
from_pandas
(df, name='pandas', copy_index=True, index_name='index')[source]¶ Create an in memory DataFrame from a pandas DataFrame.
Param: pandas.DataFrame df: Pandas DataFrame
Param: name: unique name for the DataFrame
>>> import vaex, pandas as pd
>>> df_pandas = pd.read_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
Return type: DataFrame
-
vaex.
from_samp
(username=None, password=None)[source]¶ Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the DataFrame.
Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook.
-
vaex.
open_many
(filenames)[source]¶ Open a list of filenames, and return a DataFrame with all DataFrames concatenated.
Parameters: filenames (list[str]) – list of filenames/paths Return type: DataFrame
-
vaex.
register_function
(scope=None, as_property=False, name=None, on_expression=True)[source]¶ Decorator to register a new function with vaex.
If on_expression is True, the function will be available as a method on an Expression, where the first argument will be the expression itself.
Example:
>>> import vaex
>>> df = vaex.example()
>>> @vaex.register_function()
>>> def invert(x):
>>>     return 1/x
>>> df.x.invert()
>>> import numpy as np
>>> df = vaex.from_arrays(departure=np.arange('2015-01-01', '2015-12-05', dtype='datetime64'))
>>> @vaex.register_function(as_property=True, scope='dt')
>>> def dt_relative_day(x):
>>>     return vaex.functions.dt_dayofyear(x)/365.
>>> df.departure.dt.relative_day
-
vaex.
server
(url, **kwargs)[source]¶ Connect to hostname supporting the vaex web api.
Parameters: hostname (str) – hostname or ip address of server
Return vaex.dataframe.ServerRest: returns a server object; note that it does not connect to the server yet, so this will always succeed
Return type: ServerRest
-
vaex.
example
(download=True)[source]¶ Returns an example DataFrame which comes with vaex for testing/learning purposes.
Return type: DataFrame
-
vaex.
app
(*args, **kwargs)[source]¶ Create a vaex app, the QApplication mainloop must be started.
In ipython notebook/jupyter do the following:
>>> import vaex.ui.main  # this causes the qt api level to be set properly
>>> import vaex
Next cell:
>>> %gui qt
Next cell:
>>> app = vaex.app()
From now on, you can run the app along with jupyter
-
vaex.
delayed
(f)[source]¶ Decorator to transparently accept delayed computation.
Example:
>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
>>>                      shape=4, delay=True)
>>> @vaex.delayed
>>> def total_sum(sums):
>>>     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()
See the tutorial for a more complete example: https://docs.vaex.io/en/latest/tutorial.html#Parallel-computations
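The mechanics of the delayed pattern can be sketched in plain Python. This is a conceptual illustration only, not vaex's internal implementation: a decorated function does not run until its delayed inputs are resolved by an explicit execute step (here mimicked by a hypothetical Promise class).

```python
# Conceptual sketch of delayed computation (hypothetical Promise class,
# NOT vaex's internals): the decorated function runs only once its
# input is resolved, like after ds.execute().
class Promise:
    def __init__(self):
        self.value = None
        self._callbacks = []

    def then(self, cb):
        self._callbacks.append(cb)

    def fulfill(self, value):
        self.value = value
        for cb in self._callbacks:
            cb(value)

def delayed(f):
    def wrapper(promise):
        out = Promise()
        promise.then(lambda v: out.fulfill(f(v)))
        return out
    return wrapper

@delayed
def total_sum(sums):
    return sum(sums)

partial = Promise()               # stands in for a delay=True result
result = total_sum(partial)       # nothing computed yet
partial.fulfill([1.0, 2.0, 3.0])  # like ds.execute(): resolve inputs
```

The benefit in vaex is that all delayed statistics are computed in a single pass over the data when execute() runs.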
DataFrame class¶
-
class
vaex.dataframe.
DataFrame
(name, column_names, executor=None)[source]¶ Bases:
object
All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.
Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.
A DataFrame can have multiple ‘selections’, and all calculations are done on the whole DataFrame (default) or on the selection. The following example shows how to use selections.
>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])
-
__delitem__
(item)[source]¶ Removes a (virtual) column from the DataFrame.
Note: this does not check if the column is used in a virtual expression or in the filter and may lead to issues. It is safer to use
drop()
.
-
__getitem__
(item)[source]¶ Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.
Example:
>>> df['Lz']  # the expression 'Lz'
>>> df['Lz/2']  # the expression 'Lz/2'
>>> df[["Lz", "E"]]  # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
-
__init__
(name, column_names, executor=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
__setitem__
(name, value)[source]¶ Convenient way to add a virtual column / expression to this DataFrame.
Example:
>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
-
__weakref__
¶ list of weak references to the object (if defined)
-
add_variable
(name, expression, overwrite=True, unique=True)[source]¶ Add a variable to a DataFrame.
A variable may refer to other variables, and virtual columns and expressions may refer to variables.
Example
>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
Param: str name: name of variable
Param: expression: expression for the variable
-
add_virtual_column
(name, expression, unique=False)[source]¶ Add a virtual column to the DataFrame.
Example:
>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)") >>> df.select("r < 10")
Param: str name: name of virtual column
Param: expression: expression for the column
Parameters: unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2
-
apply
(f, arguments=None, dtype=None, delay=False, vectorize=False)[source]¶ Apply a function on a per row basis across the entire DataFrame.
Example:
>>> import vaex
>>> df = vaex.example()
>>> def func(x, y):
...     return (x+y)/(x-y)
...
>>> df.apply(func, arguments=[df.x, df.y])
Expression = lambda_function(x, y)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
0  -0.460789
1    3.90038
2  -0.642851
3   0.685768
4   -0.543357
Parameters: - f – The function to be applied
- arguments – List of arguments to be passed on to the function f.
Returns: A function that is lazily evaluated.
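Note that the applied function receives whole columns, not single values: vaex passes arrays to f, so numpy operations broadcast over them. A minimal numpy-only sketch of the computation in the example above (illustrative, outside of vaex):

```python
# The same function as in the apply() example above, applied to plain
# numpy arrays: numpy broadcasts it over whole columns at once.
import numpy as np

def func(x, y):
    return (x + y) / (x - y)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 1.0, 2.0])
out = func(x, y)  # element-wise result for each row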
-
byte_size
(selection=False, virtual=False)[source]¶ Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.
-
cat
(i1, i2, format='html')[source]¶ Display the DataFrame from row i1 till i2
For format, see https://pypi.org/project/tabulate/
Parameters:
-
close_files
()[source]¶ Close any possible open file handles, the DataFrame will not be in a usable state afterwards.
-
col
¶ Gives direct access to the columns only (useful for tab completion).
Convenient when working with ipython in combination with small DataFrames, since this gives tab-completion.
Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.
Example
>>> df = vaex.example()
>>> df.plot(df.col.x, df.col.y)
-
combinations
(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]¶ Generate a list of combinations for the possible expressions for the given dimension.
Parameters: - expressions_list – list of list of expressions, where the inner list defines the subspace
- dimension – if given, generates a subspace with all possible combinations for that dimension
- exclude – list of expressions to exclude
-
correlation
(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None)[source]¶ Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.
Example:
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
Parameters: - x – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- y – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
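The statistic computed here, cov[x,y]/(std[x]*std[y]), can be checked against plain numpy on in-memory data. This is an illustrative sketch (population statistics, ddof=0), not vaex's out-of-core implementation:

```python
# Verify the correlation formula cov[x,y]/(std[x]*std[y]) against
# numpy's corrcoef on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
corr = cov_xy / (x.std() * y.std())                # the statistic above
```

vaex evaluates the same quantity per bin when binby is given, without loading the full column into memory.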
-
count
(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]¶ Count the number of non-NaN values (or all, if expression is None or “*”).
Example:
>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters: - expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- edges – Currently for internal use only (it includes nan’s and values outside the limits at the borders: nan at index 0, values smaller than the limits at index 1, and larger at index -1)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
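The binned count of non-NaN values can be sketched with plain numpy: histogram the binby expression while masking out rows where the counted expression is NaN. Illustrative only; vaex does this out of core:

```python
# Sketch of count(expression, binby=..., shape=...): count non-NaN
# values of one expression, binned on another (illustrative only).
import numpy as np

x = np.array([0.5, 1.5, 2.5, 3.5, 1.2])     # binby expression
e = np.array([1.0, np.nan, 2.0, 3.0, 4.0])  # counted expression

valid = ~np.isnan(e)  # count() skips NaN unless expression is None/'*'
counts, _ = np.histogram(x[valid], bins=4, range=(0, 4))
```

With expression=None or "*", the valid mask would be dropped and all rows counted.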
-
cov
(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.
Either x and y are expressions, e.g.:
>>> df.cov("x", "y")
Or only the x argument is given with a list of expressions, e.g.:
>>> df.cov(["x", "y", "z"])
Example:
>>> df.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])
Parameters: - x – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- y – if previous argument is not a list, this argument should be given
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)
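For reference, the symmetric covariance matrix this method returns corresponds to numpy's population covariance on in-memory arrays. An illustrative sketch (not vaex's streaming implementation):

```python
# Sketch of the covariance matrix cov() computes, using numpy's
# population covariance (bias=True, i.e. ddof=0) on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(size=500)
z = rng.normal(size=500)

# rows are variables, columns are observations; result is symmetric
C = np.cov(np.vstack([x, y, z]), bias=True)
```

With binby, vaex computes one such matrix per bin, which is why the last dimensions of the returned array are (2, 2) (or (n, n) for n expressions).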
-
covar
(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.
Example:
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")/(df.std("x**2+y**2+z**2") * df.std("-log(-E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters: - x – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- y – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
describe
(strings=True, virtual=True, selection=None)[source]¶ Give a description of the DataFrame.
>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                  x          y          z
dtype       float64    float64    float64
count        330000     330000     330000
missing           0          0          0
mean     -0.0671315 -0.0535899  0.0169582
std         7.31746    7.78605    5.05521
min        -128.294   -71.5524   -44.3342
max         271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                    x          y          z
dtype         float64    float64    float64
count          164060     164060     164060
missing        165940     165940     165940
mean          5.13572  -0.486786 -0.0868073
std           5.18701    7.61621    5.02831
min       1.51635e-05   -71.5524   -44.3342
max           271.366    78.0724    40.2191
Returns: Pandas dataframe
-
drop
(columns, inplace=False, check=True)[source]¶ Drop columns (or a single column).
Parameters: - columns – List of columns or a single column name
- inplace – Make modifications to self or return a new DataFrame
- check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.
-
dropmissing
(column_names=None)[source]¶ Create a shallow copy of a DataFrame, with filtering set using ismissing.
Parameters: column_names – The columns to consider, default: all (real, non-virtual) columns Return type: DataFrame
-
dropna
(column_names=None)[source]¶ Create a shallow copy of a DataFrame, with filtering set using isna.
Parameters: column_names – The columns to consider, default: all (real, non-virtual) columns Return type: DataFrame
-
dropnan
(column_names=None)[source]¶ Create a shallow copy of a DataFrame, with filtering set using isnan.
Parameters: column_names – The columns to consider, default: all (real, non-virtual) columns Return type: DataFrame
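The three drop* methods above differ only in which mask they filter on: dropnan targets float NaN values, dropmissing targets masked/missing values, and dropna combines both. A minimal numpy sketch of the distinction (illustrative only, using a masked array to stand in for missing data):

```python
# Illustrative sketch: NaN values (isnan) vs masked/missing values
# (ismissing) are different things; dropna would remove both.
import numpy as np

x = np.ma.MaskedArray([1.0, np.nan, 3.0, 4.0],
                      mask=[False, False, False, True])

nan_mask = np.isnan(x.filled(0.0)) & ~x.mask  # NaN entries only
missing_mask = x.mask                         # masked entries only

x_dropnan = x[~nan_mask]          # like dropnan: removes the np.nan row
x_dropmissing = x[~missing_mask]  # like dropmissing: removes the masked row
```

In vaex these create shallow copies with a filter set, so no data is copied.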
-
dtype
(expression, internal=False)[source]¶ Return the numpy dtype for the given expression, if not a column, the first row will be evaluated to get the dtype.
-
dtypes
¶ Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).
-
evaluate
(expression, i1=None, i2=None, out=None, selection=None, parallel=True)[source]¶ Evaluate an expression, and return a numpy array with the results for the full column or a part of it.
Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.
To get partial results, use i1 and i2
Parameters: - expression (str) – Name/expression to evaluate
- i1 (int) – Start row index, default is the start (0)
- i2 (int) – End row index, default is the length of the DataFrame
- out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
- selection – selection to apply
Returns:
-
extract
()[source]¶ Return a DataFrame containing only the filtered rows.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).
If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()
Return type: DataFrame
-
fillna
(value, column_names=None, prefix='__original_', inplace=False)[source]¶ Return a DataFrame, where missing values/NaN are filled with ‘value’.
The original columns will be renamed, and by default they will be hidden columns. No data is lost.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filtering will be ignored (since they may change), you may want to consider running
extract()
first.
Example:
>>> import vaex
>>> import numpy as np
>>> x = np.array([3, 1, np.nan, 10, np.nan])
>>> df = vaex.from_arrays(x=x)
>>> df_filled = df.fillna(value=-1, column_names=['x'])
>>> df_filled
  #    x
  0    3
  1    1
  2   -1
  3   10
  4   -1
Parameters: - value (float) – The value to use for filling nan or masked values.
- fill_na (bool) – If True, fill np.nan values with value.
- fill_masked (bool) – If True, fill masked values with values.
- column_names (list) – List of column names in which to fill missing values.
- prefix (str) – The prefix to give the original columns.
- inplace – Make modifications to self or return a new DataFrame
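Per column, the fill operation amounts to replacing NaN (and, in vaex, also masked) entries by the fill value while leaving everything else untouched. A plain numpy sketch of the NaN case (illustrative; vaex does this via a hidden renamed column, so no data is lost):

```python
# Sketch of fillna on a single column: replace NaN entries by a
# fill value, leaving other values untouched (illustrative only).
import numpy as np

x = np.array([3.0, 1.0, np.nan, 10.0, np.nan])
filled = np.where(np.isnan(x), -1.0, x)
```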
-
filter
(expression, mode='and')[source]¶ General version of df[<boolean expression>] to modify the filter applied to the DataFrame.
See
DataFrame.select()
for usage of selection.
Note that using df = df[<boolean expression>], one can only narrow the filter (i.e. fewer rows can be selected). Using the filter method with a different boolean mode (e.g. “or”) one can actually cause more rows to be selected. This differs greatly from numpy and pandas, for instance, which can only narrow a filter.
Example:
>>> import vaex
>>> import numpy as np
>>> x = np.arange(10)
>>> df = vaex.from_arrays(x=x, y=x**2)
>>> df
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
  5    5   25
  6    6   36
  7    7   49
  8    8   64
  9    9   81
>>> dff = df[df.x<=2]
>>> dff
  #    x    y
  0    0    0
  1    1    1
  2    2    4
>>> dff = dff.filter(dff.x >=7, mode="or")
>>> dff
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    7   49
  4    8   64
  5    9   81
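The widening behaviour of mode="or" comes down to how the boolean masks are combined. A plain numpy sketch of the example above (illustrative; vaex tracks these masks lazily rather than materializing them):

```python
# Sketch of how an "or" filter mode widens an existing filter,
# using explicit numpy boolean masks (illustrative only).
import numpy as np

x = np.arange(10)
mask = x <= 2            # like df[df.x <= 2]
mask = mask | (x >= 7)   # like .filter(df.x >= 7, mode="or")
selected = x[mask]       # rows 0-2 plus rows 7-9
```

With the default mode="and", each new condition could only shrink the mask, which is the numpy/pandas behaviour described above.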
-
first
(expression, order_expression, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]¶ Return the first element of a binned expression, where the values in each bin are sorted by order_expression.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.first(df.x, df.y, shape=8)
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
array([-4.81883764, 11.65378  ,  9.70084476, -7.3025589 ,  4.84954977,
        8.47446537, -5.73602629, 10.18783  ])
Parameters: - expression – The value to be placed in the bin.
- order_expression – Order the values in the bins by this expression.
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- edges – Currently for internal use only (it includes nan’s and values outside the limits at the borders: nan at index 0, values smaller than the limits at index 1, and larger at index -1)
Returns: Ndarray containing the first elements.
Return type: numpy.array
-
get_column_names
(virtual=True, strings=True, hidden=False, regex=None)[source]¶ Return a list of column names
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
Parameters: - virtual – If False, skip virtual columns
- hidden – If False, skip hidden columns
- strings – If False, skip string columns
- regex – Only return column names matching the (optional) regular expression
Return type: list of str
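The regex parameter behaves like matching each name against the pattern from the start. A pure-Python sketch of that filtering, using the re module on a plain list of names (illustrative only, not vaex's internals):

```python
# Sketch of regex-based column-name filtering, as in
# get_column_names(regex='x.*') (illustrative only).
import re

names = ['x', 'x2', 'y', 's', 'r']
pattern = 'x.*'
matching = [n for n in names if re.match(pattern, n)]
```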
-
get_current_row
()[source]¶ Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked.
-
get_private_dir
(create=False)[source]¶ Each DataFrame has a directory where files are stored for metadata etc.
Example
>>> import vaex
>>> ds = vaex.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters: create (bool) – if True, it will create the directory if it does not exist
-
get_selection
(name='default')[source]¶ Get the current selection object (mostly for internal use atm).
-
get_variable
(name)[source]¶ Returns the variable given by name, it will not evaluate it.
For evaluation, see
DataFrame.evaluate_variable()
, see alsoDataFrame.set_variable()
-
healpix_count
(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]¶ Count the non-missing values for expression on an array which represents healpix data.
Parameters: - expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
- healpix_expression – {healpix_max_level}
- healpix_max_level – {healpix_max_level}
- healpix_level – {healpix_level}
- binby – {binby}, these dimension follow the first healpix dimension.
- limits – {limits}
- shape – {shape}
- selection – {selection}
- delay – {delay}
- progress – {progress}
Returns:
-
healpix_plot
(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)[source]¶ Viz data in 2d using a healpix column.
Parameters: - healpix_expression – {healpix_max_level}
- healpix_max_level – {healpix_max_level}
- healpix_level – {healpix_level}
- what – {what}
- selection – {selection}
- grid – {grid}
- healpix_input – Specify if the healpix index is in “equatorial”, “galactic” or “ecliptic”.
- healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
- f – function to apply to the data
- colormap – matplotlib colormap
- grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max). (default is [min(f(grid)), max(f(grid))])
- image_size – size for the image that healpy uses for rendering
- nest – If the healpix data is in nested (True) or ring (False)
- figsize – If given, modify the matplotlib figure size. Example (14,9)
- interactive – (Experimental) uses healpy.mollzoom if True
- title – Title of figure
- smooth – apply gaussian smoothing, in degrees
- show – Call matplotlib’s show (True) or not (False, default)
- rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are degrees.
Returns:
-
length_original
()[source]¶ The full length of the DataFrame, independent of the active_fraction or any filtering. This is the real length of the underlying ndarrays.
-
length_unfiltered
()[source]¶ The length of the arrays that should be considered (respecting active range), but without filtering.
-
limits
(expression, value=None, square=False, selection=None, delay=False, shape=None)[source]¶ Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.
If value is a list of the form [minvalue, maxvalue], it is simply returned; this is for convenience when using mixed forms.
Example:
>>> df.limits("x")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- value – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
- selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns: List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list
-
limits_percentage
(expression, percentage=99.73, square=False, delay=False)[source]¶ Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.
The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:
Example:
>>> df.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> df.percentile_approx("x", 5), df.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))
NOTE: this value is approximated by calculating the cumulative distribution on a grid.
NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code.
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- percentage (float) – Value between 0 and 100
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns: List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list
-
materialize
(virtual_column, inplace=False)[source]¶ Returns a new DataFrame where the virtual column is turned into an in memory numpy array.
Example:
>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5  # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)
Parameters: inplace – If True, make modifications to self, otherwise return a new DataFrame
-
max
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the maximum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
mean
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the mean for expression, possibly on a grid defined by binby.
Example:
>>> df.mean("x")
-0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
median_approx
(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False)[source]¶ Calculate the median, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
- percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
min
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the minimum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.min("x")
array(-128.293991)
>>> df.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
minmax
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.
Example:
>>> df.minmax("x")
array([-128.293991,  271.365997])
>>> df.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
       [ -71.5523682,  146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
       [-5.99972439, -2.00002384],
       [-1.99991322,  1.99998057],
       [ 2.0000093 ,  5.99983597],
       [ 6.0004878 ,  9.99984646]])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)
-
mode
(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]¶ Calculate/estimate the mode.
-
mutual_information
(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]¶ Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.
If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.
Example:
>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]), [['E', 'Lz'], ['x', 'z'], ['x', 'y']])
Parameters: - x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – expression or list of expressions, e.g. 'x', or ['x', 'y']
- mi_limits – description for the min and max values for the grid on which the mutual information is estimated, e.g. 'minmax', '99.7%', [0, 10]
- mi_shape – shape for the grid on which the mutual information is estimated; if only an integer is given, it is used for all dimensions
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True
- selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
nbytes
¶ Alias for df.byte_size(), see
DataFrame.byte_size()
.
-
nop
(expression, progress=False, delay=False)[source]¶ Evaluates the expression and drops the result; useful for benchmarking, since vaex is usually lazy
-
percentile_approx
(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False)[source]¶ Calculate the percentile given by percentage, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.
Example:
>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
       [-3.61036641],
       [-0.01296306],
       [ 3.56697863],
       [ 7.45838367]])
Parameters: - expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
- percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
plot
(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, colorbar_label=None, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'column': 'what', 'fade': 'selection', 'layer': 'z', 'row': 'subspace', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)¶ Viz data in a 2d histogram/heatmap.
Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers.
Instead of passing x and y, pass a list as the x argument for multiple panels. Give what a list of options to have multiple panels. When both are present, the panels will be organized in a column/row order.
This method creates a 6 dimensional ‘grid’, where each dimension can map to a visual dimension. The grid dimensions are:
- x: shape determined by shape, content by x argument or the first dimension of each space
- y: shape determined by shape, content by the y argument
- z: related to the z argument
- selection: shape equals length of selection argument
- what: shape equals length of what argument
- space: shape equals length of x argument if multiple values are given
By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)
The visual dimensions are
- x: x coordinate on a plot / image (default maps to grid’s x)
- y: y coordinate on a plot / image (default maps to grid’s y)
- layer: each image in this dimension is blended together into one image (default maps to z)
- fade: each image is shown faded on top of the next image (default maps to selection)
- row: rows of subplots (default maps to space)
- columns: columns of subplots (default maps to what)
All these mappings can be changed by the visual argument; some examples:
>>> df.plot('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])
Will plot each ‘what’ as a column.
>>> df.plot('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))
Will plot each selection as a column, instead of a faded on top of each other.
Parameters: - x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [['x', 'y'], ['x', 'z']]; if multiple pairs are given, this dimension maps to rows by default
- y – Expression to bin in the y direction (by default maps to y)
- z – Expression to bin in the z direction, followed by a :start,end,shape signature, like 'FeH:-3,1:5' will produce 5 layers between -3 and 1 (by default maps to layer)
- what – What to plot, count(*) will show a N-d histogram, mean('x') the mean of the x column, sum('x') the sum, std('x') the standard deviation, correlation('vx', 'vy') the correlation coefficient. Can also be a list of values, like ['count(x)', 'std(vx)'] (by default maps to column)
- reduce –
- f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
- normalize – normalization function, currently only ‘normalize’ is supported
- normalize_axis – which axes to normalize on, None means normalize by the global maximum.
- vmin – instead of automatic normalization, (using normalize and normalization_axis) scale the data between vmin and vmax to [0, 1]
- vmax – see vmin
- shape – shape/size of the n-D histogram grid
- limits – list of [[xmin, xmax], [ymin, ymax]], or a description such as ‘minmax’, ‘99%’
- grid – if the binning is done before by yourself, you can pass it
- colormap – matplotlib colormap to use
- figsize – (x, y) tuple passed to pylab.figure for setting the figure size
- xlabel –
- ylabel –
- aspect –
- tight_layout – call pylab.tight_layout or not
- colorbar – plot a colorbar or not
- interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more
- return_extra –
Returns:
-
plot1d
(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, progress=None, **kwargs)¶ Viz data in 1d (histograms, running means etc)
Example
>>> df.plot1d(df.x)
>>> df.plot1d(df.x, limits=[0, 100], shape=100)
>>> df.plot1d(df.x, what='mean(y)', limits=[0, 100], shape=100)
If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:
>>> counts = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.plot1d(df.x, limits=[0, 100], shape=100, grid=counts, label='mean(y)/100')
Parameters: - x – Expression to bin in the x direction
- what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
- grid – If the binning is done before by yourself, you can pass it
- facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
- limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’
- figsize – (x, y) tuple passed to pylab.figure for setting the figure size
- f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
- n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
- normalize_axis – which axes to normalize on, None means normalize by the global maximum.
- xlabel – String for label on x axis (may contain latex)
- ylabel – Same for y axis
- kwargs – extra argument passed to pylab.plot
- tight_layout – call pylab.tight_layout or not
Returns:
-
plot2d_contour
(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)¶ Plot counting contours on a 2D grid.
Parameters: - x – Expression to bin in the x direction
- y – Expression to bin in the y direction
- what – What to plot, count(*) will show a N-d histogram, mean('x') the mean of the x column, sum('x') the sum, std('x') the standard deviation, correlation('vx', 'vy') the correlation coefficient. Can also be a list of values, like ['count(x)', 'std(vx)']
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
- figsize – (x, y) tuple passed to pylab.figure for setting the figure size
- xlabel – label of the x-axis (defaults to param x)
- ylabel – label of the y-axis (defaults to param y)
- aspect – the aspect ratio of the figure
- levels – the contour levels to be passed on to pylab.contour or pylab.contourf
- colorbar – plot a colorbar or not
- colorbar_label – the label of the colourbar (defaults to param what)
- colormap – matplotlib colormap to pass on to pylab.contour or pylab.contourf
- colors – the colours of the contours
- linewidths – the widths of the contours
- linestyles – the style of the contour lines
- vmin – instead of automatic normalization, scale the data between vmin and vmax
- vmax – see vmin
- grid – if the binning is done before by yourself, you can pass it
- show –
-
plot3d
(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]¶ Use at your own risk; requires ipyvolume
-
plot_bq
(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]¶ Deprecated: use plot_widget
-
plot_widget
(x, y, z=None, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, backend='bqplot', **kwargs)[source]¶ Viz 1d, 2d or 3d in a Jupyter notebook
Note
This API is not fully settled and may change in the future
Example:
>>> df.plot_widget(df.x, df.y, backend='bqplot')
>>> df.plot_widget(df.pickup_longitude, df.pickup_latitude, backend='ipyleaflet')
Parameters: backend – Widget backend to use: ‘bqplot’, ‘ipyleaflet’, ‘ipyvolume’, ‘matplotlib’
-
propagate_uncertainties
(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]¶ Propagates uncertainties (full covariance matrix) for a set of virtual columns.
Covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off diagonals (covariance or correlation) by postfixes with “_correlation” or “_corr” for correlation or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)
Example
>>> import numpy as np
>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
Parameters: - columns – list of columns for which to calculate the covariance matrix.
- depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.
- cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
-
remove_virtual_meta
()[source]¶ Removes the file with the virtual columns etc.; it does not change the current virtual columns.
-
rename_column
(name, new_name, unique=False, store_in_state=True)[source]¶ Renames a column; note that this only changes the in-memory name and will not be reflected on disk
-
sample
(n=None, frac=None, replace=False, weights=None, random_state=None)[source]¶ Returns a DataFrame with a random set of rows
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Provide either n or frac.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
Parameters: - n (int) – number of samples to take (default 1 if frac is None)
- frac (float) – fractional number of samples to take
- replace (bool) – If true, a row may be drawn multiple times
- weights (str or expression) – (unnormalized) probability that a row can be drawn
- random_state (int or RandomState) – seed or RandomState for reproducibility; when None a random seed is chosen
Returns: Returns a new DataFrame with a shallow copy/view of the underlying data
Return type: DataFrame
-
scatter
(x, y, xerr=None, yerr=None, cov=None, corr=None, s_expr=None, c_expr=None, labels=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, ellipse_kwargs={}, **kwargs)¶ Viz (small amounts) of data in 2d using a scatter plot
Convenience wrapper around pylab.scatter for working with small DataFrames or selections
Parameters: - x – Expression for x axis
- y – Idem for y
- s_expr – When given, use it for the s (size) argument of pylab.scatter
- c_expr – When given, use it for the c (color) argument of pylab.scatter
- labels – Annotate the points with these text values
- selection – Single selection expression, or None
- length_limit – maximum number of rows it will plot
- length_check – should we do the maximum row check or not?
- label – label for the legend
- xlabel – label for x axis, if None .label(x) is used
- ylabel – label for y axis, if None .label(y) is used
- errorbar_kwargs – extra dict with arguments passed to plt.errorbar
- kwargs – extra arguments passed to pylab.scatter
Returns:
-
select
(boolean_expression, mode='replace', name='default', executor=None)[source]¶ Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.
Selections are recorded in a history tree, per name, undo/redo can be done for them separately.
Parameters: Returns:
-
select_box
(spaces, limits, mode='replace', name='default')[source]¶ Select a n-dimensional rectangular box bounded by limits.
The following examples are equivalent:
>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters: - spaces – list of expressions
- limits – sequence of shape [(x1, x2), (y1, y2)]
- mode –
- name –
Returns:
-
select_circle
(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]¶ Select a circular region centred on xc, yc, with a radius of r.
Example:
>>> df.select_circle('x','y',2,3,1)
Parameters: - x – expression for the x space
- y – expression for the y space
- xc – location of the centre of the circle in x
- yc – location of the centre of the circle in y
- r – the radius of the circle
- name – name of the selection
- mode –
Returns:
-
select_ellipse
(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]¶ Select an elliptical region centred on xc, yc, with a certain width, height and angle.
Example:
>>> df.select_ellipse('x','y', 2, -1, 5,1, 30, name='my_ellipse')
Parameters: - x – expression for the x space
- y – expression for the y space
- xc – location of the centre of the ellipse in x
- yc – location of the centre of the ellipse in y
- width – the width of the ellipse (diameter)
- height – the height of the ellipse (diameter)
- angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis
- name – name of the selection
- mode –
Returns:
-
select_inverse
(name='default', executor=None)[source]¶ Invert the selection, i.e. what is selected will not be, and vice versa
Parameters: - name (str) –
- executor –
Returns:
-
select_lasso
(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]¶ For performance reasons, a lasso selection is handled differently.
Parameters: Returns:
-
select_non_missing
(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]¶ Create a selection that selects rows having non missing values for all columns in column_names.
The name reflects Pandas, no rows are really dropped, but a mask is kept to keep track of the selection
Parameters: - drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
- drop_masked – drop rows when there is a masked value in any of the columns
- column_names – The columns to consider, default: all (real, non-virtual) columns
- mode (str) – Possible boolean operator: replace/and/or/xor/subtract
- name (str) – history tree or selection ‘slot’ to use
Returns:
-
select_rectangle
(x, y, limits, mode='replace', name='default')[source]¶ Select a 2d rectangular box in the space given by x and y, bounded by limits.
Example:
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters: - x – expression for the x space
- y – expression for the y space
- limits – sequence of shape [(x1, x2), (y1, y2)]
- mode –
-
set_active_fraction
(value)[source]¶ Sets the active_fraction, sets the picked row to None, and removes the selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row
-
set_active_range
(i1, i2)[source]¶ Sets the active range, sets the picked row to None, and removes the selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row
-
set_selection
(selection, name='default', executor=None)[source]¶ Sets the selection object
Parameters: - selection – Selection object
- name – selection ‘slot’
- executor –
Returns:
-
set_variable
(name, expression_or_value, write=True)[source]¶ Set the variable to an expression or value defined by expression_or_value.
Example
>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
Parameters: - name – Name of the variable
- write – write variable to meta file
- expression – value or expression
-
sort
(by, ascending=True, kind='quicksort')[source]¶ Return a sorted DataFrame, sorted by the expression ‘by’
The kind keyword is ignored if doing multi-key sorting.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filtering will be ignored (since filters may change); you may want to consider running
extract()
first.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
Parameters:
-
split
(frac)[source]¶ Returns a list containing ordered subsets of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split(frac=0.3):
...     print(dfs.x.values)
...
[0 1 2]
[3 4 5 6 7 8 9]
>>> for dfs in df.split(frac=[0.2, 0.3, 0.5]):
...     print(dfs.x.values)
[0 1]
[2 3 4]
[5 6 7 8 9]
Parameters: frac (float/list) – If a float, the DataFrame is split in two portions, the first of which has the relative size specified by this parameter. If a list, as many portions are generated as there are elements in the list, where each element defines the relative fraction of that portion.
Returns: A list of DataFrames.
Return type: list
-
split_random
(frac, random_state=None)[source]¶ Returns a list containing random portions of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split_random(frac=0.3, random_state=42):
...     print(dfs.x.values)
...
[8 1 5]
[0 7 2 9 4 3 6]
>>> for dfs in df.split_random(frac=[0.2, 0.3, 0.5], random_state=42):
...     print(dfs.x.values)
[8 1]
[5 0 7]
[2 9 4 3 6]
Parameters: - frac (int/list) – If int will split the DataFrame in two portions, the first of which will have size as specified by this parameter. If list, the generator will generate as many portions as elements in the list, where each element defines the relative fraction of that portion.
- random_state (int) – (default, None) Random number seed for reproducibility.
Returns: A list of DataFrames.
Return type: list
-
state_get
()[source]¶ Return the internal state of the DataFrame in a dictionary
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_get()
{'active_range': [0, 1],
 'column_names': ['x', 'y', 'r'],
 'description': None,
 'descriptions': {},
 'functions': {},
 'renamed_columns': [],
 'selections': {'__filter__': None},
 'ucds': {},
 'units': {},
 'variables': {},
 'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}
-
state_load
(f, use_active_range=False)[source]¶ Load a state previously stored by
DataFrame.state_write()
, see also DataFrame.state_set()
.
-
state_set
(state, use_active_range=False, trusted=True)[source]¶ Sets the internal state of the df
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df
  #    x    y        r
  0    1    2  2.23607
>>> state = df.state_get()
>>> state
{'active_range': [0, 1],
 'column_names': ['x', 'y', 'r'],
 'description': None,
 'descriptions': {},
 'functions': {},
 'renamed_columns': [],
 'selections': {'__filter__': None},
 'ucds': {},
 'units': {},
 'variables': {},
 'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}
>>> df2 = vaex.from_scalars(x=3, y=4)
>>> df2.state_set(state)  # now the virtual columns are 'copied'
>>> df2
  #    x    y    r
  0    3    4    5
Parameters: - state – dict as returned by DataFrame.state_get()
- use_active_range (bool) – Whether to use the active range or not.
-
state_write
(f)[source]¶ Write the internal state to a json or yaml file (see
DataFrame.state_get()
)
Example
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
  "virtual_columns": {
    "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
  },
  "column_names": ["x", "y", "r"],
  "renamed_columns": [],
  "variables": {
    "pi": 3.141592653589793,
    "e": 2.718281828459045,
    "km_in_au": 149597870.7,
    "seconds_per_year": 31557600
  },
  "functions": {},
  "selections": {
    "__filter__": null
  },
  "ucds": {},
  "units": {},
  "descriptions": {},
  "description": null,
  "active_range": [0, 1]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
  __filter__: null
ucds: {}
units: {}
variables:
  pi: 3.141592653589793
  e: 2.718281828459045
  km_in_au: 149597870.7
  seconds_per_year: 31557600
virtual_columns:
  r: (((x ** 2) + (y ** 2)) ** 0.5)
Parameters: f (str) – filename (ending in .json or .yaml)
-
std
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the standard deviation for the given expression, possibly on a grid defined by binby
>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters: - expression – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
sum
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the sum for the given expression, possibly on a grid defined by binby
Example:
>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,   1.40008776e+08])
Parameters: - expression – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
take
(indices, filtered=True, dropfilter=True)[source]¶ Returns a DataFrame containing only rows indexed by indices
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
  #  s      x
  0  a      1
  1  c      3
Parameters: - indices – sequence (list or numpy array) with row numbers
- filtered – (for internal use) The indices refer to the filtered data.
- dropfilter – (for internal use) Drop the filter, set to False when indices refer to unfiltered, but may contain rows that still need to be filtered out.
Returns: DataFrame which is a shallow copy of the original data.
Return type: DataFrame
-
to_arrays
(column_names=None, selection=None, strings=True, virtual=True, parallel=True)[source]¶ Return a list of ndarrays
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: list of ndarrays
-
to_arrow_table
(column_names=None, selection=None, strings=True, virtual=False)[source]¶ Returns an arrow Table object containing the arrays corresponding to the evaluated data
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: pyarrow.Table object
-
to_astropy_table
(column_names=None, selection=None, strings=True, virtual=False, index=None, parallel=True)[source]¶ Returns an astropy Table object containing the ndarrays corresponding to the evaluated data
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
- index – if this column is given it is used for the index of the DataFrame
Returns: astropy.table.Table object
-
to_copy
(column_names=None, selection=None, strings=True, virtual=False, selections=True)[source]¶ Return a copy of the DataFrame; if selection is None, the data is not copied, it just keeps a reference
Parameters: - column_names – list of column names, to copy, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
- selections – copy selections to a new DataFrame
Returns: a shallow copy of the DataFrame
-
to_dask_array
(chunks='auto')[source]¶ Lazily expose the DataFrame as a dask.array
Example
>>> df = vaex.example()
>>> A = df[['x', 'y', 'z']].to_dask_array()
>>> A
dask.array<vaex-df-1f048b40-10ec-11ea-9553, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
>>> A+1
dask.array<add, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
Parameters: chunks – How to chunk the array, similar to dask.array.from_array().
Returns: dask.array.Array object.
-
to_dict
(column_names=None, selection=None, strings=True, virtual=False, parallel=True)[source]¶ Return a dict containing the ndarray corresponding to the evaluated data
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: dict
-
to_items
(column_names=None, selection=None, strings=True, virtual=False, parallel=True)[source]¶ Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: list of (name, ndarray) pairs
-
to_pandas_df
(column_names=None, selection=None, strings=True, virtual=False, index_name=None, parallel=True)[source]¶ Return a pandas DataFrame containing the ndarray corresponding to the evaluated data
If index_name is given, that column is used for the index of the dataframe.
Example
>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
Parameters: - column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- strings – argument passed to DataFrame.get_column_names when column_names is None
- virtual – argument passed to DataFrame.get_column_names when column_names is None
- index_name – if this column is given it is used for the index of the DataFrame
Returns: pandas.DataFrame object
-
trim
(inplace=False)[source]¶ Return a DataFrame, where all columns are ‘trimmed’ by the active range.
For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Parameters: inplace – Make modifications to self or return a new DataFrame
Return type: DataFrame
-
ucd_find
(ucds, exclude=[])[source]¶ Find a set of columns (names) which have the ucd, or part of the ucd.
Prefixed with a ^, it will only match the first part of the ucd.
Example
>>> df.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> df.ucd_find('pos.eq.ra', 'doesnotexist')
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')
'dec'
>>> df.ucd_find('^meta.main')
-
unit
(expression, default=None)[source]¶ Returns the unit (an astropy.units.Unit object) for the expression.
Example
>>> import vaex
>>> df = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
Parameters: - expression – Expression, which can be a column name
- default – if no unit is known, it will return this
Returns: The resulting unit of the expression
Return type: astropy.units.Unit
-
var
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the sample variance for the given expression, possibly on a grid defined by binby
Example:
>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters: - expression – expression or list of expressions, e.g. ‘x’, or [‘x, ‘y’]
- binby – List of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
- shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
- delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
-
DataFrameLocal class¶
-
class
vaex.dataframe.
DataFrameLocal
(name, path, column_names)[source]¶ Bases:
vaex.dataframe.DataFrame
Base class for DataFrames that work with local file/data
-
__array__
(dtype=None, parallel=True)[source]¶ Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.
Note this returns the same result as:
>>> np.array(ds)
If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).
-
__init__
(name, path, column_names)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
binby
(by=None, agg=None)[source]¶ Return a
BinBy
or DataArray
object when agg is not None
The binby operation does not return a ‘flat’ DataFrame; instead it returns an N-d grid in the form of an xarray.
Parameters: agg (dict, list, or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, it will return the binby object.
Returns: DataArray or BinBy object.
-
categorize
(column, labels=None, check=True)[source]¶ Mark column as categorical, with given labels, assuming zero indexing
-
compare
(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]¶ Compare two DataFrames and report their difference, use with care for large DataFrames
-
concat
(other)[source]¶ Concatenates two DataFrames, adding the rows of the other DataFrame to the current, returned in a new DataFrame.
No copy of the data is made.
Parameters: other – The other DataFrame that is concatenated with this DataFrame Returns: New DataFrame with the rows concatenated Return type: DataFrameConcatenated
-
data
¶ Gives direct access to the data as numpy arrays.
Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real (i.e. non-virtual) columns can be accessed; for getting the data from virtual columns, use DataFrame.evaluate(…).
Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.
Example:
>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
-
evaluate
(expression, i1=None, i2=None, out=None, selection=None, filtered=True, internal=None, parallel=True, chunk_size=None)[source]¶ The local implementation of
DataFrame.evaluate()
, will forward the call to DataFrame.evaluate_iterator()
if chunk_size is given
-
evaluate_iterator
(expression, s1=None, s2=None, out=None, selection=None, filtered=True, internal=None, parallel=True, chunk_size=None, prefetch=True)[source]¶ Generator to efficiently evaluate expressions in chunks (number of rows).
See
DataFrame.evaluate()
for other arguments.
Example:
>>> import vaex
>>> df = vaex.example()
>>> for i1, i2, chunk in df.evaluate_iterator(df.x, chunk_size=100_000):
...     print(f"Total of {i1} to {i2} = {chunk.sum()}")
...
Total of 0 to 100000 = -7460.610158279056
Total of 100000 to 200000 = -4964.85827154921
Total of 200000 to 300000 = -7303.271340043915
Total of 300000 to 330000 = -2424.65234724951
Parameters: prefetch – Prefetch/compute the next chunk in parallel while the current value is yielded/returned.
-
export
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a file; the format is determined by the extension of path
Parameters: - df (DataFrameLocal) – DataFrame to export
- path (str) – path for file
- column_names (list[str]) – list of column names to export or None for all columns
- byteorder (str) – = for native, < for little endian and > for big endian (not supported for fits)
- shuffle (bool) – export rows in random order
- selection (bool) – export selection or not
- progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
- sort (str) – expression used for sorting the output
- ascending (bool) – sort ascending (True) or descending
- virtual (bool) – When True, export virtual columns
Returns:
-
export_arrow
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a file written with arrow
Parameters: - df (DataFrameLocal) – DataFrame to export
- path (str) – path for file
- column_names (list[str]) – list of column names to export or None for all columns
- byteorder (str) – = for native, < for little endian and > for big endian
- shuffle (bool) – export rows in random order
- selection (bool) – export selection or not
- progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
- sort (str) – expression used for sorting the output
- ascending (bool) – sort ascending (True) or descending
- virtual (bool) – When True, export virtual columns
Returns:
-
export_fits
(path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format
Parameters: - df (DataFrameLocal) – DataFrame to export
- path (str) – path for file
- column_names (list[str]) – list of column names to export or None for all columns
- shuffle (bool) – export rows in random order
- selection (bool) – export selection or not
- progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
- sort (str) – expression used for sorting the output
- ascending (bool) – sort ascending (True) or descending
- virtual (bool) – When True, export virtual columns
Returns:
-
export_hdf5
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a vaex hdf5 file
Parameters: - df (DataFrameLocal) – DataFrame to export
- path (str) – path for file
- column_names (list[str]) – list of column names to export or None for all columns
- byteorder (str) – = for native, < for little endian and > for big endian
- shuffle (bool) – export rows in random order
- selection (bool) – export selection or not
- progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
- sort (str) – expression used for sorting the output
- ascending (bool) – sort ascending (True) or descending
- virtual (bool) – When True, export virtual columns
Returns:
-
export_parquet
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a parquet file
Parameters: - df (DataFrameLocal) – DataFrame to export
- path (str) – path for file
- column_names (list[str]) – list of column names to export or None for all columns
- byteorder (str) – = for native, < for little endian and > for big endian
- shuffle (bool) – export rows in random order
- selection (bool) – export selection or not
- progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
- sort (str) – expression used for sorting the output
- ascending (bool) – sort ascending (True) or descending
- virtual (bool) – When True, export virtual columns
Returns:
-
groupby
(by=None, agg=None)[source]¶ Return a
GroupBy
or DataFrame
object when agg is not None
Examples:
>>> import vaex
>>> import numpy as np
>>> np.random.seed(42)
>>> x = np.random.randint(1, 5, 10)
>>> y = x**2
>>> df = vaex.from_arrays(x=x, y=y)
>>> df.groupby(df.x, agg='count')
  #    x    y_count
  0    3          4
  1    4          2
  2    1          3
  3    2          1
>>> df.groupby(df.x, agg=[vaex.agg.count('y'), vaex.agg.mean('y')])
  #    x    y_count    y_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
>>> df.groupby(df.x, agg={'z': [vaex.agg.count('y'), vaex.agg.mean('y')]})
  #    x    z_count    z_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
Example using datetime:
>>> import vaex
>>> import numpy as np
>>> t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
>>> y = np.arange(len(t))
>>> df = vaex.from_arrays(t=t, y=y)
>>> df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'y' : 'sum'})
  #  t                      y
  0  2015-01-01 00:00:00   21
  1  2015-01-08 00:00:00   70
  2  2015-01-15 00:00:00  119
  3  2015-01-22 00:00:00  168
  4  2015-01-29 00:00:00   87
Parameters: agg (dict, list, or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, it will return the groupby object.
Returns: DataFrame or GroupBy object.
-
is_local
()[source]¶ The local implementation of
DataFrame.is_local()
, always returns True.
-
join
(other, on=None, left_on=None, right_on=None, lprefix='', rprefix='', lsuffix='', rsuffix='', how='left', allow_duplication=False, inplace=False)[source]¶ Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on
If neither on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).
Note: The filters will be ignored when joining, the full DataFrame will be joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows) consider running
DataFrame.extract()
first.
Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
Parameters: - other – Other DataFrame to join with (the right side)
- on – default key for the left table (self)
- left_on – key for the left table (self), overrides on
- right_on – key for the right table (other), overrides on
- lprefix – prefix to add to the left column names in case of a name collision
- rprefix – similar for the right
- lsuffix – suffix to add to the left column names in case of a name collision
- rsuffix – similar for the right
- how – how to join, ‘left’ keeps all rows on the left, and adds columns (with possible missing values) ‘right’ is similar with self and other swapped. ‘inner’ will only return rows which overlap.
- allow_duplication (bool) – Allow duplication of rows when the joined column contains non-unique values.
- inplace – Make modifications to self or return a new DataFrame
Returns:
-
label_encode
(column, values=None, inplace=False)¶ Deprecated: use is_category
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].
-
length
(selection=False)[source]¶ Get the length of the DataFrame, for the selection or the whole DataFrame.
If selection is False, it returns len(df).
TODO: Implement this in DataFrameRemote, and move the method up in
DataFrame.length()
Parameters: selection – When True, will return the number of selected rows
Returns:
-
ordinal_encode
(column, values=None, inplace=False)[source]¶ Deprecated: use is_category
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].
-
selected_length
(selection='default')[source]¶ The local implementation of
DataFrame.selected_length()
-
Expression class¶
-
class
vaex.expression.
Expression
(ds, expression, ast=None)[source]¶ Bases:
object
Expression class
-
__init__
(ds, expression, ast=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
__weakref__
¶ list of weak references to the object (if defined)
-
abs
(**kwargs)¶ Lazy wrapper around
numpy.abs
-
apply
(f)[source]¶ Apply a function along all values of an Expression.
Example:
>>> df = vaex.example()
>>> df.x
Expression = x
Length: 330,000 dtype: float64 (column)
---------------------------------------
  0  -0.777471
  1   3.77427
  2   1.37576
  3  -7.06738
  4   0.243441
>>> def func(x):
...     return x**2
>>> df.x.apply(func)
Expression = lambda_function(x)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
  0  0.604461
  1  14.2451
  2  1.89272
  3  49.9478
  4  0.0592637
Parameters: f – A function to be applied on the Expression values
Returns: A function that is lazily evaluated when called.
-
arccos
(**kwargs)¶ Lazy wrapper around
numpy.arccos
-
arccosh
(**kwargs)¶ Lazy wrapper around
numpy.arccosh
-
arcsin
(**kwargs)¶ Lazy wrapper around
numpy.arcsin
-
arcsinh
(**kwargs)¶ Lazy wrapper around
numpy.arcsinh
-
arctan
(**kwargs)¶ Lazy wrapper around
numpy.arctan
-
arctan2
(**kwargs)¶ Lazy wrapper around
numpy.arctan2
-
arctanh
(**kwargs)¶ Lazy wrapper around
numpy.arctanh
-
ast
¶ Returns the abstract syntax tree (AST) of the expression
-
clip
(**kwargs)¶ Lazy wrapper around
numpy.clip
-
copy
(df=None)[source]¶ Efficiently copies an expression.
Expression objects have both a string and AST representation. Creating the AST representation involves parsing the expression, which is expensive.
Using copy will deepcopy the AST when the expression was already parsed.
Parameters: df – DataFrame for which the expression will be evaluated (self.df if None)
-
cosh
(**kwargs)¶ Lazy wrapper around
numpy.cosh
-
count
(binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]¶ Shortcut for ds.count(expression, …), see Dataset.count
-
countna
()[source]¶ Returns the number of Not Available (N/A) values in the expression. This includes missing values and np.nan values.
-
deg2rad
(**kwargs)¶ Lazy wrapper around
numpy.deg2rad
-
expand
(stop=[])[source]¶ Expand the expression such that no virtual columns occur, only normal columns.
Example:
>>> df = vaex.example()
>>> r = np.sqrt(df.x**2 + df.y**2)
>>> r.expand().expression
'sqrt(((x ** 2) + (y ** 2)))'
-
expm1
(**kwargs)¶ Lazy wrapper around
numpy.expm1
-
fillmissing
(value)¶ Returns an array where missing values are replaced by value. See ismissing for the definition of missing values.
-
fillna
(value)¶ Returns an array where NA values are replaced by value. See isna for the definition of NA values.
-
fillnan
(value)¶ Returns an array where NaN values are replaced by value. See isnan for the definition of NaN values.
-
format
(format)¶ Uses http://www.cplusplus.com/reference/string/to_string/ for formatting
-
isfinite
(**kwargs)¶ Lazy wrapper around
numpy.isfinite
-
isin
(values)[source]¶ Lazily tests if each value in the expression is present in values.
Parameters: values – List/array of values to check
Returns: Expression with the lazy expression.
-
ismissing
()¶ Returns True where there are missing values (masked arrays), missing strings or None
-
isna
()¶ Returns a boolean expression indicating if the values are Not Available (missing or NaN).
-
isnan
()¶ Returns a boolean expression indicating where the values are NaN.
-
log10
(**kwargs)¶ Lazy wrapper around
numpy.log10
-
log1p
(**kwargs)¶ Lazy wrapper around
numpy.log1p
-
map
(mapper, nan_value=None, missing_value=None, default_value=None, allow_missing=False)[source]¶ Map values of an expression or in memory column according to an input dictionary or a custom callable function.
Example:
>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'red', 'blue', 'red', 'green'])
>>> mapper = {'red': 1, 'blue': 2, 'green': 3}
>>> df['color_mapped'] = df.color.map(mapper)
>>> df
  #  color    color_mapped
  0  red      1
  1  red      1
  2  blue     2
  3  red      1
  4  green    3
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, np.nan])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user', np.nan: 'unknown'})
>>> df
  #  type    role
  0  0       admin
  1  1       maintainer
  2  2       user
  3  2       user
  4  2       user
  5  nan     unknown
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, 4])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user'}, default_value='unknown')
>>> df
  #  type    role
  0  0       admin
  1  1       maintainer
  2  2       user
  3  2       user
  4  2       user
  5  4       unknown
Parameters:
- mapper – dict-like object used to map the values from keys to values
- nan_value – value to be used when a nan is present (and not in the mapper)
- missing_value – value to be used when there is a missing value
- default_value – value to be used when a value is not in the mapper (like dict.get(key, default))
- allow_missing – used to signal that values not in the mapper should map to a masked array with missing values; assumed True when default_value is not None
Returns: A vaex expression
Return type: vaex.expression.Expression
-
masked
¶ Alias to df.is_masked(expression)
-
max
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.max(expression, …), see Dataset.max
-
maximum
(**kwargs)¶ Lazy wrapper around
numpy.maximum
-
mean
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.mean(expression, …), see Dataset.mean
-
min
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.min(expression, …), see Dataset.min
-
minimum
(**kwargs)¶ Lazy wrapper around
numpy.minimum
-
minmax
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.minmax(expression, …), see Dataset.minmax
-
nop
()[source]¶ Evaluates the expression and drops the result; useful for benchmarking, since vaex is usually lazy.
-
notna
()¶ Opposite of isna
-
nunique
(dropna=False, dropnan=False, dropmissing=False, selection=None, delay=False)[source]¶ Counts the number of unique values, i.e. len(df.x.unique()) == df.x.nunique().
Parameters:
- dropmissing – do not count missing values
- dropnan – do not count nan values
- dropna – short for any of the above (see Expression.isna())
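The invariant len(df.x.unique()) == df.x.nunique() and the dropnan flag can be sketched eagerly in plain Python (hypothetical helper, not the vaex implementation):

```python
import math

# Count distinct values, treating all NaNs as one value,
# optionally dropping them entirely.
def nunique(values, dropnan=False):
    seen = set()
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if not dropnan:
                seen.add('nan')  # NaN != NaN, so use a sentinel
        else:
            seen.add(v)
    return len(seen)

data = [0, 1, 2, 2, float('nan'), float('nan')]
```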
-
rad2deg
(**kwargs)¶ Lazy wrapper around
numpy.rad2deg
-
searchsorted
(**kwargs)¶ Lazy wrapper around
numpy.searchsorted
-
sinc
(**kwargs)¶ Lazy wrapper around
numpy.sinc
-
sinh
(**kwargs)¶ Lazy wrapper around
numpy.sinh
-
sqrt
(**kwargs)¶ Lazy wrapper around
numpy.sqrt
-
std
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.std(expression, …), see Dataset.std
-
str
¶ Gives access to string operations via
StringOperations
-
str_pandas
¶ Gives access to string operations via
StringOperationsPandas
(using Pandas Series)
-
sum
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.sum(expression, …), see Dataset.sum
-
tanh
(**kwargs)¶ Lazy wrapper around
numpy.tanh
-
to_pandas_series
()[source]¶ Return a pandas.Series representation of the expression.
Note: Pandas is likely to make a memory copy of the data.
-
transient
¶ If this expression is not transient (e.g. on disk) optimizations can be made
-
unique
(dropna=False, dropnan=False, dropmissing=False, selection=None, delay=False)[source]¶ Returns all unique values.
Parameters:
- dropmissing – do not include missing values
- dropnan – do not include nan values
- dropna – short for any of the above (see Expression.isna())
-
value_counts
(dropna=False, dropnan=False, dropmissing=False, ascending=False, progress=False)[source]¶ Computes counts of unique values.
WARNING:
- If the expression/column is not categorical, it will be converted on the fly
- dropna is False by default; it is True by default in pandas
Parameters:
- dropna – when True, it will not report the NA values (see Expression.isna())
- dropnan – when True, it will not report the nans (see Expression.isnan())
- dropmissing – when True, it will not report the missing values (see Expression.ismissing())
- ascending – when False (default) it will report the most frequently occurring item first
Returns: Pandas series containing the counts
-
var
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.var(expression, …), see Dataset.var
-
variables
(ourself=False, expand_virtual=True, include_virtual=True)[source]¶ Return a set of variables this expression depends on.
Example:
>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
>>> r.variables()
{'x', 'y'}
-
where
(**kwargs)¶ Lazy wrapper around
numpy.where
-
Aggregation and statistics¶
-
class
vaex.agg.
AggregatorDescriptorMean
(name, expression, short_name='mean', selection=None)[source]¶
-
class
vaex.agg.
AggregatorDescriptorMulti
(name, expression, short_name, selection=None)[source]¶ Bases:
vaex.agg.AggregatorDescriptor
Uses multiple operations/aggregations to calculate the final aggregation
-
class
vaex.agg.
AggregatorDescriptorStd
(name, expression, short_name='var', ddof=0, selection=None)[source]¶
-
class
vaex.agg.
AggregatorDescriptorVar
(name, expression, short_name='var', ddof=0, selection=None)[source]¶
-
vaex.agg.
nunique
(expression, dropna=False, dropnan=False, dropmissing=False, selection=None)[source]¶ Aggregator that calculates the number of unique items per bin.
Parameters:
- expression – Expression for which to calculate the unique items
- dropmissing – do not count missing values
- dropnan – do not count nan values
- dropna – short for any of the above (see Expression.isna())
Extensions¶
String operations¶
-
class
vaex.expression.
StringOperations
(expression)[source]¶ Bases:
object
String operations.
Usually accessed using e.g. df.name.str.lower()
-
__weakref__
¶ list of weak references to the object (if defined)
-
byte_length
()¶ Returns the number of bytes in a string sample.
Returns: an expression containing the number of bytes in each sample of a string column. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.byte_length()
Expression = str_byte_length(text)
Length: 5 dtype: int64 (expression)
-----------------------------------
0   9
1  11
2   9
3   3
4   4
-
capitalize
()¶ Capitalize the first letter of a string sample.
Returns: an expression containing the capitalized strings. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.capitalize()
Expression = str_capitalize(text)
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1  Very pretty
2  Is coming
3  Our
4  Way.
-
cat
(other)¶ Concatenate two string columns on a row-by-row basis.
Parameters: other (expression) – The expression of the other column to be concatenated. Returns: an expression containing the concatenated columns. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.cat(df.text)
Expression = str_cat(text, text)
Length: 5 dtype: str (expression)
---------------------------------
0  SomethingSomething
1  very prettyvery pretty
2  is comingis coming
3  ourour
4  way.way.
-
center
(width, fillchar=' ')¶ Fills the left and right side of the strings with additional characters, such that the sample has a total of width characters.
Parameters:
- width (int) – The desired total width of the resulting string
- fillchar (str) – The character used for filling
Returns: an expression containing the filled strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.center(width=11, fillchar='!')
Expression = str_center(text, width=11, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0  !Something!
1  very pretty
2  !is coming!
3  !!!!our!!!!
4  !!!!way.!!!
-
contains
(pattern, regex=True)¶ Check if a string pattern or regex is contained within a sample of a string column.
Parameters:
- pattern (str) – A string pattern or a regex
- regex (bool) – If True, treat the pattern as a regular expression
Returns: an expression which evaluates to True if the pattern is found in a given sample, and False otherwise.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.contains('very')
Expression = str_contains(text, 'very')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  True
2  False
3  False
4  False
-
count
(pat, regex=False)¶ Count the occurrences of a pattern in each sample of a string column.
Parameters:
- pat (str) – A string pattern or a regex
- regex (bool) – If True, treat the pattern as a regular expression
Returns: an expression containing the number of times a pattern is found in each sample.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.count(pat="et", regex=False)
Expression = str_count(text, pat='et', regex=False)
Length: 5 dtype: int64 (expression)
-----------------------------------
0  1
1  1
2  0
3  0
4  0
-
endswith
(pat)¶ Check if the end of each string sample matches the specified pattern.
Parameters: pat (str) – A string pattern or a regex Returns: an expression evaluated to True if the pattern is found at the end of a given sample, False otherwise. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.endswith(pat="ing")
Expression = str_endswith(text, pat='ing')
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  False
2  True
3  False
4  False
-
equals
(y)¶ Tests if strings x and y are the same
Returns: a boolean expression Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.equals(df.text)
Expression = str_equals(text, text)
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  True
2  True
3  True
4  True
>>> df.text.str.equals('our')
Expression = str_equals(text, 'our')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  True
4  False
-
find
(sub, start=0, end=None)¶ Returns the lowest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned.
Parameters:
- sub (str) – The substring to search for
- start (int) – The start index within each sample
- end (int) – The end index within each sample
Returns: an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.find(sub="et")
Expression = str_find(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
-
get
(i)¶ Extract a character from each sample at the specified position from a string column. Note that if the specified position is out of bounds of the string sample, this method returns '', while pandas returns nan.
Parameters: i (int) – The index location, at which to extract the character. Returns: an expression containing the extracted characters. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.get(5)
Expression = str_get(text, 5)
Length: 5 dtype: str (expression)
---------------------------------
0  h
1  p
2  m
3
4
-
index
(sub, start=0, end=None)¶ Returns the lowest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned. It is the same as str.find.
Parameters:
- sub (str) – The substring to search for
- start (int) – The start index within each sample
- end (int) – The end index within each sample
Returns: an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.index(sub="et")
Expression = str_find(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
-
isalnum
()¶ Check if all characters in a string sample are alphanumeric.
Returns: an expression evaluated to True if a sample contains only alphanumeric characters, otherwise False. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isalnum()
Expression = str_isalnum(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  False
2  False
3  True
4  False
-
isalpha
()¶ Check if all characters in a string sample are alphabetic.
Returns: an expression evaluated to True if a sample contains only alphabetic characters, otherwise False. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isalpha()
Expression = str_isalpha(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  False
2  False
3  True
4  False
-
isdigit
()¶ Check if all characters in a string sample are digits.
Returns: an expression evaluated to True if a sample contains only digits, otherwise False. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', '6']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  6
>>> df.text.str.isdigit()
Expression = str_isdigit(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  False
4  True
-
islower
()¶ Check if all characters in a string sample are lowercase characters.
Returns: an expression evaluated to True if a sample contains only lowercase characters, otherwise False. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.islower()
Expression = str_islower(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  True
2  True
3  True
4  True
-
isspace
()¶ Check if all characters in a string sample are whitespaces.
Returns: an expression evaluated to True if a sample contains only whitespaces, otherwise False. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', ' ', ' ']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3
  4
>>> df.text.str.isspace()
Expression = str_isspace(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  True
4  True
-
isupper
()¶ Check if all characters in a string sample are uppercase characters.
Returns: an expression evaluated to True if a sample contains only uppercase characters, otherwise False. Example:
>>> import vaex
>>> text = ['SOMETHING', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  SOMETHING
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isupper()
Expression = str_isupper(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  False
2  False
3  False
4  False
-
join
(sep)¶ Joins a list of strings with the given separator (unlike pandas, it does not raise a ValueError)
-
len
()¶ Returns the length of a string sample.
Returns: an expression containing the length of each sample of a string column. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.len()
Expression = str_len(text)
Length: 5 dtype: int64 (expression)
-----------------------------------
0   9
1  11
2   9
3   3
4   4
-
ljust
(width, fillchar=' ')¶ Fills the right side of string samples with a specified character such that the strings are left-justified.
Parameters:
- width (int) – The desired total width of the resulting string
- fillchar (str) – The character used for filling
Returns: an expression containing the filled strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.ljust(width=10, fillchar='!')
Expression = str_ljust(text, width=10, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0  Something!
1  very pretty
2  is coming!
3  our!!!!!!!
4  way.!!!!!!
-
lower
()¶ Converts string samples to lower case.
Returns: an expression containing the converted strings. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.lower()
Expression = str_lower(text)
Length: 5 dtype: str (expression)
---------------------------------
0  something
1  very pretty
2  is coming
3  our
4  way.
-
lstrip
(to_strip=None)¶ Remove leading characters from a string sample.
Parameters: to_strip (str) – The string to be removed Returns: an expression containing the modified string column. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.lstrip(to_strip='very ')
Expression = str_lstrip(text, to_strip='very ')
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1  pretty
2  is coming
3  our
4  way.
-
match
(pattern)¶ Check if a string sample matches a given regular expression.
Parameters: pattern (str) – a string or regex to match to a string sample. Returns: an expression which is evaluated to True if a match is found, False otherwise. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.match(pattern='our')
Expression = str_match(text, pattern='our')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  True
4  False
-
pad
(width, side='left', fillchar=' ')¶ Pad strings in a given column.
Parameters:
- width (int) – The desired total width of the resulting string
- side (str) – On which side to pad: 'left', 'right' or 'both'
- fillchar (str) – The character used for padding
Returns: an expression containing the padded strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.pad(width=10, side='left', fillchar='!')
Expression = str_pad(text, width=10, side='left', fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0  !Something
1  very pretty
2  !is coming
3  !!!!!!!our
4  !!!!!!way.
-
repeat
(repeats)¶ Duplicate each string in a column.
Parameters: repeats (int) – number of times each string sample is to be duplicated. Returns: an expression containing the duplicated strings Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.repeat(3)
Expression = str_repeat(text, 3)
Length: 5 dtype: str (expression)
---------------------------------
0  SomethingSomethingSomething
1  very prettyvery prettyvery pretty
2  is comingis comingis coming
3  ourourour
4  way.way.way.
-
replace
(pat, repl, n=-1, flags=0, regex=False)¶ Replace occurrences of a pattern/regex in a column with some other string.
Parameters:
- pat (str) – A string pattern or a regex
- repl (str) – The replacement string
- n (int) – Number of replacements to make from the start; -1 means replace all
- flags (int) – Regex module flags
- regex (bool) – If True, treat the pattern as a regular expression
Returns: an expression containing the string replacements.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.replace(pat='et', repl='__')
Expression = str_replace(text, pat='et', repl='__')
Length: 5 dtype: str (expression)
---------------------------------
0  Som__hing
1  very pr__ty
2  is coming
3  our
4  way.
-
rfind
(sub, start=0, end=None)¶ Returns the highest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned.
Parameters:
- sub (str) – The substring to search for
- start (int) – The start index within each sample
- end (int) – The end index within each sample
Returns: an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rfind(sub="et")
Expression = str_rfind(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
-
rindex
(sub, start=0, end=None)¶ Returns the highest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned. Same as str.rfind.
Parameters:
- sub (str) – The substring to search for
- start (int) – The start index within each sample
- end (int) – The end index within each sample
Returns: an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rindex(sub="et")
Expression = str_rindex(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
-
rjust
(width, fillchar=' ')¶ Fills the left side of string samples with a specified character such that the strings are right-justified.
Parameters:
- width (int) – The desired total width of the resulting string
- fillchar (str) – The character used for filling
Returns: an expression containing the filled strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rjust(width=10, fillchar='!')
Expression = str_rjust(text, width=10, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0  !Something
1  very pretty
2  !is coming
3  !!!!!!!our
4  !!!!!!way.
-
rstrip
(to_strip=None)¶ Remove trailing characters from a string sample.
Parameters: to_strip (str) – The string to be removed Returns: an expression containing the modified string column. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rstrip(to_strip='ing')
Expression = str_rstrip(text, to_strip='ing')
Length: 5 dtype: str (expression)
---------------------------------
0  Someth
1  very pretty
2  is com
3  our
4  way.
-
slice
(start=0, stop=None)¶ Slice substrings from each string element in a column.
Parameters:
- start (int) – The start index of the slice
- stop (int) – The stop index of the slice
Returns: an expression containing the sliced substrings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.slice(start=2, stop=5)
Expression = str_pandas_slice(text, start=2, stop=5)
Length: 5 dtype: str (expression)
---------------------------------
0  met
1  ry
2  co
3  r
4  y.
-
startswith
(pat)¶ Check if a start of a string matches a pattern.
Parameters: pat (str) – A string pattern. Regular expressions are not supported. Returns: an expression which is evaluated to True if the pattern is found at the start of a string sample, False otherwise. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.startswith(pat='is')
Expression = str_startswith(text, pat='is')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  True
3  False
4  False
-
strip
(to_strip=None)¶ Removes leading and trailing characters.
Strips whitespaces (including newlines), or a set of specified characters, from each string sample in a column, from both the left and right sides.
Parameters: to_strip (str) – The characters to be removed. All combinations of the characters will be removed. If None, it removes whitespaces.
Returns: an expression containing the modified string samples.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.strip(to_strip='very')
Expression = str_strip(text, to_strip='very')
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1  prett
2  is coming
3  ou
4  way.
-
title
()¶ Converts all string samples to titlecase.
Returns: an expression containing the converted strings. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.title()
Expression = str_title(text)
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1  Very Pretty
2  Is Coming
3  Our
4  Way.
-
upper
()¶ Converts all strings in a column to uppercase.
Returns: an expression containing the converted strings. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.upper()
Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0  SOMETHING
1  VERY PRETTY
2  IS COMING
3  OUR
4  WAY.
-
zfill
(width)¶ Pad strings in a column by prepending "0" characters.
Parameters: width (int) – The minimum length of the resulting string. Strings shorter than width will be prepended with zeros.
Returns: an expression containing the modified strings. Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.zfill(width=12)
Expression = str_zfill(text, width=12)
Length: 5 dtype: str (expression)
---------------------------------
0  000Something
1  0very pretty
2  000is coming
3  000000000our
4  00000000way.
-
String (pandas) operations¶
-
class
vaex.expression.
StringOperationsPandas
(expression)[source]¶ Bases:
object
String operations using Pandas Series (much slower)
-
__weakref__
¶ list of weak references to the object (if defined)
-
byte_length
(**kwargs)¶ Wrapper around pandas.Series.byte_length
-
capitalize
(**kwargs)¶ Wrapper around pandas.Series.capitalize
-
cat
(**kwargs)¶ Wrapper around pandas.Series.cat
-
center
(**kwargs)¶ Wrapper around pandas.Series.center
-
contains
(**kwargs)¶ Wrapper around pandas.Series.contains
-
count
(**kwargs)¶ Wrapper around pandas.Series.count
-
endswith
(**kwargs)¶ Wrapper around pandas.Series.endswith
-
equals
(**kwargs)¶ Wrapper around pandas.Series.equals
-
find
(**kwargs)¶ Wrapper around pandas.Series.find
-
get
(**kwargs)¶ Wrapper around pandas.Series.get
-
index
(**kwargs)¶ Wrapper around pandas.Series.index
-
isalnum
(**kwargs)¶ Wrapper around pandas.Series.isalnum
-
isalpha
(**kwargs)¶ Wrapper around pandas.Series.isalpha
-
isdigit
(**kwargs)¶ Wrapper around pandas.Series.isdigit
-
islower
(**kwargs)¶ Wrapper around pandas.Series.islower
-
isspace
(**kwargs)¶ Wrapper around pandas.Series.isspace
-
isupper
(**kwargs)¶ Wrapper around pandas.Series.isupper
-
join
(**kwargs)¶ Wrapper around pandas.Series.join
-
len
(**kwargs)¶ Wrapper around pandas.Series.len
-
ljust
(**kwargs)¶ Wrapper around pandas.Series.ljust
-
lower
(**kwargs)¶ Wrapper around pandas.Series.lower
-
lstrip
(**kwargs)¶ Wrapper around pandas.Series.lstrip
-
match
(**kwargs)¶ Wrapper around pandas.Series.match
-
pad
(**kwargs)¶ Wrapper around pandas.Series.pad
-
repeat
(**kwargs)¶ Wrapper around pandas.Series.repeat
-
replace
(**kwargs)¶ Wrapper around pandas.Series.replace
-
rfind
(**kwargs)¶ Wrapper around pandas.Series.rfind
-
rindex
(**kwargs)¶ Wrapper around pandas.Series.rindex
-
rjust
(**kwargs)¶ Wrapper around pandas.Series.rjust
-
rstrip
(**kwargs)¶ Wrapper around pandas.Series.rstrip
-
slice
(**kwargs)¶ Wrapper around pandas.Series.slice
-
split
(**kwargs)¶ Wrapper around pandas.Series.split
-
startswith
(**kwargs)¶ Wrapper around pandas.Series.startswith
-
strip
(**kwargs)¶ Wrapper around pandas.Series.strip
-
title
(**kwargs)¶ Wrapper around pandas.Series.title
-
upper
(**kwargs)¶ Wrapper around pandas.Series.upper
-
zfill
(**kwargs)¶ Wrapper around pandas.Series.zfill
-
Date/time operations¶
-
class
vaex.expression.
DateTime
(expression)[source]¶ Bases:
object
DateTime operations
Usually accessed using e.g. df.birthday.dt.dayofweek
-
__weakref__
¶ list of weak references to the object (if defined)
-
day
¶ Extracts the day from a datetime sample.
Returns: an expression containing the day extracted from a datetime column. Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day
Expression = dt_day(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  12
1  11
2  12
-
day_name
¶ Returns the day names of a datetime sample in English.
Returns: an expression containing the day names extracted from a datetime column. Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day_name
Expression = dt_day_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0  Monday
1  Thursday
2  Thursday
-
dayofweek
¶ Obtain the day of the week with Monday=0 and Sunday=6
Returns: an expression containing the day of week. Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofweek
Expression = dt_dayofweek(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  0
1  3
2  3
-
dayofyear
¶ The ordinal day of the year.
Returns: an expression containing the ordinal day of the year. Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofyear
Expression = dt_dayofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  285
1   42
2  316
-
hour
¶ Extracts the hour out of a datetime sample.
Returns: an expression containing the hour extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.hour Expression = dt_hour(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 3 1 10 2 11
-
is_leap_year
¶ Check whether a year is a leap year.
Returns: an expression which evaluates to True if a year is a leap year, and to False otherwise. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.is_leap_year Expression = dt_is_leap_year(date) Length: 3 dtype: bool (expression) ---------------------------------- 0 False 1 True 2 False
-
minute
¶ Extracts the minute out of a datetime sample.
Returns: an expression containing the minute extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.minute Expression = dt_minute(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 31 1 17 2 34
-
month
¶ Extracts the month out of a datetime sample.
Returns: an expression containing the month extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.month Expression = dt_month(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 10 1 2 2 11
-
month_name
¶ Returns the month names of a datetime sample in English.
Returns: an expression containing the month names extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.month_name Expression = dt_month_name(date) Length: 3 dtype: str (expression) --------------------------------- 0 October 1 February 2 November
-
second
¶ Extracts the second out of a datetime sample.
Returns: an expression containing the second extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.second Expression = dt_second(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 0 1 34 2 22
-
weekofyear
¶ Returns the week ordinal of the year.
Returns: an expression containing the week ordinal of the year, extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.weekofyear Expression = dt_weekofyear(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 42 1 6 2 46
-
year
¶ Extracts the year out of a datetime sample.
Returns: an expression containing the year extracted from a datetime column. Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.year Expression = dt_year(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 2009 1 2016 2 2015
-
Timedelta operations¶
-
class
vaex.expression.
TimeDelta
(expression)[source]¶ Bases:
object
TimeDelta operations
Usually accessed using e.g. df.delay.td.days
-
__weakref__
¶ list of weak references to the object (if defined)
-
days
¶ Number of days in each timedelta sample.
Returns: an expression containing the number of days in a timedelta sample. Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.days Expression = td_days(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 204 1 1 2 471 3 -22
-
microseconds
¶ Number of microseconds (>= 0 and less than 1 second) in each timedelta sample.
Returns: an expression containing the number of microseconds in a timedelta sample. Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.microseconds Expression = td_microseconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 290448 1 978582 2 19583 3 709551
-
nanoseconds
¶ Number of nanoseconds (>= 0 and less than 1 microsecond) in each timedelta sample.
Returns: an expression containing the number of nanoseconds in a timedelta sample. Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.nanoseconds Expression = td_nanoseconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 384 1 16 2 488 3 616
-
seconds
¶ Number of seconds (>= 0 and less than 1 day) in each timedelta sample.
Returns: an expression containing the number of seconds in a timedelta sample. Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.seconds Expression = td_seconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 30436 1 39086 2 28681 3 23519
-
total_seconds
()¶ Total duration of each timedelta sample expressed in seconds.
Returns: an expression containing the total number of seconds in a timedelta sample. Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.total_seconds() Expression = td_total_seconds(delta) Length: 4 dtype: float64 (expression) ------------------------------------- 0 -7.88024e+08 1 -2.55032e+09 2 6.72134e+08 3 2.85489e+08
-
Geo operations¶
-
class
vaex.geo.
DataFrameAccessorGeo
(df)[source]¶ Bases:
object
Geometry/geographic helper methods
Example:
>>> df_xyz = df.geo.spherical2cartesian(df.longitude, df.latitude, df.distance) >>> df_xyz.x.mean()
-
__weakref__
¶ list of weak references to the object (if defined)
-
bearing
(lon1, lat1, lon2, lat2, bearing='bearing', inplace=False)[source]¶ Calculates a bearing, based on http://www.movable-type.co.uk/scripts/latlong.html
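The formula behind this method (from the movable-type reference above) can be sketched in plain Python; `initial_bearing` is a hypothetical helper, not part of vaex:

```python
import math

def initial_bearing(lon1, lat1, lon2, lat2):
    """Initial bearing in degrees (0..360) from point 1 to point 2, inputs in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

initial_bearing(0, 0, 1, 0)  # due east from the equator: 90.0
```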
-
cartesian2spherical
(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position', inplace=False)[source]¶ Convert cartesian to spherical coordinates.
Parameters: - x –
- y –
- z –
- alpha –
- delta – name for polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True).
- distance –
- radians –
- center –
- center_name –
Returns:
-
cartesian_to_polar
(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False, inplace=False)[source]¶ Convert cartesian to polar coordinates
Parameters: - x – expression for x
- y – expression for y
- radius_out – name for the virtual column for the radius
- azimuth_out – name for the virtual column for the azimuth angle
- propagate_uncertainties – {propagate_uncertainties}
- radians – if True, azimuth is in radians, defaults to degrees
Returns:
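The conversion itself is the standard r = sqrt(x² + y²), φ = atan2(y, x); a plain-Python sketch (the `cartesian_to_polar` helper below is hypothetical, not the vaex method):

```python
import math

def cartesian_to_polar(x, y, radians=False):
    """Radius and azimuth of a point; azimuth in degrees by default, matching the method above."""
    r = math.hypot(x, y)
    phi = math.atan2(y, x)
    if not radians:
        phi = math.degrees(phi)
    return r, phi

cartesian_to_polar(3.0, 4.0)  # (5.0, 53.130...)
```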
-
project_aitoff
(alpha, delta, x, y, radians=True, inplace=False)[source]¶ Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection
Parameters: - alpha – azimuth angle
- delta – polar angle
- x – output name for x coordinate
- y – output name for y coordinate
- radians – input and output in radians (True), or degrees (False)
Returns:
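A sketch of the Aitoff projection formulas, assuming the usual unnormalized-sinc form from the linked Wikipedia article (the helper and its conventions are illustrative, not vaex's exact implementation):

```python
import math

def project_aitoff(alpha, delta):
    """Aitoff projection of azimuth alpha (-pi..pi) and polar angle delta, in radians."""
    a = math.acos(math.cos(delta) * math.cos(alpha / 2))
    sinc_a = math.sin(a) / a if a != 0 else 1.0  # unnormalized sinc
    x = 2 * math.cos(delta) * math.sin(alpha / 2) / sinc_a
    y = math.sin(delta) / sinc_a
    return x, y

project_aitoff(0.0, 0.0)  # (0.0, 0.0): the projection's origin
```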
-
project_gnomic
(alpha, delta, alpha0=0, delta0=0, x='x', y='y', radians=False, postfix='', inplace=False)[source]¶ Adds a gnomonic projection to the DataFrame
-
rotation_2d
(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False, inplace=False)[source]¶ Rotation in 2d.
Parameters: Returns:
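This is the standard 2d rotation matrix applied to the (x, y) columns; a plain-Python sketch of the transformation (hypothetical helper, not the vaex method itself):

```python
import math

def rotate_2d(x, y, angle_degrees):
    """Counter-clockwise rotation of the point (x, y) by angle_degrees."""
    t = math.radians(angle_degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

rotate_2d(1.0, 0.0, 90)  # approximately (0.0, 1.0)
```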
-
spherical2cartesian
(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], radians=False, inplace=False)[source]¶ Convert spherical to cartesian coordinates.
Parameters: - alpha –
- delta – polar angle, ranging from the -90 (south pole) to 90 (north pole)
- distance – radial distance, determines the units of x, y and z
- xname –
- yname –
- zname –
- propagate_uncertainties – {propagate_uncertainties}
- center –
- radians –
Returns: New dataframe (if inplace is False) with new x, y, z columns
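A sketch of the conversion, assuming the convention stated in the parameters (delta measured from the equator, degrees by default); the helper is illustrative, not vaex's implementation:

```python
import math

def spherical_to_cartesian(alpha, delta, distance, radians=False):
    """alpha: azimuth, delta: polar angle (-90 at the south pole .. 90 at the north pole)."""
    if not radians:
        alpha, delta = math.radians(alpha), math.radians(delta)
    x = distance * math.cos(delta) * math.cos(alpha)
    y = distance * math.cos(delta) * math.sin(alpha)
    z = distance * math.sin(delta)
    return x, y, z

spherical_to_cartesian(0, 90, 1.0)  # the north pole: (0, 0, 1) up to rounding
```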
-
velocity_cartesian2polar
(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False, inplace=False)[source]¶ Convert cartesian to polar velocities.
Parameters: - x –
- y –
- vx –
- radius_polar – Optional expression for the radius; passing it may lead to better performance.
- vy –
- vr_out –
- vazimuth_out –
- propagate_uncertainties – {propagate_uncertainties}
Returns:
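The underlying decomposition projects the cartesian velocity onto the radial and azimuthal directions; a plain-Python sketch (hypothetical helper, standard formulas assumed):

```python
import math

def velocity_cartesian_to_polar(x, y, vx, vy):
    """Radial (vr) and azimuthal (vphi) velocity components from cartesian position and velocity."""
    r = math.hypot(x, y)
    vr = (x * vx + y * vy) / r      # projection onto the radial unit vector
    vphi = (x * vy - y * vx) / r    # projection onto the azimuthal unit vector
    return vr, vphi

velocity_cartesian_to_polar(1.0, 0.0, 0.0, 2.0)  # purely azimuthal motion: (0.0, 2.0)
```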
-
velocity_cartesian2spherical
(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None, inplace=False)[source]¶ Convert velocities from a cartesian to a spherical coordinate system
TODO: uncertainty propagation
Parameters: - x – name of x column (input)
- y – y
- z – z
- vx – vx
- vy – vy
- vz – vz
- vr – name of the column for the radial velocity in the r direction (output)
- vlong – name of the column for the velocity component in the longitude direction (output)
- vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)
- distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
Returns:
-
velocity_polar2cartesian
(x='x', y='y', azimuth=None, vr='vr_polar', vazimuth='vphi_polar', vx_out='vx', vy_out='vy', propagate_uncertainties=False, inplace=False)[source]¶ Convert cylindrical polar velocities to Cartesian.
Parameters: - x –
- y –
- azimuth – Optional expression for the azimuth in degrees; passing it may lead to better performance.
- vr –
- vazimuth –
- vx_out –
- vy_out –
- propagate_uncertainties – {propagate_uncertainties}
-
GraphQL operations¶
-
class
vaex.graphql.
DataFrameAccessorGraphQL
(df)[source]¶ Bases:
object
Exposes a GraphQL layer to a DataFrame
See the GraphQL example for more usage.
The easiest way to learn to use the GraphQL language/vaex interface is to launch a server, and play with the GraphiQL graphical interface, its autocomplete, and the schema explorer.
We try to stay close to the Hasura API: https://docs.hasura.io/1.0/graphql/manual/api-reference/graphql-api/query.html
-
__weakref__
¶ list of weak references to the object (if defined)
-
Machine learning with vaex.ml¶
See the ML tutorial for an introduction, and the ML examples for more advanced usage.
Scikit-learn¶
vaex.ml.sklearn.IncrementalPredictor ([…]) |
This class wraps any scikit-learn estimator (a.k.a. predictor) that has a .partial_fit method, and makes it a vaex pipeline object. |
vaex.ml.sklearn.Predictor ([features, model, …]) |
This class wraps any scikit-learn estimator (a.k.a. predictor), making it a vaex pipeline object. |
-
class
vaex.ml.sklearn.
IncrementalPredictor
(batch_size=1000000, features=traitlets.Undefined, model=None, num_epochs=1, partial_fit_kwargs=traitlets.Undefined, prediction_name='prediction', shuffle=False, target='')[source]¶ Bases:
vaex.ml.state.HasState
This class wraps any scikit-learn estimator (a.k.a. predictor) that has a .partial_fit method, and makes it a vaex pipeline object.
By wrapping “on-line” scikit-learn estimators with this class, they become vaex pipeline objects and can thus take full advantage of the serialization and pipeline system of vaex. While the underlying estimator needs to implement the .partial_fit method, this class exposes the standard .fit method, and the batching happens behind the scenes. One can also iterate over the data multiple times (epochs), and optionally shuffle each batch before it is sent to the estimator. The predict method returns a numpy array, while the transform method adds the prediction as a virtual column to a vaex DataFrame.
Note: the .fit method will use as much memory as needed to copy one batch of data, while the .predict method will require as much memory as needed to output the predictions as a numpy array. The transform method is evaluated lazily, and no memory copies are made.
Note: we are using normal sklearn without modifications here.
Example:
>>> import vaex >>> import vaex.ml >>> from vaex.ml.sklearn import IncrementalPredictor >>> from sklearn.linear_model import SGDRegressor >>> >>> df = vaex.example() >>> >>> features = df.column_names[:6] >>> target = 'FeH' >>> >>> standard_scaler = vaex.ml.StandardScaler(features=features) >>> df = standard_scaler.fit_transform(df) >>> >>> features = df.get_column_names(regex='^standard') >>> model = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=42) >>> >>> incremental = IncrementalPredictor(model=model, ... features=features, ... target=target, ... batch_size=10_000, ... num_epochs=3, ... shuffle=True, ... prediction_name='pred_FeH') >>> incremental.fit(df=df) >>> df = incremental.transform(df) >>> df.head(5)[['FeH', 'pred_FeH']] # FeH pred_FeH 0 -2.30923 -1.66226 1 -1.78874 -1.68218 2 -0.761811 -1.59562 3 -1.52088 -1.62225 4 -2.65534 -1.61991
Parameters: - batch_size – Number of samples to be sent to the model in each batch.
- features – List of features to use.
- model – A scikit-learn estimator with a .partial_fit method.
- num_epochs – Number of times each batch is sent to the model.
- partial_fit_kwargs – A dictionary of keyword arguments to be passed on to the partial_fit method of the model.
- prediction_name – The name of the virtual column housing the predictions.
- shuffle – If True, shuffle the samples before sending them to the model.
- target – The name of the target column.
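The epoch/batch training loop described above can be sketched in plain Python; `RunningMean` is a deliberately trivial stand-in for a scikit-learn on-line estimator, and `fit_in_batches` is a hypothetical illustration of the .fit logic, not vaex's code:

```python
class RunningMean:
    """Stand-in for an 'on-line' estimator exposing a .partial_fit method."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def partial_fit(self, X, y):
        self.total += sum(y)
        self.count += len(y)

    def predict(self, X):
        return [self.total / self.count] * len(X)


def fit_in_batches(model, X, y, batch_size, num_epochs=1):
    # only one batch of data is materialized at a time, as in the memory note above
    for _ in range(num_epochs):
        for start in range(0, len(y), batch_size):
            model.partial_fit(X[start:start + batch_size],
                              y[start:start + batch_size])


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
model = RunningMean()
fit_in_batches(model, X, y, batch_size=2, num_epochs=2)
model.predict(X)  # [3.5, 3.5, ...]: the mean target, for this toy estimator
```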
-
batch_size
¶ An int trait.
-
features
¶ An instance of a Python list.
-
fit
(df, progress=None)[source]¶ Fit the IncrementalPredictor to the DataFrame.
Parameters: - df – A vaex DataFrame containing the features and target on which to train the model.
- progress – If True, display a progressbar which tracks the training progress.
-
model
¶ A trait which allows any value.
-
num_epochs
¶ An int trait.
-
partial_fit_kwargs
¶ An instance of a Python dict.
-
predict
(df)[source]¶ Get an in-memory numpy array with the predictions of the IncrementalPredictor.
Parameters: df – A vaex DataFrame, containing the input features. Returns: An in-memory numpy array containing the IncrementalPredictor predictions. Return type: numpy.array
-
prediction_name
¶ A trait for unicode strings.
-
shuffle
¶ A boolean (True, False) trait.
-
target
¶ A trait for unicode strings.
-
transform
(df)[source]¶ Transform a DataFrame such that it contains the predictions of the IncrementalPredictor in the form of a virtual column.
Parameters: df – A vaex DataFrame. Return copy: A shallow copy of the DataFrame that includes the IncrementalPredictor prediction as a virtual column. Return type: DataFrame
-
class
vaex.ml.sklearn.
Predictor
(features=traitlets.Undefined, model=None, prediction_name='prediction', target='')[source]¶ Bases:
vaex.ml.state.HasState
This class wraps any scikit-learn estimator (a.k.a predictor) making it a vaex pipeline object.
By wrapping any scikit-learn estimator with this class, it becomes a vaex pipeline object. Thus, it can take full advantage of the serialization and pipeline system of vaex. One can use the predict method to get a numpy array as the output of a fitted estimator, or the transform method to add such a prediction to a vaex DataFrame as a virtual column.
Note that a full memory copy of the data used is created when fit and predict are called. The transform method is evaluated lazily.
The scikit-learn estimators themselves are not modified at all, they are taken from your local installation of scikit-learn.
Example:
>>> import vaex.ml >>> from vaex.ml.sklearn import Predictor >>> from sklearn.linear_model import LinearRegression >>> df = vaex.ml.datasets.load_iris() >>> features = ['sepal_width', 'petal_length', 'sepal_length'] >>> df_train, df_test = df.ml.train_test_split() >>> model = Predictor(model=LinearRegression(), features=features, target='petal_width', prediction_name='pred') >>> model.fit(df_train) >>> df_train = model.transform(df_train) >>> df_train.head(3) # sepal_length sepal_width petal_length petal_width class_ pred 0 5.4 3 4.5 1.5 1 1.64701 1 4.8 3.4 1.6 0.2 0 0.352236 2 6.9 3.1 4.9 1.5 1 1.59336 >>> df_test = model.transform(df_test) >>> df_test.head(3) # sepal_length sepal_width petal_length petal_width class_ pred 0 5.9 3 4.2 1.5 1 1.39437 1 6.1 3 4.6 1.4 1 1.56469 2 6.6 2.9 4.6 1.3 1 1.44276
Parameters: - features – List of features to use.
- model – A scikit-learn estimator.
- prediction_name – The name of the virtual column housing the predictions.
- target – The name of the target column.
-
features
¶ An instance of a Python list.
-
fit
(df, **kwargs)[source]¶ Fit the Predictor to the DataFrame.
Parameters: df – A vaex DataFrame containing the features and target on which to train the model.
-
model
¶ A trait which allows any value.
-
predict
(df)[source]¶ Get an in-memory numpy array with the predictions of the Predictor.
Parameters: df – A vaex DataFrame, containing the input features. Returns: An in-memory numpy array containing the Predictor predictions. Return type: numpy.array
-
prediction_name
¶ A trait for unicode strings.
-
target
¶ A trait for unicode strings.
-
transform
(df)[source]¶ Transform a DataFrame such that it contains the predictions of the Predictor in the form of a virtual column.
Parameters: df – A vaex DataFrame. Return copy: A shallow copy of the DataFrame that includes the SKLearnPredictor prediction as a virtual column. Return type: DataFrame
-
class
vaex.ml.sklearn.
SKLearnPredictor
(features=traitlets.Undefined, model=None, prediction_name='prediction', target='')[source]¶ Bases:
vaex.ml.sklearn.Predictor
Parameters: - features – List of features to use.
- model – A scikit-learn estimator.
- prediction_name – The name of the virtual column housing the predictions.
- target – The name of the target column.
Clustering¶
vaex.ml.cluster.KMeans ([cluster_centers, …]) |
The KMeans clustering algorithm. |
-
class
vaex.ml.cluster.
KMeans
(cluster_centers=traitlets.Undefined, features=traitlets.Undefined, inertia=None, init='random', max_iter=300, n_clusters=2, n_init=1, prediction_label='prediction_kmeans', random_state=None, verbose=False)[source]¶ Bases:
vaex.ml.state.HasState
The KMeans clustering algorithm.
Example:
>>> import vaex.ml >>> import vaex.ml.cluster >>> df = vaex.ml.datasets.load_iris() >>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width'] >>> cls = vaex.ml.cluster.KMeans(n_clusters=3, features=features, init='random', max_iter=10) >>> cls.fit(df) >>> df = cls.transform(df) >>> df.head(5) # sepal_width petal_length sepal_length petal_width class_ prediction_kmeans 0 3 4.2 5.9 1.5 1 2 1 3 4.6 6.1 1.4 1 2 2 2.9 4.6 6.6 1.3 1 2 3 3.3 5.7 6.7 2.1 2 0 4 4.2 1.4 5.5 0.2 0 1
Parameters: - cluster_centers – Coordinates of cluster centers.
- features – List of features to cluster.
- inertia – Sum of squared distances of samples to their closest cluster center.
- init – Method for initializing the centroids.
- max_iter – Maximum number of iterations of the KMeans algorithm for a single run.
- n_clusters – Number of clusters to form.
- n_init – Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.
- prediction_label – The name of the virtual column that houses the cluster labels for each point.
- random_state – Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.
- verbose – If True, enable verbosity mode.
Transformers/encoders¶
vaex.ml.transformations.FrequencyEncoder ([…]) |
Encode categorical columns by the frequency of their respective samples. |
vaex.ml.transformations.LabelEncoder ([…]) |
Encode categorical columns with integer values between 0 and num_classes-1. |
vaex.ml.transformations.MaxAbsScaler ([…]) |
Scale features by their maximum absolute value. |
vaex.ml.transformations.MinMaxScaler ([…]) |
Will scale a set of features to a given range. |
vaex.ml.transformations.OneHotEncoder ([…]) |
Encode categorical columns according to the One-Hot scheme. |
vaex.ml.transformations.PCA ([features, …]) |
Transform a set of features using a Principal Component Analysis. |
vaex.ml.transformations.RobustScaler ([…]) |
The RobustScaler removes the median and scales the data according to a given percentile range. |
vaex.ml.transformations.StandardScaler ([…]) |
Standardize features by removing their mean and scaling them to unit variance. |
vaex.ml.transformations.CycleTransformer ([…]) |
A strategy for transforming cyclical features (e.g. |
vaex.ml.transformations.BayesianTargetEncoder (…) |
Encode categorical variables with a Bayesian Target Encoder. |
vaex.ml.transformations.WeightOfEvidenceEncoder (…) |
Encode categorical variables with a Weight of Evidence Encoder. |
-
class
vaex.ml.transformations.
FrequencyEncoder
(features=traitlets.Undefined, mappings_=traitlets.Undefined, prefix='frequency_encoded_', unseen='nan')[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical columns by the frequency of their respective samples.
Example:
>>> import vaex >>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red', 'green']) >>> df # color 0 red 1 green 2 green 3 blue 4 red >>> encoder = vaex.ml.FrequencyEncoder(features=['color']) >>> encoder.fit_transform(df) # color frequency_encoded_color 0 red 0.333333 1 green 0.5 2 green 0.5 3 blue 0.166667 4 red 0.333333 5 green 0.5
Parameters: - features – List of features to transform.
- prefix – Prefix for the names of the transformed features.
- unseen – Strategy to deal with unseen values.
-
class
vaex.ml.transformations.
LabelEncoder
(allow_unseen=False, features=traitlets.Undefined, prefix='label_encoded_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical columns with integer values between 0 and num_classes-1.
Example:
>>> import vaex >>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red']) >>> df # color 0 red 1 green 2 green 3 blue 4 red >>> encoder = vaex.ml.LabelEncoder(features=['color']) >>> encoder.fit_transform(df) # color label_encoded_color 0 red 2 1 green 1 2 green 1 3 blue 0 4 red 2
Parameters: - allow_unseen – If True, unseen values will be encoded with -1, otherwise an error is raised
- features – List of features to transform.
- prefix – Prefix for the names of the transformed features.
-
class
vaex.ml.transformations.
MaxAbsScaler
(features=traitlets.Undefined, prefix='absmax_scaled_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Scale features by their maximum absolute value.
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10]) >>> df # x y 0 2 -2 1 5 3 2 7 0 3 2 0 4 15 10 >>> scaler = vaex.ml.MaxAbsScaler(features=['x', 'y']) >>> scaler.fit_transform(df) # x y absmax_scaled_x absmax_scaled_y 0 2 -2 0.133333 -0.2 1 5 3 0.333333 0.3 2 7 0 0.466667 0 3 2 0 0.133333 0 4 15 10 1 1
Parameters: - features – List of features to transform.
- prefix – Prefix for the names of the transformed features.
-
class
vaex.ml.transformations.
MinMaxScaler
(feature_range=traitlets.Undefined, features=traitlets.Undefined, prefix='minmax_scaled_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Will scale a set of features to a given range.
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10]) >>> df # x y 0 2 -2 1 5 3 2 7 0 3 2 0 4 15 10 >>> scaler = vaex.ml.MinMaxScaler(features=['x', 'y']) >>> scaler.fit_transform(df) # x y minmax_scaled_x minmax_scaled_y 0 2 -2 0 0 1 5 3 0.230769 0.416667 2 7 0 0.384615 0.166667 3 2 0 0 0.166667 4 15 10 1 1
Parameters: - feature_range – The range the features are scaled to.
- features – List of features to transform.
- prefix – Prefix for the names of the transformed features.
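The scaling shown above reduces to the usual (x − min) / (max − min) mapping, stretched to the target feature_range (a, b); a plain-Python sketch reproducing the minmax_scaled_x column from the example:

```python
x = [2, 5, 7, 2, 15]
lo, hi = min(x), max(x)
a, b = 0.0, 1.0  # default feature_range
scaled = [(v - lo) / (hi - lo) * (b - a) + a for v in x]
scaled  # [0.0, 0.230769..., 0.384615..., 0.0, 1.0], as in the example above
```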
-
class
vaex.ml.transformations.
OneHotEncoder
(features=traitlets.Undefined, one=1, prefix='', zero=0)[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical columns according to the One-Hot scheme.
Example:
>>> import vaex >>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red']) >>> df # color 0 red 1 green 2 green 3 blue 4 red >>> encoder = vaex.ml.OneHotEncoder(features=['color']) >>> encoder.fit_transform(df) # color color_blue color_green color_red 0 red 0 0 1 1 green 0 1 0 2 green 0 1 0 3 blue 1 0 0 4 red 0 0 1
Parameters: - features – List of features to transform.
- one – Value to encode when a category is present.
- prefix – Prefix for the names of the transformed features.
- zero – Value to encode when category is absent.
-
class
vaex.ml.transformations.
PCA
(features=traitlets.Undefined, n_components=0, prefix='PCA_', progress=False)[source]¶ Bases:
vaex.ml.transformations.Transformer
Transform a set of features using a Principal Component Analysis.
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10]) >>> df # x y 0 2 -2 1 5 3 2 7 0 3 2 0 4 15 10 >>> pca = vaex.ml.PCA(n_components=2, features=['x', 'y']) >>> pca.fit_transform(df) # x y PCA_0 PCA_1 0 2 -2 5.92532 0.413011 1 5 3 0.380494 -1.39112 2 7 0 0.840049 2.18502 3 2 0 4.61287 -1.09612 4 15 10 -11.7587 -0.110794
Parameters: - features – List of features to transform.
- n_components – Number of components to retain. If None, all the components will be retained.
- prefix – Prefix for the names of the transformed features.
- progress – If True, display a progressbar of the PCA fitting process.
-
class
vaex.ml.transformations.
RobustScaler
(features=traitlets.Undefined, percentile_range=traitlets.Undefined, prefix='robust_scaled_', with_centering=True, with_scaling=True)[source]¶ Bases:
vaex.ml.transformations.Transformer
The RobustScaler removes the median and scales the data according to a given percentile range. By default, the scaling is done between the 25th and the 75th percentile. Centering and scaling happens independently for each feature (column).
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10]) >>> df # x y 0 2 -2 1 5 3 2 7 0 3 2 0 4 15 10 >>> scaler = vaex.ml.RobustScaler(features=['x', 'y']) >>> scaler.fit_transform(df) # x y robust_scaled_x robust_scaled_y 0 2 -2 -0.333686 -0.266302 1 5 3 -0.000596934 0.399453 2 7 0 0.221462 0 3 2 0 -0.333686 0 4 15 10 1.1097 1.33151
Parameters: - features – List of features to transform.
- percentile_range – The percentile range to which each feature is scaled.
- prefix – Prefix for the names of the transformed features.
- with_centering – If True, remove the median.
- with_scaling – If True, scale each feature between the specified percentile range.
-
class
vaex.ml.transformations.
StandardScaler
(features=traitlets.Undefined, prefix='standard_scaled_', with_mean=True, with_std=True)[source]¶ Bases:
vaex.ml.transformations.Transformer
Standardize features by removing their mean and scaling them to unit variance.
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10]) >>> df # x y 0 2 -2 1 5 3 2 7 0 3 2 0 4 15 10 >>> scaler = vaex.ml.StandardScaler(features=['x', 'y']) >>> scaler.fit_transform(df) # x y standard_scaled_x standard_scaled_y 0 2 -2 -0.876523 -0.996616 1 5 3 -0.250435 0.189832 2 7 0 0.166957 -0.522037 3 2 0 -0.876523 -0.522037 4 15 10 1.83652 1.85086
Parameters: - features – List of features to transform.
- prefix – Prefix for the names of the transformed features.
- with_mean – If True, remove the mean from each feature.
- with_std – If True, scale each feature to unit variance.
-
class
vaex.ml.transformations.
CycleTransformer
(features=traitlets.Undefined, n=0, prefix_x='', prefix_y='', suffix_x='_x', suffix_y='_y')[source]¶ Bases:
vaex.ml.transformations.Transformer
A strategy for transforming cyclical features (e.g. angles, time).
Think of each feature as an angle on a unit circle in polar coordinates, and then obtaining the x and y coordinate projections, i.e. the cos and sin components respectively.
Suitable for a variety of machine learning tasks. It preserves the cyclical continuity of the feature. Inspired by: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
Example:
>>> import vaex >>> import vaex.ml >>> df = vaex.from_arrays(days=[0, 1, 2, 3, 4, 5, 6]) >>> cyctrans = vaex.ml.CycleTransformer(n=7, features=['days']) >>> cyctrans.fit_transform(df) # days days_x days_y 0 0 1 0 1 1 0.62349 0.781831 2 2 -0.222521 0.974928 3 3 -0.900969 0.433884 4 4 -0.900969 -0.433884 5 5 -0.222521 -0.974928 6 6 0.62349 -0.781831
Parameters: - features – List of features to transform.
- n – The number of elements in one cycle.
- prefix_x – Prefix for the x-component of the transformed features.
- prefix_y – Prefix for the y-component of the transformed features.
- suffix_x – Suffix for the x-component of the transformed features.
- suffix_y – Suffix for the y-component of the transformed features.
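The transform amounts to projecting each value onto the unit circle at angle 2π·value/n; a plain-Python sketch reproducing the days_x and days_y columns from the example:

```python
import math

n = 7  # number of elements in one cycle (days of the week)
days = [0, 1, 2, 3, 4, 5, 6]
days_x = [math.cos(2 * math.pi * d / n) for d in days]  # cos component
days_y = [math.sin(2 * math.pi * d / n) for d in days]  # sin component
days_x[1], days_y[1]  # (0.62349..., 0.78183...), as in the example above
```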
-
class
vaex.ml.transformations.
BayesianTargetEncoder
(*args, **kwargs)[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical variables with a Bayesian Target Encoder.
The categories are encoded by the mean of their target value, which is adjusted by the global mean value of the target variable using a Bayesian schema. For a larger weight value, the target encodings are smoothed toward the global mean, while for a weight of 0, the encodings are just the mean target value per class.
Reference: https://www.wikiwand.com/en/Bayes_estimator#/Practical_example_of_Bayes_estimators
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
...                       y=[1, 1, 1, 0, 0, 0, 0, 1])
>>> target_encoder = vaex.ml.BayesianTargetEncoder(features=['x'], weight=4)
>>> target_encoder.fit_transform(df, 'y')
  #    x      y    mean_encoded_x
  0    a      1    0.625
  1    a      1    0.625
  2    a      1    0.625
  3    a      0    0.625
  4    b      0    0.375
  5    b      0    0.375
  6    b      0    0.375
  7    b      1    0.375
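The smoothing described above can be sketched in plain Python. This is not the vaex implementation, and `bayesian_encode` is a hypothetical helper; it assumes the encoding is (n · category_mean + weight · global_mean) / (n + weight), which reproduces the example table:

```python
def bayesian_encode(categories, targets, weight):
    """Smooth each category's mean target toward the global mean:
    encoding = (n * category_mean + weight * global_mean) / (n + weight),
    where n is the number of samples in the category."""
    global_mean = sum(targets) / len(targets)
    groups = {}
    for c, t in zip(categories, targets):
        groups.setdefault(c, []).append(t)
    return {
        c: (sum(ts) + weight * global_mean) / (len(ts) + weight)
        for c, ts in groups.items()
    }

codes = bayesian_encode(['a'] * 4 + ['b'] * 4, [1, 1, 1, 0, 0, 0, 0, 1], weight=4)
# reproduces the table above: {'a': 0.625, 'b': 0.375}
```

With weight=4 each category's mean (0.75 for 'a', 0.25 for 'b') is pulled halfway toward the global mean of 0.5, since the weight equals the per-category sample count here.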
class vaex.ml.transformations.WeightOfEvidenceEncoder(*args, **kwargs)[source]¶
Bases: vaex.ml.transformations.Transformer
Encode categorical variables with a Weight of Evidence Encoder.
Weight of Evidence measures how well a particular feature supports the given hypothesis (i.e. the target variable). With this encoder, each category in a categorical feature is encoded by its “strength”, i.e. its Weight of Evidence value. The target feature can be a boolean or numerical column, where True/1 is seen as ‘Good’ and False/0 is seen as ‘Bad’.
Reference: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'b', 'b', 'b', 'c', 'c'],
...                       y=[1, 1, 0, 0, 1, 1, 0])
>>> woe_encoder = vaex.ml.WeightOfEvidenceEncoder(target='y', features=['x'])
>>> woe_encoder.fit_transform(df)
  #    x      y    mean_encoded_x
  0    a      1    13.8155
  1    a      1    13.8155
  2    b      0    -0.693147
  3    b      0    -0.693147
  4    b      1    -0.693147
  5    c      1     0
  6    c      0     0
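Judging from the values in the example table (13.8155 ≈ ln(10⁶)), the encoding appears to be the log-odds of the per-category mean target, clipped away from 0 and 1 so the logarithm stays finite. A plain-Python sketch under that assumption (`woe_encode` and the epsilon value are assumptions, not the vaex API):

```python
import math

def woe_encode(categories, targets, epsilon=1e-6):
    """Weight of Evidence as the log-odds of the per-category mean target,
    with the mean clipped to [epsilon, 1 - epsilon]."""
    groups = {}
    for c, t in zip(categories, targets):
        groups.setdefault(c, []).append(t)
    codes = {}
    for c, ts in groups.items():
        p = sum(ts) / len(ts)
        p = min(max(p, epsilon), 1 - epsilon)
        codes[c] = math.log(p / (1 - p))
    return codes

codes = woe_encode(['a', 'a', 'b', 'b', 'b', 'c', 'c'], [1, 1, 0, 0, 1, 1, 0])
# 'b' has mean 1/3 -> ln(0.5) ≈ -0.693147; 'c' has mean 0.5 -> 0
```

Category 'a' has a mean target of exactly 1, which is why clipping is needed; without it the log-odds would be infinite.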
Boosted trees¶
vaex.ml.lightgbm.LightGBMModel ([features, …]) |
The LightGBM algorithm. |
vaex.ml.xgboost.XGBoostModel ([features, …]) |
The XGBoost algorithm. |
class vaex.ml.lightgbm.LightGBMModel(features=traitlets.Undefined, num_boost_round=0, params=traitlets.Undefined, prediction_name='lightgbm_prediction', target='')[source]¶
Bases: vaex.ml.state.HasState
The LightGBM algorithm.
This class provides an interface to the LightGBM algorithm, with some optimizations for better memory efficiency when training large datasets. The algorithm itself is not modified at all.
LightGBM is a fast gradient boosting algorithm based on decision trees and is mainly used for classification, regression and ranking tasks. It is under the umbrella of the Distributed Machine Learning Toolkit (DMTK) project of Microsoft. For more information, please visit https://github.com/Microsoft/LightGBM/.
Example:
>>> import vaex.ml
>>> import vaex.ml.lightgbm
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
...     'boosting': 'gbdt',
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'application': 'multiclass',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80}
>>> booster = vaex.ml.lightgbm.LightGBMModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
  #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
  0    3              4.5             5.4             1.5            1         [0.00165619 0.98097899 0.01736482]
  1    3.4            1.6             4.8             0.2            0         [9.99803930e-01 1.17346471e-04 7.87235133e-05]
  2    3.1            4.9             6.9             1.5            1         [0.00107541 0.9848717 0.01405289]
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
  #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
  0    3              4.2             5.9             1.5            1         [0.00208904 0.9821348 0.01577616]
  1    3              4.6             6.1             1.4            1         [0.00182039 0.98491357 0.01326604]
  2    2.9            4.6             6.6             1.3            1         [2.50915444e-04 9.98431777e-01 1.31730785e-03]
Parameters: - features – List of features to use when fitting the LightGBMModel.
- num_boost_round – Number of boosting iterations.
- params – A dictionary of parameters to be passed on to the LightGBM model.
- prediction_name – The name of the virtual column housing the predictions.
- target – The name of the target column.
fit(df, valid_sets=None, valid_names=None, early_stopping_rounds=None, evals_result=None, verbose_eval=None, copy=False, **kwargs)[source]¶ Fit the LightGBMModel to the DataFrame.
The model will train until the validation score stops improving. The validation score needs to improve at least every early_stopping_rounds rounds to continue training. This requires at least one validation DataFrame and a metric to be specified. If there is more than one validation set, all of them will be checked, but the training data is ignored. If early stopping occurs, the model will add a best_iteration field to the booster object.
Parameters: - df – A vaex DataFrame containing the features and target on which to train the model.
- valid_sets (list) – A list of DataFrames to be used for validation.
- valid_names (list) – A list of strings to label the validation sets.
- early_stopping_rounds (int) – Activates early stopping.
- evals_result (dict) – A dictionary storing the evaluation results of all valid_sets.
- verbose_eval (bool) – Requires at least one item in valid_sets. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
- copy (bool) – (default False) If True, make an in-memory copy of the data before passing it to the LightGBMModel.
predict(df, **kwargs)[source]¶ Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex DataFrame. This method accepts the keyword arguments of the predict method from LightGBM.
Parameters: df – A vaex DataFrame. Returns: An in-memory numpy array containing the LightGBMModel predictions. Return type: numpy.array
class vaex.ml.xgboost.XGBoostModel(features=traitlets.Undefined, num_boost_round=0, params=traitlets.Undefined, prediction_name='xgboost_prediction', target='')[source]¶
Bases: vaex.ml.state.HasState
The XGBoost algorithm.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. (https://github.com/dmlc/xgboost)
Example:
>>> import vaex
>>> import vaex.ml.xgboost
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'objective': 'multi:softmax',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80,
...     'silent': 1}
>>> booster = vaex.ml.xgboost.XGBoostModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
  0    5.4             3              4.5             1.5            1         1
  1    4.8             3.4            1.6             0.2            0         0
  2    6.9             3.1            4.9             1.5            1         1
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
  0    5.9             3              4.2             1.5            1         1
  1    6.1             3              4.6             1.4            1         1
  2    6.6             2.9            4.6             1.3            1         1
Parameters: - features – List of features to use when fitting the XGBoostModel.
- num_boost_round – Number of boosting iterations.
- params – A dictionary of parameters to be passed on to the XGBoost model.
- prediction_name – The name of the virtual column housing the predictions.
- target – The name of the target column.
fit(df, evals=(), early_stopping_rounds=None, evals_result=None, verbose_eval=False, **kwargs)[source]¶ Fit the XGBoost model given a DataFrame.
This method accepts all keyword arguments for the xgboost.train method.
Parameters: - df – A vaex DataFrame containing the features and target on which to train the model.
- evals – A list of pairs (DataFrame, string). List of items to be evaluated during training; this allows the user to watch performance on the validation set.
- early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one).
- evals_result (dict) – A dictionary storing the evaluation results of all the items in evals.
- verbose_eval (bool) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
predict(df, **kwargs)[source]¶ Provided a vaex DataFrame, get an in-memory numpy array with the predictions from the XGBoost model. This method accepts the keyword arguments of the predict method from XGBoost.
Parameters: df – A vaex DataFrame. Returns: An in-memory numpy array containing the XGBoostModel predictions. Return type: numpy.array
transform(df)[source]¶ Transform a DataFrame such that it contains the predictions of the XGBoostModel in the form of a virtual column.
Parameters: df – A vaex DataFrame. It should have the same columns as the DataFrame used to train the model. Returns: A shallow copy of the DataFrame that includes the XGBoostModel prediction as a virtual column. Return type: DataFrame
Incubator/experimental¶
These models are in the incubator phase and may disappear in the future.
class vaex.ml.incubator.annoy.ANNOYModel(features=traitlets.Undefined, metric='euclidean', n_neighbours=10, n_trees=10, predcition_name='annoy_prediction', prediction_name='annoy_prediction', search_k=-1)[source]¶
Bases: vaex.ml.state.HasState
Parameters: - features – List of features to use.
- metric – Metric to use for distance calculations.
- n_neighbours – Number of neighbours to retrieve for each sample.
- n_trees – Number of trees to build.
- predcition_name – Output column name for the neighbours when transforming a DataFrame (note: this trait name is misspelled in the current API).
- prediction_name – Output column name for the neighbours when transforming a DataFrame.
- search_k – Number of nodes to inspect during a query; -1 uses the Annoy default.
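Conceptually, what the transform produces per row is the indices of the n_neighbours nearest samples. A brute-force NumPy sketch of that idea (`nearest_neighbours` is a hypothetical helper; Annoy itself answers the same query approximately via its forest of trees, trading exactness for speed):

```python
import numpy as np

def nearest_neighbours(points, query, n_neighbours):
    """Brute-force nearest-neighbour query: return the indices of the
    n_neighbours rows of `points` closest to `query` (euclidean metric)."""
    d = np.linalg.norm(np.asarray(points) - np.asarray(query), axis=1)
    return np.argsort(d)[:n_neighbours]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
idx = nearest_neighbours(points, [0.9, 0.1], n_neighbours=2)
# the two closest points are at indices [1, 0]
```

The brute-force version scans every row per query; Annoy's n_trees and search_k parameters control how much of its index is explored instead, trading accuracy for query time.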