API documentation for vaex library
Quick lists
Opening/reading in your data.
vaex.open (path[, convert, shuffle, copy_index]) 
Open a DataFrame from file given by path. 
vaex.from_arrow_table (table) 
Creates a vaex DataFrame from an arrow Table. 
vaex.from_arrays (**arrays) 
Create an in memory DataFrame from numpy arrays. 
vaex.from_dict (data) 
Create an in memory dataset from a dict with column names as keys and list/numpy arrays as values 
vaex.from_csv (filename_or_buffer[, copy_index]) 
Shortcut to read a csv file using pandas and convert to a DataFrame directly. 
vaex.from_ascii (path[, seperator, names, …]) 
Create an in memory DataFrame from an ascii file (whitespace separated by default). 
vaex.from_pandas (df[, name, copy_index, …]) 
Create an in memory DataFrame from a pandas DataFrame. 
vaex.from_astropy_table (table) 
Create a vaex DataFrame from an Astropy Table. 
Visualization.
vaex.dataframe.DataFrame.plot ([x, y, z, …]) 
Viz data in a 2d histogram/heatmap. 
vaex.dataframe.DataFrame.plot1d ([x, what, …]) 
Viz data in 1d (histograms, running means etc) 
vaex.dataframe.DataFrame.scatter (x, y[, …]) 
Viz (small amounts) of data in 2d using a scatter plot 
vaex.dataframe.DataFrame.plot_widget (x, y[, …]) 
Viz 1d, 2d or 3d in a Jupyter notebook 
vaex.dataframe.DataFrame.healpix_plot ([…]) 
Viz data in 2d using a healpix column. 
Statistics.
vaex.dataframe.DataFrame.count ([expression, …]) 
Count the number of non-NaN values (or all, if expression is None or “*”). 
vaex.dataframe.DataFrame.mean (expression[, …]) 
Calculate the mean for expression, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.std (expression[, …]) 
Calculate the standard deviation for the given expression, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.var (expression[, …]) 
Calculate the sample variance for the given expression, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.cov (x[, y, binby, …]) 
Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.correlation (x[, y, …]) 
Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.median_approx (…) 
Calculate the median, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.mode (expression[, …]) 
Calculate/estimate the mode. 
vaex.dataframe.DataFrame.min (expression[, …]) 
Calculate the minimum for given expressions, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.max (expression[, …]) 
Calculate the maximum for given expressions, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.minmax (expression) 
Calculate the minimum and maximum for expressions, possibly on a grid defined by binby. 
vaex.dataframe.DataFrame.mutual_information (x) 
Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby. 
vaex-core
Vaex is a library for dealing with larger than memory DataFrames (out of core).
The most important class (data structure) in vaex is the DataFrame. A DataFrame is obtained by either opening the example dataset:
>>> import vaex
>>> df = vaex.example()
Or using open() to open a file:
>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df3 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")
Or connecting to a remote server:
>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")
A few strong features of vaex are:
 Performance: works with huge tabular data, processing over a billion (> 10^9) rows/second.
 Expression system / Virtual columns: compute on the fly, without wasting ram.
 Memory efficient: no memory copies when doing filtering/selections/subsets.
 Visualization: directly supported, a oneliner is often enough.
 User friendly API: You will only need to deal with a DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.
 Very fast statistics on N-dimensional grids such as histograms, running mean, heatmaps.
Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.

vaex.open(path, convert=False, shuffle=False, copy_index=True, *args, **kwargs)
Open a DataFrame from file given by path.
Example:
>>> ds = vaex.open('sometable.hdf5')
>>> ds = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters:  path (str or list) – local or absolute path to file, or glob string, or list of paths
 convert – convert files to an hdf5 file for optimization, can also be a path
 shuffle (bool) – shuffle converted DataFrame or not
 args – extra arguments for file readers that need it
 kwargs – extra keyword arguments
 copy_index (bool) – copy index when source is read via pandas
Returns: return a DataFrame on success, otherwise None
Return type:
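The path argument accepts a single path, a glob string, or a list of paths. The following is a minimal sketch of how such an argument could be normalized into a flat list of file names using only the standard library. The helper `expand_paths` and the file names are hypothetical and only illustrate the idea; this is not how vaex implements open().

```python
import glob
import os
import tempfile


def expand_paths(path):
    """Hypothetical helper: normalize a path, glob string, or list of paths."""
    candidates = [path] if isinstance(path, str) else list(path)
    filenames = []
    for candidate in candidates:
        # glob.glob returns [] when nothing matches; keep the literal path then
        matches = sorted(glob.glob(candidate))
        filenames.extend(matches if matches else [candidate])
    return filenames


# Demonstration on temporary, made-up files
tmpdir = tempfile.mkdtemp()
for name in ("somedata1.csv", "somedata2.csv", "other.hdf5"):
    open(os.path.join(tmpdir, name), "w").close()

csv_files = expand_paths(os.path.join(tmpdir, "somedata*.csv"))
print([os.path.basename(f) for f in csv_files])
```

Sorting the matches keeps the order deterministic, which matters when several files are later concatenated into one DataFrame.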

vaex.from_arrays(**arrays)
Create an in memory DataFrame from numpy arrays.
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters: arrays – keyword arguments with arrays
Return type: DataFrame

vaex.from_dict(data)
Create an in memory dataset from a dict with column names as keys and list/numpy arrays as values.
Example
>>> data = {'A':[1,2,3],'B':['a','b','c']}
>>> vaex.from_dict(data)
  #    A    B
  0    1    'a'
  1    2    'b'
  2    3    'c'
Parameters: data – A dict of {columns:[value, value,…]}
Return type: DataFrame

vaex.from_items(*items)
Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters: items – list of [(name, numpy array), …]
Return type: DataFrame

vaex.from_arrow_table(table)
Creates a vaex DataFrame from an arrow Table.
Return type: DataFrame

vaex.from_csv(filename_or_buffer, copy_index=True, **kwargs)
Shortcut to read a csv file using pandas and convert to a DataFrame directly.
Return type: DataFrame

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)
Create an in memory DataFrame from an ascii file (whitespace separated by default).
>>> ds = vaex.from_ascii("table.asc")
>>> ds = vaex.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters:  path – file path
 seperator – value separator, by default whitespace, use “,” for comma separated values.
 names – If True, the first line is used for the column names, otherwise provide a list of strings with names
 skip_lines – skip lines at the start of the file
 skip_after – skip lines at the end of the file
 kwargs –
Return type:

vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')
Create an in memory DataFrame from a pandas DataFrame.
Parameters:  df (pandas.DataFrame) – Pandas DataFrame
 name – unique name for the DataFrame
>>> import vaex, pandas as pd
>>> df_pandas = pd.read_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
Return type: DataFrame

vaex.from_samp(username=None, password=None)
Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the DataFrame.
Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook.

vaex.open_many(filenames)
Open a list of filenames, and return a DataFrame with all DataFrames concatenated.
Parameters: filenames (list[str]) – list of filenames/paths
Return type: DataFrame

vaex.register_function(scope=None, as_property=False, name=None)
Decorator to register a new function with vaex.
Example:
>>> import vaex
>>> df = vaex.example()
>>> @vaex.register_function()
... def invert(x):
...     return 1/x
>>> df.x.invert()
>>> import numpy as np
>>> df = vaex.from_arrays(departure=np.arange('2015-01-01', '2015-12-05', dtype='datetime64'))
>>> @vaex.register_function(as_property=True, scope='dt')
... def dt_relative_day(x):
...     return vaex.functions.dt_dayofyear(x)/365.
>>> df.departure.dt.relative_day
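The registration pattern used in the examples above can be sketched in plain Python. This is only an illustration of a decorator factory that stores functions under an optional scope; the `registry` dict and the prefix-stripping rule are assumptions inferred from the `dt_relative_day` example, not vaex's actual internals.

```python
# Hypothetical registry, for illustration only; not vaex's internal structure
registry = {}


def register_function(scope=None, name=None):
    """Decorator factory that stores a function under an optional scope."""
    def wrapper(f):
        short_name = name or f.__name__
        # Strip a leading "<scope>_" so dt_relative_day is exposed as dt.relative_day
        if scope and short_name.startswith(scope + "_"):
            short_name = short_name[len(scope) + 1:]
        registry[(scope, short_name)] = f
        return f  # the original function stays usable as-is
    return wrapper


@register_function()
def invert(x):
    return 1 / x


@register_function(scope="dt")
def dt_relative_day(day_of_year):
    return day_of_year / 365.0


print(registry[(None, "invert")](4))
print(("dt", "relative_day") in registry)
```

Returning the undecorated function from the wrapper means registration has no effect on direct calls such as `invert(4)`.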

vaex.server(url, **kwargs)
Connect to hostname supporting the vaex web api.
Parameters: hostname (str) – hostname or ip address of server
Returns: a server object; note that it does not connect to the server yet, so this will always succeed
Return type: ServerRest

vaex.example(download=True)
Returns an example DataFrame which comes with vaex for testing/learning purposes.
Return type: DataFrame

vaex.app(*args, **kwargs)
Create a vaex app, the QApplication mainloop must be started.
In ipython notebook/jupyter do the following:
>>> import vaex.ui.main  # this causes the qt api level to be set properly
>>> import vaex
Next cell:
>>> %gui qt
Next cell:
>>> app = vaex.app()
From now on, you can run the app along with jupyter

vaex.delayed(f)
Decorator to transparently accept delayed computation.
Example:
>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
...                      shape=4, delay=True)
>>> @vaex.delayed
... def total_sum(sums):
...     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()
See the tutorial for a more complete example https://docs.vaex.io/en/latest/tutorial.html#Parallelcomputations
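The idea behind a delayed decorator can be sketched with a tiny promise-like object. The `Promise` class below is a hypothetical stand-in written for this illustration (single-argument functions only); it is not the promise implementation vaex uses.

```python
class Promise:
    """Hypothetical stand-in for a delayed result; not vaex's promise class."""

    def __init__(self):
        self._callbacks = []
        self._done = False
        self._value = None

    def then(self, callback):
        # Returns a new Promise that resolves with callback(value)
        result = Promise()

        def run(value):
            result.fulfill(callback(value))

        if self._done:
            run(self._value)
        else:
            self._callbacks.append(run)
        return result

    def fulfill(self, value):
        self._done = True
        self._value = value
        for callback in self._callbacks:
            callback(value)

    def get(self):
        if not self._done:
            raise RuntimeError("result not available yet")
        return self._value


def delayed(f):
    """Let f accept either a plain value or a Promise; always return a Promise."""
    def wrapper(arg):
        if isinstance(arg, Promise):
            return arg.then(f)
        done = Promise()
        done.fulfill(f(arg))
        return done
    return wrapper


@delayed
def total_sum(sums):
    return sum(sums)


delayed_sums = Promise()                # plays the role of ds.sum(..., delay=True)
sum_of_sums = total_sum(delayed_sums)   # nothing is computed yet
delayed_sums.fulfill([1.0, 2.0, 3.0])   # plays the role of ds.execute()
print(sum_of_sums.get())
```

The decorated function body is written as if its input were already available; the wrapper defers the call until the input promise is fulfilled.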
DataFrame class

class vaex.dataframe.DataFrame(name, column_names, executor=None)
Bases: object
All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.
Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.
All DataFrames have multiple selections, and all calculations are done on the whole DataFrame (default) or for the selection. The following example shows how to use the selection.
>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])

__delitem__(item)
Removes a (virtual) column from the DataFrame.
Note: this does not check whether the column is used in a virtual expression or in the filter, and may lead to issues. It is safer to use drop().

__getitem__(item)
Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.
Example:
>>> df['Lz']  # the expression 'Lz'
>>> df['Lz/2']  # the expression 'Lz/2'
>>> df[["Lz", "E"]]  # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied

__init__(name, column_names, executor=None)
Initialize self. See help(type(self)) for accurate signature.

__setitem__(name, value)
Convenient way to add a virtual column / expression to this DataFrame.
Example:
>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]

__weakref__
list of weak references to the object (if defined)

add_variable(name, expression, overwrite=True, unique=True)
Add a variable to a DataFrame.
A variable may refer to other variables, and virtual columns and expressions may refer to variables.
Example
>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
Parameters:  name (str) – name of virtual variable
 expression – expression for the variable

add_virtual_column(name, expression, unique=False)
Add a virtual column to the DataFrame.
Example:
>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> df.select("r < 10")
Parameters:  name (str) – name of virtual column
 expression – expression for the column
 unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2

byte_size(selection=False, virtual=False)
Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.

cat(i1, i2, format='html')
Display the DataFrame from row i1 till i2.
For format, see https://pypi.org/project/tabulate/
Parameters:  i1 (int) – Start row
 i2 (int) – End row.
 format (str) – Format to use, e.g. ‘html’, ‘plain’, ‘latex’

close_files()
Close any possible open file handles, the DataFrame will not be in a usable state afterwards.

col
Gives direct access to the columns only (useful for tab completion).
Convenient when working with ipython in combination with small DataFrames, since this gives tab completion.
Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.
Example
>>> df = vaex.example()
>>> df.plot(df.col.x, df.col.y)

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)
Generate a list of combinations for the possible expressions for the given dimension.
Parameters:  expressions_list – list of list of expressions, where the inner list defines the subspace
 dimension – if given, generates a subspace with all possible combinations for that dimension
 exclude – list of

correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None)
Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.
Example:
>>> df.correlation("x**2+y**2+z**2", "log(E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "log(E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
Parameters:  x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
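The formula cov[x,y]/(std[x]*std[y]) can be checked on a few numbers with plain Python. This is a sketch of the definition only, without binning or selections; the helper names are chosen for this illustration. Whether the covariance is normalized by N or N-1 does not matter here, because the factor cancels in the ratio.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    # Population covariance: E[(x - E[x]) * (y - E[y])]
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    # cov[x,y] / (std[x] * std[y]), with std[x] = sqrt(cov[x,x])
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly linear in x
z = [5.0, 4.0, 3.0, 2.0, 1.0]    # exactly anti-linear in x
print(correlation(x, y), correlation(x, z))
```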

count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)
Count the number of non-NaN values (or all, if expression is None or “*”).
Example:
>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:  expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
 edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at index 0, values smaller than the lower limit at index 1, and values larger than the upper limit at index -1)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
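Counting "on a grid defined by binby" amounts to mapping each value to a bin index between the limits and incrementing that bin. The sketch below shows this for one dimension in plain Python; treating the upper limit as exclusive and skipping NaN and out-of-range values are assumptions made for this illustration, not a statement about vaex's exact edge handling.

```python
import math


def binned_count(values, limits, shape):
    """Count values per bin on a regular 1d grid between limits."""
    vmin, vmax = limits
    counts = [0] * shape
    for v in values:
        # Skip NaN and values outside [vmin, vmax)
        if math.isnan(v) or not (vmin <= v < vmax):
            continue
        index = int((v - vmin) / (vmax - vmin) * shape)
        counts[min(index, shape - 1)] += 1  # guard against float rounding
    return counts


x = [-9.0, -3.0, -1.0, 0.5, 2.0, 7.5, float("nan"), 12.0]
print(binned_count(x, limits=[-10, 10], shape=4))
```

With limits [-10, 10] and shape=4 the bins are [-10, -5), [-5, 0), [0, 5) and [5, 10), so the NaN and the value 12.0 are not counted.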

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)
Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.
Either x and y are expressions, e.g.:
>>> df.cov("x", "y")
Or only the x argument is given with a list of expressions, e.g.:
>>> df.cov(["x", "y", "z"])
Example:
>>> df.cov("x", "y")
array([[ 53.54521742,   3.8123135 ],
       [  3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,   3.8123135 ,   0.98260511],
       [  3.8123135 ,  60.62257881,   1.21381057],
       [  0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,   3.02004780e-02],
        [  3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,   6.51984181e+00],
        [  6.51984181e+00,   9.68938284e+01]]])
Parameters:  x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 y – if previous argument is not a list, this argument should be given
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)

covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)
Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.
Example:
>>> df.covar("x**2+y**2+z**2", "log(E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "log(E+1)")/(df.std("x**2+y**2+z**2") * df.std("log(E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "log(E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters:  x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

describe(strings=True, virtual=True, selection=None)
Give a description of the DataFrame.
>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                 x          y          z
dtype      float64    float64    float64
count       330000     330000     330000
missing          0          0          0
mean     0.0671315  0.0535899  0.0169582
std        7.31746    7.78605    5.05521
min       -128.294   -71.5524   -44.3342
max        271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                   x         y          z
dtype        float64   float64    float64
count         164060    164060     164060
missing       165940    165940     165940
mean         5.13572  0.486786  0.0868073
std          5.18701   7.61621    5.02831
min      1.51635e-05  -71.5524   -44.3342
max          271.366   78.0724    40.2191
Parameters:  strings (bool) – Describe string columns or not
 virtual (bool) – Describe virtual columns or not
 selection – Optional selection to use.
Returns: Pandas dataframe

drop(columns, inplace=False, check=True)
Drop columns (or a single column).
Parameters:  columns – List of columns or a single column name
 inplace – Make modifications to self or return a new DataFrame
 check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.

dropna(drop_nan=True, drop_masked=True, column_names=None)
Create a shallow copy of a DataFrame, with filtering set using select_non_missing.
Parameters:  drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
 drop_masked – drop rows when there is a masked value in any of the columns
 column_names – The columns to consider, default: all (real, non-virtual) columns
Return type:

dtype(expression, internal=False)
Return the numpy dtype for the given expression. If the expression is not a column, the first row will be evaluated to get the dtype.

dtypes
Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).

evaluate(expression, i1=None, i2=None, out=None, selection=None)
Evaluate an expression, and return a numpy array with the results for the full column or a part of it.
Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.
To get partial results, use i1 and i2
Parameters:  expression (str) – Name/expression to evaluate
 i1 (int) – Start row index, default is the start (0)
 i2 (int) – End row index, default is the length of the DataFrame
 out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
 selection – selection to apply
Returns:

extract()
Return a DataFrame containing only the filtered rows.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).
If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()
Return type: DataFrame

fillna(value, fill_nan=True, fill_masked=True, column_names=None, prefix='__original_', inplace=False)
Return a DataFrame, where missing values/NaN are filled with ‘value’.
The original columns will be renamed, and by default they will be hidden columns. No data is lost.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filtering will be ignored (since it may change), you may want to consider running extract() first.
Example:
>>> import vaex
>>> import numpy as np
>>> x = np.array([3, 1, np.nan, 10, np.nan])
>>> df = vaex.from_arrays(x=x)
>>> df_filled = df.fillna(value=1, column_names=['x'])
>>> df_filled
  #    x
  0    3
  1    1
  2    1
  3   10
  4    1
Parameters:  value (float) – The value to use for filling nan or masked values.
 fill_nan (bool) – If True, fill np.nan values with value.
 fill_masked (bool) – If True, fill masked values with value.
 column_names (list) – List of column names in which to fill missing values.
 prefix (str) – The prefix to give the original columns.
 inplace – Make modifications to self or return a new DataFrame
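The renaming behaviour described above (the original column is kept under a prefixed name, so no data is lost) can be sketched with a plain dict of lists. This is only an illustration of the bookkeeping: the sketch builds a new filled list eagerly, which is an assumption made for simplicity and not how a vaex DataFrame represents the filled column.

```python
import math


def fillna(columns, value, column_names, prefix="__original_"):
    """Return a new column mapping in which NaN entries are replaced by value."""
    result = dict(columns)  # shallow copy: the column data itself is shared
    for name in column_names:
        original = result.pop(name)
        result[prefix + name] = original  # keep the untouched data under a new name
        result[name] = [value if math.isnan(v) else v for v in original]
    return result


columns = {"x": [3.0, 1.0, float("nan"), 10.0, float("nan")]}
filled = fillna(columns, value=1.0, column_names=["x"])
print(filled["x"])
print(sorted(filled))
```

The input mapping is left untouched, which mirrors the default inplace=False behaviour of returning a new DataFrame.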

first(expression, order_expression, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)
Return the first element of a binned expression, where the values in each bin are sorted by order_expression.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.first(df.x, df.y, shape=8)
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
array([4.81883764, 11.65378 , 9.70084476, 7.3025589 , 4.84954977, 8.47446537, 5.73602629, 10.18783 ])
Parameters:  expression – The value to be placed in the bin.
 order_expression – Order the values in the bins by this expression.
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
 edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at index 0, values smaller than the lower limit at index 1, and values larger than the upper limit at index -1)
Returns: Ndarray containing the first elements.
Return type: numpy.array

get_column_names(virtual=True, strings=True, hidden=False, regex=None)
Return a list of column names.
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
Parameters:  virtual – If False, skip virtual columns
 hidden – If False, skip hidden columns
 strings – If False, skip string columns
 regex – Only return column names matching the (optional) regular expression
Return type: list of str
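The effect of the regex argument can be reproduced with the standard re module. The sketch below assumes the pattern is matched from the start of each name (re.match); that choice is an assumption for illustration, though it reproduces the output shown in the example above. `filter_names` is a hypothetical helper, not a vaex function.

```python
import re


def filter_names(names, regex=None):
    """Keep only the names matching the optional regular expression."""
    if regex is None:
        return list(names)
    pattern = re.compile(regex)
    # re.match anchors the pattern at the start of the name
    return [name for name in names if pattern.match(name)]


names = ['x', 'x2', 'y', 's', 'r']
print(filter_names(names, regex='x.*'))
```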

get_current_row()
Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked.

get_private_dir(create=False)
Each DataFrame has a directory where files are stored for metadata etc.
Example
>>> import vaex
>>> ds = vaex.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaextesting_data_helmidezeeuw200010p.hdf5'
Parameters: create (bool) – if True, it will create the directory if it does not exist

get_selection(name='default')
Get the current selection object (mostly for internal use atm).

get_variable(name)
Returns the variable given by name, it will not evaluate it.
For evaluation, see DataFrame.evaluate_variable(), see also DataFrame.set_variable()

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)
Count non-missing values for expression on an array which represents healpix data.
Parameters:  expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
 healpix_expression – {healpix_max_level}
 healpix_max_level – {healpix_max_level}
 healpix_level – {healpix_level}
 binby – {binby}, these dimension follow the first healpix dimension.
 limits – {limits}
 shape – {shape}
 selection – {selection}
 delay – {delay}
 progress – {progress}
Returns:

healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)
Viz data in 2d using a healpix column.
Parameters:  healpix_expression – {healpix_max_level}
 healpix_max_level – {healpix_max_level}
 healpix_level – {healpix_level}
 what – {what}
 selection – {selection}
 grid – {grid}
 healpix_input – Specify if the healpix index is in “equatorial”, “galactic” or “ecliptic”.
 healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
 f – function to apply to the data
 colormap – matplotlib colormap
 grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max). (default is [min(f(grid)), max(f(grid))])
 image_size – size for the image that healpy uses for rendering
 nest – If the healpix data is in nested (True) or ring (False)
 figsize – If given, modify the matplotlib figure size. Example (14,9)
 interactive – (Experimental, uses healpy.mollzoom if True)
 title – Title of figure
 smooth – apply gaussian smoothing, in degrees
 show – Call matplotlib’s show (True) or not (False, default)
 rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are degrees.
Returns:

length_original()
The full length of the DataFrame, independent of what active_fraction is, or filtering. This is the real length of the underlying ndarrays.

length_unfiltered()
The length of the arrays that should be considered (respecting active range), but without filtering.

limits(expression, value=None, square=False, selection=None, delay=False, shape=None)
Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.
If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.
Example:
>>> df.limits("x")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 value – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage
(expression, percentage=99.73, square=False, delay=False)[source]¶ Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.
The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:
Example:
>>> df.limits_percentage("x", 90) array([12.35081376, 12.14858052] >>> df.percentile_approx("x", 5), df.percentile_approx("x", 95) (array([12.36813152]), array([ 12.13275818]))
NOTE: this value is approximated by calculating the cumulative distribution on a grid. NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 percentage (float) – Value between 0 and 100
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

materialize
(virtual_column, inplace=False)[source]¶ Returns a new DataFrame where the virtual column is turned into an in memory numpy array.
Example:
>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r') # now 'r' is a 'real' column (i.e. a numpy array)
Parameters:  virtual_column – the virtual column to materialize
 inplace – Make modifications to self or return a new DataFrame

max
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the maximum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997, 146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384, 1.99998057, 5.99983597, 9.99984646])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

mean
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the mean for expression, possibly on a grid defined by binby.
Example:
>>> df.mean("x")
0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([ 2.43483742, 4.41840721, 8.26742458, 15.53846476])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
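What "on a grid defined by binby" means can be sketched in plain Python: each row is assigned to a bin along the binby expression, and the statistic is accumulated per bin. The helper binned_mean is hypothetical and only handles one binby dimension; vaex itself computes this out-of-core and in parallel.

```python
def binned_mean(x, values, limits, shape):
    """Mean of `values` in `shape` equal-width bins of `x` over [lo, hi)."""
    lo, hi = limits
    sums = [0.0] * shape
    counts = [0] * shape
    for xi, vi in zip(x, values):
        if lo <= xi < hi:
            # Map xi to a bin index; the min() guards against rounding at the edge.
            index = min(int((xi - lo) / (hi - lo) * shape), shape - 1)
            sums[index] += vi
            counts[index] += 1
    # Empty bins have no defined mean; NaN marks them in this sketch.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


x = [0.5, 1.5, 2.5, 3.5, 0.25, 3.75]
y = [1.0, 2.0, 3.0, 4.0, 3.0, 6.0]
means = binned_mean(x, y, limits=[0, 4], shape=4)
print(means)  # one mean per bin of x
```

The same pattern (bin index, then a per-bin accumulator) applies to count, min, max and the other binned statistics; only the accumulator changes.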

median_approx
(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False)[source]¶ Calculate the median, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
 percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

min
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the minimum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.min("x")
array(128.293991)
>>> df.min(["x", "y"])
array([128.293991 , 71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322, 2.0000093 , 6.0004878 ])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

minmax
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.
Example:
>>> df.minmax("x")
array([128.293991, 271.365997])
>>> df.minmax(["x", "y"])
array([[128.293991 , 271.365997 ], [ 71.5523682, 146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443], [-5.99972439, -2.00002384], [-1.99991322, 1.99998057], [ 2.0000093 , 5.99983597], [ 6.0004878 , 9.99984646]])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

ml_label_encoder
(features=None, prefix='label_encoded_')¶ Requires vaex.ml: Create vaex.ml.transformations.LabelEncoder and fit it.

ml_lightgbm_model
(label, num_round, features=None, copy=False, param={}, classifier=False, prediction_name='lightgbm_prediction')¶ Requires vaex.ml: create a lightgbm model and train/fit it.
Parameters:  label – label to train/fit on
 num_round – number of rounds
 features – list of features to train on
 copy (bool) – Copy data or use the modified xgboost library for efficient transfer
 classifier (bool) – If true, return the classifier (will use argmax on the probabilities)
Return vaex.ml.lightgbm.LightGBMModel or LightGBMClassifier: fitted LightGBM model

ml_minmax_scaler
(features=None, feature_range=[0, 1])¶ Requires vaex.ml: Create vaex.ml.transformations.MinMaxScaler and fit it.

ml_one_hot_encoder
(features=None, one=1, zero=0, prefix='')¶ Requires vaex.ml: Create vaex.ml.transformations.OneHotEncoder and fit it.
Parameters:  features – list of features to one-hot encode
 one – what value to use instead of “1”
 zero – what value to use instead of “0”
Returns one_hot_encoder: vaex.ml.transformations.OneHotEncoder object

ml_pca
(n_components=2, features=None, progress=False)¶ Requires vaex.ml: Create vaex.ml.transformations.PCA and fit it.

ml_pygbm_model
(label, max_iter, features=None, param={}, classifier=False, prediction_name='pygbm_prediction', **kwargs)¶ Requires vaex.ml: create a pygbm model and train/fit it.
Parameters:  label – label to train/fit on
 max_iter – max number of iterations/trees
 features – list of features to train on
 classifier (bool) – If true, return the classifier (will use argmax on the probabilities)
Return vaex.ml.pygbm.PyGBMModel or vaex.ml.pygbm.PyGBMClassifier: fitted PyGBM model

ml_standard_scaler
(features=None, with_mean=True, with_std=True)¶ Requires vaex.ml: Create vaex.ml.transformations.StandardScaler and fit it.

ml_to_xgboost_dmatrix
(label, features=None, selection=None, blocksize=1000000)¶ label: ndarray containing the labels

ml_train_test_split
(test_size=0.2, strings=True, virtual=True, verbose=True)¶ Splits the dataset into a train and a test part, assuming it is shuffled.

ml_xgboost_model
(label, num_round, features=None, copy=False, param={}, prediction_name='xgboost_prediction')¶ Requires vaex.ml: create an XGBoost model and train/fit it.
Parameters:  label – label to train/fit on
 num_round – number of rounds
 features – list of features to train on
 copy (bool) – Copy data or use the modified xgboost library for efficient transfer
Return vaex.ml.xgboost.XGBModel: fitted XGBoost model

mode
(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]¶ Calculate/estimate the mode.

mutual_information
(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]¶ Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.
If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.
Example:
>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145, 0.18439181, 1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379, 0.18439181, 0.15118145]), [['E', 'Lz'], ['x', 'z'], ['x', 'y']])
Parameters:  x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 binby – List of expressions for constructing a binned grid
 sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
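Estimating mutual information "on a grid" can be sketched as follows: bin x and y into a 2d histogram, normalize it to a joint probability, and sum p(x, y) * log(p(x, y) / (p(x) p(y))) over the non-empty cells. The function below is a plain-Python illustration under assumptions of its own (natural logarithm, edge values clamped into the grid); it is not vaex's code, and its name and arguments are hypothetical.

```python
import math


def mutual_information_grid(x, y, limits_x, limits_y, shape):
    """Mutual information (in nats) estimated from a shape x shape histogram."""
    def bin_index(value, limits):
        lo, hi = limits
        index = int((value - lo) / (hi - lo) * shape)
        return min(max(index, 0), shape - 1)  # clamp edge values into the grid

    counts = [[0] * shape for _ in range(shape)]
    for xi, yi in zip(x, y):
        counts[bin_index(xi, limits_x)][bin_index(yi, limits_y)] += 1

    total = sum(sum(row) for row in counts)
    px = [sum(row) / total for row in counts]  # marginal distribution of x
    py = [sum(counts[i][j] for i in range(shape)) / total for j in range(shape)]

    mi = 0.0
    for i in range(shape):
        for j in range(shape):
            pxy = counts[i][j] / total
            if pxy > 0:  # empty cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


xs = [0.25, 0.25, 0.75, 0.75]
mi_dependent = mutual_information_grid(xs, xs, [0, 1], [0, 1], shape=2)
mi_independent = mutual_information_grid(xs, [0.25, 0.75, 0.25, 0.75], [0, 1], [0, 1], shape=2)
```

A fully dependent pair gives log(2) on this 2 x 2 grid, while an independent pair gives 0; the estimate depends on the grid resolution, which is what mi_shape and mi_limits control.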

nbytes
¶ Alias for df.byte_size(), see DataFrame.byte_size().

nop
(expression, progress=False, delay=False)[source]¶ Evaluates expression and drops the result; useful for benchmarking, since vaex is usually lazy

percentile_approx
(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False)[source]¶ Calculate the percentile given by percentage, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.
Example:
>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982], [-3.61036641], [0.01296306], [ 3.56697863], [ 7.45838367]])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
 percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
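The approximation mentioned in the NOTE can be sketched in plain Python: build a histogram with percentile_shape bins between the minimum and maximum, accumulate the counts, and interpolate inside the bin where the cumulative count crosses the requested percentage. This is an illustrative reconstruction of the idea, not vaex's implementation; the function name is hypothetical.

```python
def percentile_on_grid(values, percentage, percentile_shape=1024):
    """Approximate a percentile from a cumulative histogram ('minmax' limits)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / percentile_shape
    counts = [0] * percentile_shape
    for v in values:
        # The maximum value falls exactly on the upper edge; put it in the last bin.
        index = min(int((v - lo) / width), percentile_shape - 1)
        counts[index] += 1

    target = percentage / 100.0 * len(values)
    cumulative = 0
    for index, count in enumerate(counts):
        if count and cumulative + count >= target:
            # Interpolate linearly inside the bin that crosses the target count.
            fraction = (target - cumulative) / count
            return lo + (index + fraction) * width
        cumulative += count
    return hi


data = [i / 1000 for i in range(1001)]  # evenly spread over [0, 1]
p10 = percentile_on_grid(data, 10)
p50 = percentile_on_grid(data, 50)
p90 = percentile_on_grid(data, 90)
```

The error of such an estimate is bounded by roughly one bin width, which is why a larger percentile_shape gives a more accurate result at the cost of a larger histogram.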

plot
(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, colorbar_label=None, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'column': 'what', 'fade': 'selection', 'layer': 'z', 'row': 'subspace', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)¶ Viz data in a 2d histogram/heatmap.
Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers.
Instead of passing x and y, pass a list as the x argument for multiple panels. Give what a list of options to have multiple panels. When both are present, the panels are organized in a column/row order.
This method creates a 6-dimensional ‘grid’, where each dimension can map to a visual dimension. The grid dimensions are:
 x: shape determined by shape, content by x argument or the first dimension of each space
 y: shape determined by shape, content by y argument or the second dimension of each space
 z: related to the z argument
 selection: shape equals length of selection argument
 what: shape equals length of what argument
 space: shape equals length of x argument if multiple values are given
By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)
The visual dimensions are
 x: x coordinate on a plot / image (default maps to grid’s x)
 y: y coordinate on a plot / image (default maps to grid’s y)
 layer: each image in this dimension is blended together into one image (default maps to z)
 fade: each image is shown faded after the next image (default maps to selection)
 row: rows of subplots (default maps to space)
 column: columns of subplots (default maps to what)
All these mappings can be changed by the visual argument, some examples:
>>> df.plot('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])
Will plot each ‘what’ as a column.
>>> df.plot('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))
Will plot each selection as a column, instead of fading them on top of each other.
Parameters:  x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [[‘x’, ‘y’], [‘x’, ‘z’]], if multiple pairs are given, this dimension maps to rows by default
 y – y (by default maps to y)
 z – Expression to bin in the z direction, followed by a :start,end,shape signature, like ‘FeH:-3,1,5’, which will produce 5 layers between -3 and 1 (by default maps to layer)
 what – What to plot, count(*) will show a Nd histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)
 reduce –
 f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
 normalize – normalization function, currently only ‘normalize’ is supported
 normalize_axis – which axes to normalize on, None means normalize by the global maximum.
 vmin – instead of automatic normalization (using normalize and normalize_axis), scale the data between vmin and vmax to [0, 1]
 vmax – see vmin
 shape – shape/size of the nD histogram grid
 limits – list of [[xmin, xmax], [ymin, ymax]], or a description such as ‘minmax’, ‘99%’
 grid – if the binning is done before by yourself, you can pass it
 colormap – matplotlib colormap to use
 figsize – (x, y) tuple passed to pylab.figure for setting the figure size
 xlabel –
 ylabel –
 aspect –
 tight_layout – call pylab.tight_layout or not
 colorbar – plot a colorbar or not
 interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more
 return_extra –
Returns:

plot1d
(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, **kwargs)¶ Viz data in 1d (histograms, running means etc)
Example:
>>> df.plot1d(df.x)
>>> df.plot1d(df.x, limits=[0, 100], shape=100)
>>> df.plot1d(df.x, what='mean(y)', limits=[0, 100], shape=100)
If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:
>>> means = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.plot1d(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
Parameters:  x – Expression to bin in the x direction
 what – What to plot, count(*) will show a Nd histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
 grid – If the binning is done before by yourself, you can pass it
 facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
 limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’
 figsize – (x, y) tuple passed to pylab.figure for setting the figure size
 f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
 n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
 normalize_axis – which axes to normalize on, None means normalize by the global maximum.
 xlabel – String for label on x axis (may contain latex)
 ylabel – Same for y axis
 kwargs – extra argument passed to pylab.plot
 tight_layout – call pylab.tight_layout or not
Returns:

plot2d_contour
(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)¶ Plot contours on a 2D grid.
Parameters:  x – Expression to bin in the x direction
 y – Expression to bin in the y direction
 what – What to plot, count(*) will show a Nd histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)
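 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections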
 f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
 figsize – (x, y) tuple passed to pylab.figure for setting the figure size
 xlabel – label of the xaxis (defaults to param x)
 ylabel – label of the yaxis (defaults to param y)
 aspect – the aspect ratio of the figure
 levels – the contour levels to be passed on pylab.contour or pylab.contourf
 colorbar – plot a colorbar or not
 colorbar_label – the label of the colourbar (defaults to param what)
 colormap – matplotlib colormap to pass on to pylab.contour or pylab.contourf
 colors – the colours of the contours
 linewidths – the widths of the contours
 linestyles – the style of the contour lines
 vmin – instead of automatic normalization, scale the data between vmin and vmax
 vmax – see vmin
 grid – if the binning is done before by yourself, you can pass it
 show –

plot3d
(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]¶ Use at own risk, requires ipyvolume

plot_bq
(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]¶ Deprecated: use plot_widget

plot_widget
(x, y, z=None, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, backend='bqplot', **kwargs)[source]¶ Viz 1d, 2d or 3d in a Jupyter notebook
Note
This API is not fully settled and may change in the future
Example:
>>> df.plot_widget(df.x, df.y, backend='bqplot')
>>> df.plot_widget(df.pickup_longitude, df.pickup_latitude, backend='ipyleaflet')
Parameters: backend – Widget backend to use: ‘bqplot’, ‘ipyleaflet’, ‘ipyvolume’, ‘matplotlib’

propagate_uncertainties
(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]¶ Propagates uncertainties (full covariance matrix) for a set of virtual columns.
Covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off diagonals (covariance or correlation) by postfixes with “_correlation” or “_corr” for correlation or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)
Example
>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
Parameters:  columns – list of columns for which to calculate the covariance matrix.
 depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.
 cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
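For the example above (u = x + y, v = log10(x)), first-order propagation of the full covariance matrix can be written out by hand as J C Jᵀ, where J is the Jacobian of (u, v) with respect to (x, y) and C the covariance matrix of the depending variables. The sketch below does this numerically in plain Python, assuming uncorrelated x and y; vaex derives the corresponding expressions symbolically, so this only illustrates the arithmetic and the helper name is hypothetical.

```python
import math


def propagate(jacobian, covariance):
    """Return J * C * J^T for matrices given as nested lists."""
    n_out, n_in = len(jacobian), len(covariance)
    jc = [[sum(jacobian[i][k] * covariance[k][j] for k in range(n_in))
           for j in range(n_in)] for i in range(n_out)]
    return [[sum(jc[i][k] * jacobian[j][k] for k in range(n_in))
             for j in range(n_out)] for i in range(n_out)]


x, y = 1.0, 2.0
e_x, e_y = 0.1, 0.2

# Covariance of (x, y); zero off-diagonals assume x and y are uncorrelated.
covariance = [[e_x**2, 0.0],
              [0.0, e_y**2]]

# Rows: u = x + y and v = log10(x); columns: derivatives w.r.t. x and y.
jacobian = [[1.0, 1.0],
            [1.0 / (x * math.log(10)), 0.0]]

cov_uv = propagate(jacobian, covariance)
u_uncertainty = math.sqrt(cov_uv[0][0])
v_uncertainty = math.sqrt(cov_uv[1][1])
u_v_covariance = cov_uv[0][1]
```

The off-diagonal entry is non-zero even though x and y are uncorrelated, because u and v both depend on x; this is why the full covariance matrix is propagated rather than only the individual uncertainties.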

remove_virtual_meta
()[source]¶ Removes the file with the virtual column etc, it does not change the current virtual columns etc.

rename_column
(name, new_name, unique=False, store_in_state=True)[source]¶ Renames a column. Note that this only changes the in-memory name; the change is not reflected on disk.

sample
(n=None, frac=None, replace=False, weights=None, random_state=None)[source]¶ Returns a DataFrame with a random set of rows
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Provide either n or frac.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
Parameters:  n (int) – number of samples to take (default 1 if frac is None)
 frac (float) – fraction of the rows to take
 replace (bool) – If true, a row may be drawn multiple times
 weights (str or expression) – (unnormalized) probability that a row can be drawn
 random_state (int or RandomState) – seed or RandomState for reproducibility, when None a random seed is chosen
Returns: Returns a new DataFrame with a shallow copy/view of the underlying data
Return type: DataFrame
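The interplay of n, frac and replace can be sketched as index selection: the result is a set of row indices into the unchanged data, which is why no copy is needed. The helper below is a hypothetical plain-Python illustration that uses the standard library's random module, so its draws will not match the vaex output shown above, and it ignores weights.

```python
import random


def sample_indices(length, n=None, frac=None, replace=False, random_state=None):
    """Pick row indices for a sample of a table with `length` rows."""
    if n is not None and frac is not None:
        raise ValueError("provide either n or frac, not both")
    if n is None:
        # Default to a single row when neither n nor frac is given.
        n = 1 if frac is None else int(round(frac * length))
    rng = random.Random(random_state)
    if replace:
        # With replacement: the same row may be drawn several times.
        return [rng.randrange(length) for _ in range(n)]
    # Without replacement: every row appears at most once.
    return rng.sample(range(length), n)


two_rows = sample_indices(4, n=2, random_state=42)
shuffled = sample_indices(4, frac=1, random_state=42)
bootstrap = sample_indices(4, frac=1, replace=True, random_state=42)
```

With frac=1 and replace=False the result is a permutation of all rows (a shuffle); with replace=True it is a bootstrap sample of the same size.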

scatter
(x, y, xerr=None, yerr=None, cov=None, corr=None, s_expr=None, c_expr=None, labels=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, ellipse_kwargs={}, **kwargs)¶ Viz (small amounts) of data in 2d using a scatter plot
Convenience wrapper around pylab.scatter for working with small DataFrames or selections
Parameters:  x – Expression for x axis
 y – Idem for y
 s_expr – When given, use it for the s (size) argument of pylab.scatter
 c_expr – When given, use it for the c (color) argument of pylab.scatter
 labels – Annotate the points with these text values
 selection – Single selection expression, or None
 length_limit – maximum number of rows it will plot
 length_check – should we do the maximum row check or not?
 label – label for the legend
 xlabel – label for x axis, if None .label(x) is used
 ylabel – label for y axis, if None .label(y) is used
 errorbar_kwargs – extra dict with arguments passed to plt.errorbar
 kwargs – extra arguments passed to pylab.scatter
Returns:

select
(boolean_expression, mode='replace', name='default', executor=None)[source]¶ Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.
Selections are recorded in a history tree, per name, undo/redo can be done for them separately.
Parameters:  boolean_expression (str) – Any valid column expression, with comparison operators
 mode (str) – Possible boolean operator: replace/and/or/xor/subtract
 name (str) – history tree or selection ‘slot’ to use
 executor –
Returns:
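How a new boolean expression is combined with the previous selection under each mode can be sketched with plain boolean masks. This is an illustration only: combine is a hypothetical helper, and the reading of ‘subtract’ as "previously selected and not newly selected" is an assumption.

```python
def combine(previous, new, mode="replace"):
    """Combine two boolean row masks according to a selection mode."""
    if mode == "replace":
        return list(new)
    operators = {
        "and": lambda p, n: p and n,
        "or": lambda p, n: p or n,
        "xor": lambda p, n: p != n,
        "subtract": lambda p, n: p and not n,  # assumed meaning of 'subtract'
    }
    operator = operators[mode]
    return [operator(p, n) for p, n in zip(previous, new)]


x = [1, 2, 3, 4, 5]
previous = [xi > 2 for xi in x]  # e.g. an earlier select("x > 2")
new = [xi < 5 for xi in x]       # e.g. a later select("x < 5", mode=...)

selected_and = combine(previous, new, "and")
selected_or = combine(previous, new, "or")
selected_xor = combine(previous, new, "xor")
selected_subtract = combine(previous, new, "subtract")
```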

select_box
(spaces, limits, mode='replace', name='default')[source]¶ Select an n-dimensional rectangular box bounded by limits.
The following examples are equivalent:
>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:  spaces – list of expressions
 limits – sequence of shape [(x1, x2), (y1, y2)]
 mode –
 name –
Returns:
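A box selection amounts to requiring that every expression lies within its own (min, max) pair. The sketch below shows this for in-memory lists; box_mask is a hypothetical helper, and treating both bounds as inclusive is an assumption, not something the description above specifies.

```python
def box_mask(columns, limits):
    """True for rows whose value in every column lies within its (lo, hi) pair."""
    n_rows = len(columns[0])
    return [
        all(lo <= column[row] <= hi for column, (lo, hi) in zip(columns, limits))
        for row in range(n_rows)
    ]


x = [-1, 5, 11, 3]
y = [0.5, 0.5, 0.5, 2.0]
mask = box_mask([x, y], [(0, 10), (0, 1)])
print(mask)  # only the second row lies inside the box
```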

select_circle
(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]¶ Select a circular region centred on xc, yc, with a radius of r.
Example:
>>> df.select_circle('x','y',2,3,1)
Parameters:  x – expression for the x space
 y – expression for the y space
 xc – location of the centre of the circle in x
 yc – location of the centre of the circle in y
 r – the radius of the circle
 name – name of the selection
 mode –
Returns:
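The circular selection corresponds to the test (x - xc)² + (y - yc)² ≤ r². A plain-Python sketch follows; circle_mask is a hypothetical helper, and the interpretation of inclusive as "points exactly on the circle are selected" is an assumption.

```python
def circle_mask(x, y, xc, yc, r, inclusive=True):
    """True for points inside the circle centred on (xc, yc) with radius r."""
    mask = []
    for xi, yi in zip(x, y):
        distance_squared = (xi - xc) ** 2 + (yi - yc) ** 2
        if inclusive:
            mask.append(distance_squared <= r ** 2)
        else:
            mask.append(distance_squared < r ** 2)
    return mask


x = [2, 3, 4]
y = [3, 3, 4]
inside = circle_mask(x, y, xc=2, yc=3, r=1)
strictly_inside = circle_mask(x, y, xc=2, yc=3, r=1, inclusive=False)
```

Comparing squared distances avoids a square root per row and gives the same result.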

select_ellipse
(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]¶ Select an elliptical region centred on xc, yc, with a certain width, height and angle.
Example:
>>> df.select_ellipse('x','y', 2, 1, 5,1, 30, name='my_ellipse')
Parameters:  x – expression for the x space
 y – expression for the y space
 xc – location of the centre of the ellipse in x
 yc – location of the centre of the ellipse in y
 width – the width of the ellipse (diameter)
 height – the height of the ellipse (diameter)
 angle – (degrees) orientation of the ellipse, counterclockwise measured from the y axis
 name – name of the selection
 mode –
Returns:

select_inverse
(name='default', executor=None)[source]¶ Invert the selection, i.e. what is selected will not be, and vice versa
Parameters:  name (str) –
 executor –
Returns:

select_lasso
(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]¶ For performance reasons, a lasso selection is handled differently.
Parameters:  expression_x (str) – Name/expression for the x coordinate
 expression_y (str) – Name/expression for the y coordinate
 xsequence – list of x numbers defining the lasso, together with y
 ysequence –
 mode (str) – Possible boolean operator: replace/and/or/xor/subtract
 name (str) –
 executor –
Returns:

select_non_missing
(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]¶ Create a selection that selects rows having non missing values for all columns in column_names.
The name mirrors pandas; no rows are actually dropped, but a mask is kept to keep track of the selection
Parameters:  drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
 drop_masked – drop rows when there is a masked value in any of the columns
 column_names – The columns to consider, default: all (real, non-virtual) columns
 mode (str) – Possible boolean operator: replace/and/or/xor/subtract
 name (str) – history tree or selection ‘slot’ to use
Returns:

select_rectangle
(x, y, limits, mode='replace', name='default')[source]¶ Select a 2d rectangular box in the space given by x and y, bounded by limits.
Example:
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:  x – expression for the x space
 y – expression for the y space
 limits – sequence of shape [(x1, x2), (y1, y2)]
 mode –

set_active_fraction
(value)[source]¶ Sets the active_fraction, set picked row to None, and remove selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range
(i1, i2)[source]¶ Sets the active range to (i1, i2), sets the picked row to None, and removes the selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_selection
(selection, name='default', executor=None)[source]¶ Sets the selection object
Parameters:  selection – Selection object
 name – selection ‘slot’
 executor –
Returns:

set_variable
(name, expression_or_value, write=True)[source]¶ Set the variable to an expression or value defined by expression_or_value.
Example
>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
Parameters:  name – Name of the variable
 write – write variable to meta file
 expression – value or expression
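The distinction between a stored expression and its evaluated value can be sketched in plain Python. This is only an illustration under stated assumptions: the module-level dict and the two helper functions are hypothetical stand-ins, and eval() is used for brevity (it is not safe for untrusted input).

```python
variables = {}


def set_variable(name, expression_or_value):
    """Store either a plain value or an expression string."""
    variables[name] = expression_or_value


def evaluate_variable(name):
    """Evaluate a variable, resolving only the variables its expression uses."""
    value = variables[name]
    if not isinstance(value, str):
        return value
    code = compile(value, "<variable>", "eval")
    namespace = {n: evaluate_variable(n) for n in code.co_names if n in variables}
    return eval(code, {"__builtins__": {}}, namespace)


set_variable("a", 2.0)
set_variable("b", "a**2")
result = evaluate_variable("b")
```

The stored value of "b" stays the string 'a**2'; only evaluation turns it into a number.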

sort
(by, ascending=True, kind='quicksort')[source]¶ Return a sorted DataFrame, sorted by the expression ‘by’
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filtering will be ignored (since filters may change); you may want to consider running extract() first.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
Parameters:  by (str or expression) – expression to sort by
 ascending (bool) – ascending (default, True) or descending (False)
 kind (str) – kind of algorithm to use (passed to numpy.argsort)
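Sorting without copying the data can be pictured as computing an index order and reading the columns through it. A minimal plain-Python sketch, assuming the same four rows as the example; sorted_indices is a hypothetical helper and not how vaex implements sort.

```python
s = ["a", "b", "c", "d"]
x = [1, 2, 3, 4]
y = [(xi - 1.8) ** 2 for xi in x]


def sorted_indices(values, ascending=True):
    """Return row indices in sorted order; the values themselves are not moved."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=not ascending)


order = sorted_indices(y, ascending=False)
sorted_s = [s[i] for i in order]  # a view of column s through the index order
```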

split
(frac)[source]¶ Returns a list containing ordered subsets of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split(frac=0.3):
...     print(dfs.x.values)
...
[0 1 2]
[3 4 5 6 7 8 9]
>>> for dfs in df.split(frac=[0.2, 0.3, 0.5]):
...     print(dfs.x.values)
...
[0 1]
[2 3 4]
[5 6 7 8 9]
Parameters: frac (float/list) – If a single number, the DataFrame is split in two portions, the first of which has the relative size given by this parameter. If a list, as many portions are produced as there are elements in the list, where each element defines the relative fraction of that portion.
Returns: A list of DataFrames.
Return type: list
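How fractions map to ordered row ranges can be sketched as follows. This is an illustrative plain-Python version under assumptions: split_bounds is a hypothetical helper, and rounding the cumulative fraction to the nearest row is a choice made here, not a documented vaex rule.

```python
def split_bounds(n_rows, frac):
    """Turn a fraction (or list of fractions) into consecutive (start, stop) row ranges."""
    fractions = [frac, 1 - frac] if isinstance(frac, (int, float)) else list(frac)
    bounds, start, cumulative = [], 0, 0.0
    for i, fraction in enumerate(fractions):
        cumulative += fraction
        # The last portion always ends at the final row, so rounding never drops rows.
        stop = n_rows if i == len(fractions) - 1 else round(cumulative * n_rows)
        bounds.append((start, stop))
        start = stop
    return bounds


x = list(range(10))
two_parts = [x[i1:i2] for i1, i2 in split_bounds(len(x), 0.3)]
three_parts = [x[i1:i2] for i1, i2 in split_bounds(len(x), [0.2, 0.3, 0.5])]
```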

split_random
(frac, random_state=None)[source]¶ Returns a list containing random portions of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> import numpy as np
>>> np.random.seed(111)
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split_random(frac=0.3, random_state=42):
...     print(dfs.x.values)
...
[8 1 5]
[0 7 2 9 4 3 6]
>>> for dfs in df.split_random(frac=[0.2, 0.3, 0.5], random_state=42):
...     print(dfs.x.values)
...
[8 1]
[5 0 7]
[2 9 4 3 6]
Parameters:  frac (float/list) – If a single number, the DataFrame is split in two portions, the first of which has the relative size given by this parameter. If a list, as many portions are produced as there are elements in the list, where each element defines the relative fraction of that portion.
 random_state (int) – (default, None) Random number seed for reproducibility.
Returns: A list of DataFrames.
Return type: list
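A random split can be thought of as shuffling the row indices with a seeded generator and then cutting the shuffled order into portions. The sketch below uses Python's random module and a hypothetical helper; it will not reproduce the exact row order vaex prints, only the reproducibility property that random_state provides.

```python
import random


def split_random_indices(n_rows, frac, random_state=None):
    """Shuffle row indices with a seeded generator, then cut them in two portions."""
    indices = list(range(n_rows))
    random.Random(random_state).shuffle(indices)
    n_first = round(frac * n_rows)
    return indices[:n_first], indices[n_first:]


first, second = split_random_indices(10, 0.3, random_state=42)
first_again, _ = split_random_indices(10, 0.3, random_state=42)
```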

state_get
()[source]¶ Return the internal state of the DataFrame in a dictionary
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_get()
{'active_range': [0, 1],
 'column_names': ['x', 'y', 'r'],
 'description': None,
 'descriptions': {},
 'functions': {},
 'renamed_columns': [],
 'selections': {'__filter__': None},
 'ucds': {},
 'units': {},
 'variables': {},
 'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}

state_load
(f, use_active_range=False)[source]¶ Load a state previously stored by DataFrame.state_write(), see also DataFrame.state_set().

state_set
(state, use_active_range=False)[source]¶ Sets the internal state of the df
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df
  #    x    y        r
  0    1    2  2.23607
>>> state = df.state_get()
>>> state
{'active_range': [0, 1],
 'column_names': ['x', 'y', 'r'],
 'description': None,
 'descriptions': {},
 'functions': {},
 'renamed_columns': [],
 'selections': {'__filter__': None},
 'ucds': {},
 'units': {},
 'variables': {},
 'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}
>>> df2 = vaex.from_scalars(x=3, y=4)
>>> df2.state_set(state)  # now the virtual columns are 'copied'
>>> df2
  #    x    y    r
  0    3    4    5
Parameters:  state – dict as returned by DataFrame.state_get().
 use_active_range (bool) – Whether to use the active range or not.
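Because virtual columns are stored in the state as expression strings, the same state can be applied to different data. A minimal plain-Python sketch of that idea, assuming a state dict shaped like the one returned by state_get(); evaluate_virtual_column is a hypothetical helper and eval() is used only to keep the sketch short.

```python
state = {"virtual_columns": {"r": "(((x ** 2) + (y ** 2)) ** 0.5)"}}


def evaluate_virtual_column(state, name, columns):
    """Evaluate a stored virtual-column expression against the given column values."""
    expression = state["virtual_columns"][name]
    # eval() is not safe for untrusted expressions; it is used here for brevity.
    return eval(expression, {"__builtins__": {}}, dict(columns))


r1 = evaluate_virtual_column(state, "r", {"x": 1, "y": 2})
r2 = evaluate_virtual_column(state, "r", {"x": 3, "y": 4})
```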

state_write
(f)[source]¶ Write the internal state to a json or yaml file (see DataFrame.state_get()).
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
  "virtual_columns": {
    "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
  },
  "column_names": [
    "x",
    "y",
    "r"
  ],
  "renamed_columns": [],
  "variables": {
    "pi": 3.141592653589793,
    "e": 2.718281828459045,
    "km_in_au": 149597870.7,
    "seconds_per_year": 31557600
  },
  "functions": {},
  "selections": {
    "__filter__": null
  },
  "ucds": {},
  "units": {},
  "descriptions": {},
  "description": null,
  "active_range": [
    0,
    1
  ]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
  __filter__: null
ucds: {}
units: {}
variables:
  pi: 3.141592653589793
  e: 2.718281828459045
  km_in_au: 149597870.7
  seconds_per_year: 31557600
virtual_columns:
  r: (((x ** 2) + (y ** 2)) ** 0.5)
Parameters: f (str) – filename (ending in .json or .yaml)

std
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the standard deviation for the given expression, possibly on a grid defined by binby
Example:
>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

sum
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False)[source]¶ Calculate the sum for the given expression, possibly on a grid defined by binby
Example:
>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
         1.40008776e+08])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
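To make the roles of binby, limits and shape concrete, here is a small plain-Python sketch of a one-dimensional binned sum. It is an illustration only: binned_sum and the sample data are hypothetical, and ignoring rows outside the limits is an assumption made for this sketch.

```python
def binned_sum(values, by, limits, shape):
    """Sum `values` on a regular 1d grid of `shape` bins spanning `limits` of `by`."""
    vmin, vmax = limits
    totals = [0.0] * shape
    for value, b in zip(values, by):
        if not (vmin <= b < vmax):
            continue  # Assumption: rows outside the limits are ignored.
        index = int((b - vmin) / (vmax - vmin) * shape)
        totals[index] += value
    return totals


L = [1.0, 2.0, 3.0, 4.0, 5.0]
E = [0.1, 0.3, 0.6, 0.9, 1.5]
grid = binned_sum(L, E, limits=(0.0, 1.0), shape=4)
```

Without a binby argument the statistic collapses to a single scalar, here simply sum(L).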

take
(indices, unfiltered=False)[source]¶ Returns a DataFrame containing only rows indexed by indices
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
  #  s      x
  0  a      1
  1  c      3
Parameters:  indices – sequence (list or numpy array) with row numbers
 unfiltered – (for internal use) The indices refer to the unfiltered data.
Returns: DataFrame which is a shallow copy of the original data.
Return type:

to_arrow_table
(column_names=None, selection=None, strings=True, virtual=False)[source]¶ Returns an arrow Table object containing the arrays corresponding to the evaluated data
Parameters:  column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: pyarrow.Table object

to_astropy_table
(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]¶ Returns an astropy table object containing the ndarrays corresponding to the evaluated data
Parameters:  column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
 index – if this column is given it is used for the index of the DataFrame
Returns: astropy.table.Table object

to_copy
(column_names=None, selection=None, strings=True, virtual=False, selections=True)[source]¶ Return a copy of the DataFrame. If selection is None, the data is not copied; the new DataFrame only holds a reference to it.
Parameters:  column_names – list of column names, to copy, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
 selections – copy selections to a new DataFrame
Returns: DataFrame

to_dict
(column_names=None, selection=None, strings=True, virtual=False)[source]¶ Return a dict containing the ndarray corresponding to the evaluated data
Parameters:  column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: dict

to_items
(column_names=None, selection=None, strings=True, virtual=False)[source]¶ Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data
Parameters:  column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns: list of (name, ndarray) pairs

to_pandas_df
(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]¶ Return a pandas DataFrame containing the ndarray corresponding to the evaluated data
If index_name is given, that column is used for the index of the dataframe.
Example:
>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
Parameters:  column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 strings – argument passed to DataFrame.get_column_names when column_names is None
 virtual – argument passed to DataFrame.get_column_names when column_names is None
 index_name – if this column is given it is used for the index of the DataFrame
Returns: pandas.DataFrame object

trim
(inplace=False)[source]¶ Return a DataFrame, where all columns are ‘trimmed’ by the active range.
For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Parameters: inplace – Make modifications to self or return a new DataFrame
Return type: DataFrame

ucd_find
(ucds, exclude=[])[source]¶ Find a set of columns (names) which have the ucd, or part of the ucd.
Prefixed with a ^, it will only match the first part of the ucd.
Example
>>> df.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> df.ucd_find('pos.eq.ra', 'doesnotexist')
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')
'dec'
>>> df.ucd_find('^meta.main')

unit
(expression, default=None)[source]¶ Returns the unit (an astropy.units.Unit object) for the expression.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
Parameters:  expression – Expression, which can be a column name
 default – if no unit is known, it will return this
Returns: The resulting unit of the expression
Return type: astropy.units.Unit

var
(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Calculate the sample variance for the given expression, possibly on a grid defined by binby
Example:
>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:  expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
 binby – List of expressions for constructing a binned grid
 limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
 shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
 selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
 delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
 progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns: Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
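The example shows that the binned standard deviation is the square root of the binned variance. A plain-Python sketch of that relation follows; the data, the precomputed bin indices and the helper binned are hypothetical, and the population variance (division by N) is used as an assumption, since the text does not state which normalisation vaex applies.

```python
import math
import statistics

vz = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
bin_index = [0, 0, 0, 0, 1, 1]  # hypothetical precomputed bin for every row


def binned(values, bins, n_bins, stat):
    """Apply `stat` to the values that fall into each bin."""
    groups = [[] for _ in range(n_bins)]
    for value, b in zip(values, bins):
        groups[b].append(value)
    return [stat(group) for group in groups]


# Assumption: population variance (ddof=0).
var_grid = binned(vz, bin_index, 2, statistics.pvariance)
std_grid = binned(vz, bin_index, 2, statistics.pstdev)
```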

DataFrameLocal class¶

class vaex.dataframe.DataFrameLocal(name, path, column_names)[source]¶ Bases: vaex.dataframe.DataFrame
Base class for DataFrames that work with local file/data

__array__
(dtype=None)[source]¶ Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is Fortran (column-major), so all values of one column are contiguous in memory for performance reasons.
Note this returns the same result as:
>>> np.array(ds)
If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).

__init__
(name, path, column_names)[source]¶ Initialize self. See help(type(self)) for accurate signature.

binby
(by=None, agg=None)[source]¶ Return a BinBy object, or a DataArray when agg is not None.
The binby operation does not return a ‘flat’ DataFrame; instead it returns an N-d grid in the form of an xarray.
Parameters: agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, the binby object is returned.
Returns: DataArray or BinBy object.

categorize
(column, labels=None, check=True)[source]¶ Mark column as categorical, with given labels, assuming zero indexing

compare
(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]¶ Compare two DataFrames and report their difference, use with care for large DataFrames

concat
(other)[source]¶ Concatenates two DataFrames, adding the rows of the other DataFrame to the current one, returned in a new DataFrame.
No copy of the data is made.
Parameters: other – The other DataFrame that is concatenated with this DataFrame
Returns: New DataFrame with the rows concatenated
Return type: DataFrameConcatenated

data
¶ Gives direct access to the data as numpy arrays.
Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real (i.e. non-virtual) columns can be accessed; to get the data from virtual columns, use DataFrame.evaluate(…).
Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.
Example:
>>> import vaex, numpy as np
>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)

evaluate
(expression, i1=None, i2=None, out=None, selection=None, filtered=True, internal=False)[source]¶ The local implementation of
DataFrame.evaluate()

export
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a file written with arrow
Parameters:  df (DataFrameLocal) – DataFrame to export
 path (str) – path for file
 column_names (list[str]) – list of column names to export or None for all columns
 byteorder (str) – = for native, < for little endian and > for big endian (not supported for fits)
 shuffle (bool) – export rows in random order
 selection (bool) – export selection or not
 progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
 sort (str) – expression used for sorting the output
 ascending (bool) – sort ascending (True) or descending
 virtual (bool) – When True, export virtual columns
Returns:

export_arrow
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a file written with arrow
Parameters:  df (DataFrameLocal) – DataFrame to export
 path (str) – path for file
 column_names (list[str]) – list of column names to export or None for all columns
 byteorder (str) – = for native, < for little endian and > for big endian
 shuffle (bool) – export rows in random order
 selection (bool) – export selection or not
 progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
 sort (str) – expression used for sorting the output
 ascending (bool) – sort ascending (True) or descending
 virtual (bool) – When True, export virtual columns
Returns:

export_fits
(path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format
Parameters:  df (DataFrameLocal) – DataFrame to export
 path (str) – path for file
 column_names (list[str]) – list of column names to export or None for all columns
 shuffle (bool) – export rows in random order
 selection (bool) – export selection or not
 progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
 sort (str) – expression used for sorting the output
 ascending (bool) – sort ascending (True) or descending
 virtual (bool) – When True, export virtual columns
Returns:

export_hdf5
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a vaex hdf5 file
Parameters:  df (DataFrameLocal) – DataFrame to export
 path (str) – path for file
 column_names (list[str]) – list of column names to export or None for all columns
 byteorder (str) – = for native, < for little endian and > for big endian
 shuffle (bool) – export rows in random order
 selection (bool) – export selection or not
 progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
 sort (str) – expression used for sorting the output
 ascending (bool) – sort ascending (True) or descending
 virtual (bool) – When True, export virtual columns
Returns:

export_parquet
(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]¶ Exports the DataFrame to a parquet file
Parameters:  df (DataFrameLocal) – DataFrame to export
 path (str) – path for file
 column_names (list[str]) – list of column names to export or None for all columns
 byteorder (str) – = for native, < for little endian and > for big endian
 shuffle (bool) – export rows in random order
 selection (bool) – export selection or not
 progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
 sort (str) – expression used for sorting the output
 ascending (bool) – sort ascending (True) or descending
 virtual (bool) – When True, export virtual columns
Returns:

groupby
(by=None, agg=None)[source]¶ Return a GroupBy object, or a DataFrame when agg is not None.
Examples:
>>> import vaex
>>> import numpy as np
>>> np.random.seed(42)
>>> x = np.random.randint(1, 5, 10)
>>> y = x**2
>>> df = vaex.from_arrays(x=x, y=y)
>>> df.groupby(df.x, agg='count')
  #    x    y_count
  0    3          4
  1    4          2
  2    1          3
  3    2          1
>>> df.groupby(df.x, agg=[vaex.agg.count('y'), vaex.agg.mean('y')])
  #    x    y_count    y_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
>>> df.groupby(df.x, agg={'z': [vaex.agg.count('y'), vaex.agg.mean('y')]})
  #    x    z_count    z_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
Example using datetime:
>>> import vaex
>>> import numpy as np
>>> t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
>>> y = np.arange(len(t))
>>> df = vaex.from_arrays(t=t, y=y)
>>> df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'y' : 'sum'})
  #  t                      y
  0  2015-01-01 00:00:00   21
  1  2015-01-08 00:00:00   70
  2  2015-01-15 00:00:00  119
  3  2015-01-22 00:00:00  168
  4  2015-01-29 00:00:00   87
Parameters: agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, the groupby object is returned.
Returns: DataFrame or GroupBy object.
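What a count and a mean aggregate compute per group can be sketched in plain Python. The x values are hard-coded here so that they are consistent with the counts shown in the example; the dictionary layout is an illustration, not the DataFrame that vaex returns.

```python
from collections import defaultdict

x = [3, 4, 1, 3, 3, 4, 1, 1, 3, 2]
y = [xi ** 2 for xi in x]

# Collect the y values of every group, keyed by x, in order of first appearance.
groups = defaultdict(list)
for key, value in zip(x, y):
    groups[key].append(value)

# One (count, mean) pair per group, mirroring the y_count and y_mean columns.
summary = {key: (len(vals), sum(vals) / len(vals)) for key, vals in groups.items()}
```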

is_local
()[source]¶ The local implementation of DataFrame.is_local(), always returns True.

join
(other, on=None, left_on=None, right_on=None, lsuffix='', rsuffix='', how='left', inplace=False)[source]¶ Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on
If neither on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).
Note: Filters are ignored when joining; the full DataFrame is joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows), consider running DataFrame.extract() first.
Example:
>>> import vaex, numpy as np
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
Parameters:  other – Other DataFrame to join with (the right side)
 on – default key for the left table (self)
 left_on – key for the left table (self), overrides on
 right_on – key for the right table (other), overrides on
 lsuffix – suffix to add to the left column names in case of a name collision
 rsuffix – similar for the right
 how – how to join: ‘left’ keeps all rows on the left and adds columns (with possible missing values); ‘right’ is similar, with self and other swapped.
 inplace – Make modifications to self or return a new DataFrame
Returns:
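The semantics of how='left' with left_on/right_on can be illustrated on the same small tables as the example. This is a plain-Python sketch under assumptions: left_join is a hypothetical helper, missing values are represented as None, and if the right key had duplicates the last match would win.

```python
left = {"a": ["a", "b", "c"], "x": [1, 2, 3]}
right = {"b": ["a", "b", "d"], "y": [1, 4, 9]}


def left_join(left, right, left_on, right_on):
    """Keep every left row and add the right columns where the keys match."""
    lookup = {key: i for i, key in enumerate(right[right_on])}
    joined = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        joined[name] = [
            values[lookup[key]] if key in lookup else None  # None marks a missing value
            for key in left[left_on]
        ]
    return joined


joined = left_join(left, right, left_on="a", right_on="b")
```

Row 'c' has no partner on the right, so its b and y entries are missing.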

label_encode
(column, values=None, inplace=False)¶ Deprecated: use is_category
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].

length
(selection=False)[source]¶ Get the length of the DataFrame, for the selection or the whole DataFrame.
If selection is False, it returns len(df).
TODO: Implement this in DataFrameRemote, and move the method up in DataFrame.length()
Parameters: selection – When True, will return the number of selected rows
Returns:

ordinal_encode
(column, values=None, inplace=False)[source]¶ Deprecated: use is_category
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].
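Ordinal encoding maps every distinct value to an integer code in the range [0, len(values)-1]. A minimal plain-Python sketch of the mapping; the function below is a hypothetical stand-in, and using the sorted unique values when none are given is an assumption.

```python
def ordinal_encode(column, values=None):
    """Replace each entry of `column` by its position in `values`."""
    if values is None:
        values = sorted(set(column))  # Assumption: default order is sorted unique values.
    codes = {value: code for code, value in enumerate(values)}
    return [codes[v] for v in column], list(values)


encoded, labels = ordinal_encode(["red", "green", "red", "blue"])
```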

selected_length
(selection='default')[source]¶ The local implementation of
DataFrame.selected_length()

Expression class¶

class vaex.expression.Expression(ds, expression)[source]¶ Bases: object
Expression class

__weakref__
¶ list of weak references to the object (if defined)

abs
()¶ absolute(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])
Calculate the absolute value element-wise.
np.abs is a shorthand for this function.
Parameters:  x (array_like) – Input array.
 out (ndarray, None, or tuple of ndarray and None, optional) – A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where (array_like, optional) – Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs – For other keyword-only arguments, see the ufunc docs.
Returns: absolute (ndarray) – An ndarray containing the absolute value of each element in x. For complex input, a + ib, the absolute value is \(\sqrt{ a^2 + b^2 }\). This is a scalar if x is a scalar.
Examples:
>>> x = np.array([-1.2, 1.2])
>>> np.absolute(x)
array([ 1.2,  1.2])
>>> np.absolute(1.2 + 1j)
1.5620499351813308
Plot the function over [-10, 10]:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(start=-10, stop=10, num=101)
>>> plt.plot(x, np.absolute(x))
>>> plt.show()
Plot the function over the complex plane:
>>> xx = x + 1j * x[:, np.newaxis]
>>> plt.imshow(np.abs(xx), extent=[-10, 10, -10, 10], cmap='gray')
>>> plt.show()
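The complex case, where the absolute value of a + ib is \(\sqrt{ a^2 + b^2 }\), can be checked with the standard library alone. This sketch uses Python's built-in abs and math.hypot rather than numpy, so it only illustrates the formula, not the ufunc itself.

```python
import math

values = [-1.2, 1.2]
magnitudes = [abs(v) for v in values]  # element-wise absolute value of real numbers

z = 1.2 + 1j
complex_magnitude = abs(z)                    # |a + ib|
by_formula = math.hypot(z.real, z.imag)       # sqrt(a**2 + b**2)
```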

arccos
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Trigonometric inverse cosine, element-wise.
The inverse of cos so that, if y = cos(x), then x = arccos(y).
Parameters:  x (array_like) – x-coordinate on the unit circle. For real arguments, the domain is [-1, 1].
 out (ndarray, None, or tuple of ndarray and None, optional) – A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where (array_like, optional) – Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs – For other keyword-only arguments, see the ufunc docs.
Returns: angle (ndarray) – The angle of the ray intersecting the unit circle at the given x-coordinate in radians [0, pi]. This is a scalar if x is a scalar.
See also: cos, arctan, arcsin, emath.arccos
Notes:
arccos is a multivalued function: for each x there are infinitely many numbers z such that cos(z) = x. The convention is to return the angle z whose real part lies in [0, pi].
For real-valued input data types, arccos always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arccos is a complex analytic function that has branch cuts [-inf, -1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse cos is also known as acos or cos^-1.
References:
M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
Examples:
We expect the arccos of 1 to be 0, and of -1 to be pi:
>>> np.arccos([1, -1])
array([ 0.        ,  3.14159265])
Plot arccos:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-1, 1, num=100)
>>> plt.plot(x, np.arccos(x))
>>> plt.axis('tight')
>>> plt.show()
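The multivalued nature of arccos and the [0, pi] convention can be illustrated with the standard library's math.acos, used here in place of the numpy ufunc: other solutions of cos(z) = x differ from the returned angle by multiples of 2*pi (or by sign).

```python
import math

angle_at_one = math.acos(1.0)         # principal value for x = 1
angle_at_minus_one = math.acos(-1.0)  # principal value for x = -1

x = 0.5
z = math.acos(x)  # the returned angle lies in [0, pi]
# Other angles with the same cosine, none of which is the principal value:
other_solutions = [-z, z + 2 * math.pi, z - 2 * math.pi]
```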

arccosh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic cosine, element-wise.
Parameters:  x (array_like) – Input array.
 out (ndarray, None, or tuple of ndarray and None, optional) – A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where (array_like, optional) – Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs – For other keyword-only arguments, see the ufunc docs.
Returns: arccosh (ndarray) – Array of the same shape as x. This is a scalar if x is a scalar.
See also: cosh, arcsinh, sinh, arctanh, tanh
Notes:
arccosh is a multivalued function: for each x there are infinitely many numbers z such that cosh(z) = x. The convention is to return the z whose imaginary part lies in [-pi, pi] and the real part in [0, inf].
For real-valued input data types, arccosh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arccosh is a complex analytical function that has a branch cut [-inf, 1] and is continuous from above on it.
References:
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arccosh
Examples:
>>> np.arccosh([np.e, 10.0])
array([ 1.65745445,  2.99322285])
>>> np.arccosh(1)
0.0
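For real x >= 1 the principal value of the inverse hyperbolic cosine has the closed form log(x + sqrt(x**2 - 1)). The sketch below checks this with the standard library's math.acosh, which stands in for the numpy ufunc here.

```python
import math


def arccosh_by_formula(x):
    """Closed form of the principal value for real x >= 1."""
    return math.log(x + math.sqrt(x * x - 1.0))


inputs = [1.0, math.e, 10.0]
from_library = [math.acosh(x) for x in inputs]
from_formula = [arccosh_by_formula(x) for x in inputs]
```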

arcsin
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse sine, elementwise.
 x : array_like
 y-coordinate on the unit circle.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 angle : ndarray
 The inverse sine of each element in x, in radians and in the closed interval [-pi/2, pi/2]. This is a scalar if x is a scalar.
sin, cos, arccos, tan, arctan, arctan2, emath.arcsin
arcsin is a multivalued function: for each x there are infinitely many numbers z such that \(sin(z) = x\). The convention is to return the angle z whose real part lies in [-pi/2, pi/2].
For real-valued input data types, arcsin always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arcsin is a complex analytic function that has, by convention, the branch cuts [-inf, -1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse sine is also known as asin or sin^{-1}.
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79ff. http://www.math.sfu.ca/~cbm/aands/
>>> np.arcsin(1) # pi/2
1.5707963267948966
>>> np.arcsin(-1) # -pi/2
-1.5707963267948966
>>> np.arcsin(0)
0.0

arcsinh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic sine elementwise.
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Array of the same shape as x. This is a scalar if x is a scalar.
arcsinh is a multivalued function: for each x there are infinitely many numbers z such that sinh(z) = x. The convention is to return the z whose imaginary part lies in [-pi/2, pi/2].
For real-valued input data types, arcsinh always returns real output. For each value that cannot be expressed as a real number or infinity, it returns nan and sets the invalid floating point error flag.
For complex-valued input, arcsinh is a complex analytical function that has branch cuts [1j, infj] and [-1j, -infj] and is continuous from the right on the former and from the left on the latter.
The inverse hyperbolic sine is also known as asinh or sinh^-1.
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arcsinh
>>> np.arcsinh(np.array([np.e, 10.0]))
array([ 1.72538256, 2.99822295])

arctan
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Trigonometric inverse tangent, elementwise.
The inverse of tan, so that if y = tan(x) then x = arctan(y).
 x : array_like
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Out has the same shape as x. Its real part is in [-pi/2, pi/2] (arctan(+/-inf) returns +/-pi/2). This is a scalar if x is a scalar.
 arctan2 : The “four quadrant” arctan of the angle formed by (x, y) and the positive x-axis.
 angle : Argument of complex values.
arctan is a multivalued function: for each x there are infinitely many numbers z such that tan(z) = x. The convention is to return the angle z whose real part lies in [-pi/2, pi/2].
For real-valued input data types, arctan always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arctan is a complex analytic function that has [1j, infj] and [-1j, -infj] as branch cuts, and is continuous from the left on the former and from the right on the latter.
The inverse tangent is also known as atan or tan^{-1}.
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
We expect the arctan of 0 to be 0, and of 1 to be pi/4:
>>> np.arctan([0, 1])
array([ 0. , 0.78539816])
>>> np.pi/4
0.78539816339744828
Plot arctan:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-10, 10)
>>> plt.plot(x, np.arctan(x))
>>> plt.axis('tight')
>>> plt.show()

arctan2
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise arc tangent of x1/x2 choosing the quadrant correctly.
The quadrant (i.e., branch) is chosen so that arctan2(x1, x2) is the signed angle in radians between the ray ending at the origin and passing through the point (1,0), and the ray ending at the origin and passing through the point (x2, x1). (Note the role reversal: the “y-coordinate” is the first function parameter, the “x-coordinate” is the second.) By IEEE convention, this function is defined for x2 = +/-0 and for either or both of x1 and x2 = +/-inf (see Notes for specific values).
This function is not defined for complex-valued arguments; for the so-called argument of complex values, use angle.
 x1 : array_like, real-valued
 y-coordinates.
 x2 : array_like, real-valued
 x-coordinates. x2 must be broadcastable to match the shape of x1 or vice versa.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 angle : ndarray
 Array of angles in radians, in the range [-pi, pi]. This is a scalar if both x1 and x2 are scalars.
arctan, tan, angle
arctan2 is identical to the atan2 function of the underlying C library. The following special values are defined in the C standard [1]:
x1      x2      arctan2(x1,x2)
+/- 0   +0      +/- 0
+/- 0   -0      +/- pi
> 0     +/-inf  +0 / +pi
< 0     +/-inf  -0 / -pi
+/-inf  +inf    +/- (pi/4)
+/-inf  -inf    +/- (3*pi/4)
Note that +0 and -0 are distinct floating point numbers, as are +inf and -inf.
[1] ISO/IEC standard 9899:1999, “Programming language C.”
Consider four points in different quadrants:
>>> x = np.array([-1, +1, +1, -1])
>>> y = np.array([-1, -1, +1, +1])
>>> np.arctan2(y, x) * 180 / np.pi
array([-135., -45., 45., 135.])
Note the order of the parameters. arctan2 is defined also when x2 = 0 and at several other special points, obtaining values in the range [-pi, pi]:
>>> np.arctan2([1., -1.], [0., 0.])
array([ 1.57079633, -1.57079633])
>>> np.arctan2([0., 0., np.inf], [+0., -0., np.inf])
array([ 0. , 3.14159265, 0.78539816])
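Because arctan2 defers to the C library's atan2, the special values for signed zeros and infinities can be cross-checked without NumPy. The sketch below uses Python's standard math.atan2, which, as far as the CPython implementation goes, follows the same C99 special-value rules; treat that as an assumption rather than a guarantee for every platform. The names cases and results are illustrative only.

```python
import math

inf = math.inf
# (y, x) pairs: note that the y-coordinate comes first, as in arctan2(x1, x2)
cases = [
    (0.0, 0.0),    # +0, +0     -> +0
    (0.0, -0.0),   # +0, -0     -> +pi
    (-0.0, -0.0),  # -0, -0     -> -pi
    (1.0, inf),    # > 0, +inf  -> +0
    (1.0, -inf),   # > 0, -inf  -> +pi
    (inf, inf),    # +inf, +inf -> +pi/4
    (inf, -inf),   # +inf, -inf -> +3*pi/4
]
# A list is used (not a dict) because 0.0 and -0.0 compare equal as keys.
results = [math.atan2(y, x) for (y, x) in cases]

for (y, x), angle in zip(cases, results):
    print(f"atan2({y!r}, {x!r}) = {angle!r}")
```

The sign of a zero argument decides on which side of the branch cut along the negative x-axis the result lands, which is why +0 and -0 have to be treated as distinct inputs.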

arctanh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic tangent elementwise.
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Array of the same shape as x. This is a scalar if x is a scalar.
emath.arctanh
arctanh is a multivalued function: for each x there are infinitely many numbers z such that tanh(z) = x. The convention is to return the z whose imaginary part lies in [-pi/2, pi/2].
For real-valued input data types, arctanh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arctanh is a complex analytical function that has branch cuts [-1, -inf] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse hyperbolic tangent is also known as atanh or tanh^-1.
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arctanh
>>> np.arctanh([0, -0.5])
array([ 0. , -0.54930614])

clip
(a_min, a_max, out=None)¶ Clip (limit) the values in an array.
Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
 a : array_like
 Array containing elements to clip.
 a_min : scalar or array_like or None
 Minimum value. If None, clipping is not performed on lower interval edge. Not more than one of a_min and a_max may be None.
 a_max : scalar or array_like or None
 Maximum value. If None, clipping is not performed on upper interval edge. Not more than one of a_min and a_max may be None. If a_min or a_max are array_like, then the three arrays will be broadcast to match their shapes.
 out : ndarray, optional
 The results will be placed in this array. It may be the input array for in-place clipping. out must be of the right shape to hold the output. Its type is preserved.
 clipped_array : ndarray
 An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.
numpy.doc.ufuncs : Section “Output arguments”
>>> a = np.arange(10)
>>> np.clip(a, 1, 8)
array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8])
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, 3, 6, out=a)
array([3, 3, 3, 3, 4, 5, 6, 6, 6, 6])
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, [3, 4, 1, 1, 1, 4, 4, 4, 4, 4], 8)
array([3, 4, 2, 3, 4, 5, 6, 7, 8, 8])
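The clipping rule for scalar bounds can be written out element by element. This is a plain-Python sketch of the rule as described above, not the NumPy or vaex implementation; the function name clip_values is hypothetical, and array-valued bounds and the out argument are deliberately left out.

```python
def clip_values(values, a_min=None, a_max=None):
    """Clip each element of `values` to [a_min, a_max]; None disables that edge."""
    if a_min is None and a_max is None:
        raise ValueError("not more than one of a_min and a_max may be None")
    clipped = []
    for v in values:
        if a_min is not None and v < a_min:
            v = a_min          # values below the interval move to the lower edge
        if a_max is not None and v > a_max:
            v = a_max          # values above the interval move to the upper edge
        clipped.append(v)
    return clipped

print(clip_values(range(10), 1, 8))      # [1, 1, 2, 3, 4, 5, 6, 7, 8, 8]
print(clip_values([0, 5, 10], a_min=3))  # [3, 5, 10]
```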

cos
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Cosine elementwise.
 x : array_like
 Input array in radians.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding cosine values. This is a scalar if x is a scalar.
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972.
>>> np.cos(np.array([0, np.pi/2, np.pi]))
array([ 1.00000000e+00, 6.12303177e-17, -1.00000000e+00])
>>> # Example of providing the optional output parameter
>>> out1 = np.array([0], dtype='d')
>>> out2 = np.cos([0.1], out1)
>>> out2 is out1
True
>>> # Example of ValueError due to provision of shape mismatched `out`
>>> np.cos(np.zeros((3,3)),np.zeros((2,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,3) (2,2)

cosh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Hyperbolic cosine, elementwise.
Equivalent to 1/2 * (np.exp(x) + np.exp(-x)) and np.cos(1j*x).
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Output array of same shape as x. This is a scalar if x is a scalar.
>>> np.cosh(0) 1.0
The hyperbolic cosine describes the shape of a hanging cable:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-4, 4, 1000)
>>> plt.plot(x, np.cosh(x))
>>> plt.show()

count
(binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]¶ Shortcut for ds.count(expression, …), see Dataset.count

deg2rad
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Convert angles from degrees to radians.
 x : array_like
 Angles in degrees.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding angle in radians. This is a scalar if x is a scalar.
rad2deg : Convert angles from radians to degrees.
unwrap : Remove large jumps in angle by wrapping.
New in version 1.3.0.
deg2rad(x) is x * pi / 180.
>>> np.deg2rad(180)
3.1415926535897931

exp
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Calculate the exponential of all elements in the input array.
 x : array_like
 Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Output array, elementwise exponential of x. This is a scalar if x is a scalar.
expm1 : Calculate exp(x) - 1 for all elements in the array.
exp2 : Calculate 2**x for all elements in the array.
The irrational number e is also known as Euler’s number. It is approximately 2.718281, and is the base of the natural logarithm, ln (this means that, if \(x = \ln y = \log_e y\), then \(e^x = y\)). For real input, exp(x) is always positive.
For complex arguments, x = a + ib, we can write \(e^x = e^a e^{ib}\). The first term, \(e^a\), is already known (it is the real argument, described above). The second term, \(e^{ib}\), is \(\cos b + i \sin b\), a function with magnitude 1 and a periodic phase.
[1] Wikipedia, “Exponential function”, https://en.wikipedia.org/wiki/Exponential_function
[2] M. Abramowitz and I. A. Stegun, “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables,” Dover, 1964, p. 69, http://www.math.sfu.ca/~cbm/aands/page_69.htm
Plot the magnitude and phase of exp(x) in the complex plane:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-2*np.pi, 2*np.pi, 100)
>>> xx = x + 1j * x[:, np.newaxis] # a + ib over complex plane
>>> out = np.exp(xx)
>>> plt.subplot(121)
>>> plt.imshow(np.abs(out),
... extent=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi], cmap='gray')
>>> plt.title('Magnitude of exp(x)')
>>> plt.subplot(122)
>>> plt.imshow(np.angle(out),
... extent=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi], cmap='hsv')
>>> plt.title('Phase (angle) of exp(x)')
>>> plt.show()
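The decomposition of the complex exponential into a magnitude e^a and a unit-magnitude phase factor can be checked numerically. The sketch below uses Python's standard cmath and math modules rather than NumPy; the sample values a = 0.5 and b = 2.0 are arbitrary choices for illustration.

```python
import cmath
import math

a, b = 0.5, 2.0
direct = cmath.exp(complex(a, b))

# e^(a + ib) = e^a * (cos b + i sin b)
factored = math.exp(a) * complex(math.cos(b), math.sin(b))

print(direct, factored)
print(abs(direct), math.exp(a))   # the magnitude equals e^a
print(cmath.phase(direct), b)     # the phase equals b, since b lies in (-pi, pi]
```

For values of b outside (-pi, pi] the reported phase wraps around, which is the periodicity mentioned above.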

expand
(stop=[])[source]¶ Expand the expression such that no virtual columns occur, only normal columns.
Example:
>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
>>> r.expand().expression
'sqrt(((x ** 2) + (y ** 2)))'

expm1
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Calculate exp(x) - 1 for all elements in the array.
 x : array_like
 Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 out : ndarray or scalar
 Elementwise exponential minus one: out = exp(x) - 1. This is a scalar if x is a scalar.
log1p : log(1 + x), the inverse of expm1.
This function provides greater precision than exp(x) - 1 for small values of x.
The true value of exp(1e-10) - 1 is 1.00000000005e-10 to about 32 significant digits. This example shows the superiority of expm1 in this case.
>>> np.expm1(1e-10)
1.00000000005e-10
>>> np.exp(1e-10) - 1
1.000000082740371e-10
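The precision gap can be quantified as a relative error against the reference value quoted above. This sketch uses math.expm1 from Python's standard library as a stand-in for the NumPy ufunc; the variable names are illustrative.

```python
import math

x = 1e-10
accurate = math.expm1(x)    # computed without forming exp(x) first
naive = math.exp(x) - 1.0   # exp(x) is rounded to a double near 1, losing digits

reference = 1.00000000005e-10   # value quoted in the text above
rel_err_accurate = abs(accurate - reference) / reference
rel_err_naive = abs(naive - reference) / reference

print(rel_err_accurate, rel_err_naive)
```

The naive form is off by roughly one part in ten million, while expm1 agrees with the reference to within floating-point rounding.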

fillna
(value, fill_nan=True, fill_masked=True)¶ Returns an array where missing values are replaced by value.
If the dtype is object, nan values and ‘nan’ string values are replaced by value when fill_nan==True.

format
(format)¶ Uses http://www.cplusplus.com/reference/string/to_string/ for formatting

log
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Natural logarithm, elementwise.
The natural logarithm log is the inverse of the exponential function, so that log(exp(x)) = x. The natural logarithm is logarithm in base e.
 x : array_like
 Input value.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The natural logarithm of x, elementwise. This is a scalar if x is a scalar.
log10, log2, log1p, emath.log
Logarithm is a multivalued function: for each x there is an infinite number of z such that exp(z) = x. The convention is to return the z whose imaginary part lies in [-pi, pi].
For real-valued input data types, log always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, log is a complex analytical function that has a branch cut [-inf, 0] and is continuous from above on it. log handles the floating-point negative zero as an infinitesimal negative number, conforming to the C99 standard.
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Logarithm”. https://en.wikipedia.org/wiki/Logarithm
>>> np.log([1, np.e, np.e**2, 0])
array([ 0., 1., 2., -Inf])
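The statement that the complex logarithm is continuous from above on its branch cut, and that negative zero is treated as an infinitesimal negative number, can be illustrated with a point on the negative real axis. The sketch uses Python's standard cmath.log, which is assumed here to follow the same C99 conventions as the NumPy function; the names above and below are illustrative.

```python
import cmath
import math

# Same point -1 on the real axis, approached from either side of the cut.
above = cmath.log(complex(-1.0, 0.0))    # imaginary part +0
below = cmath.log(complex(-1.0, -0.0))   # imaginary part -0

print(above)   # imaginary part is +pi
print(below)   # imaginary part is -pi
```

complex(-1.0, -0.0) is used instead of a literal such as -1.0 - 0.0j because the literal form does not reliably preserve the sign of the zero imaginary part.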

log10
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the base 10 logarithm of the input array, elementwise.
 x : array_like
 Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The logarithm to the base 10 of x, elementwise. NaNs are returned where x is negative. This is a scalar if x is a scalar.
emath.log10
Logarithm is a multivalued function: for each x there is an infinite number of z such that 10**z = x. The convention is to return the z whose imaginary part lies in [-pi, pi].
For real-valued input data types, log10 always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, log10 is a complex analytical function that has a branch cut [-inf, 0] and is continuous from above on it. log10 handles the floating-point negative zero as an infinitesimal negative number, conforming to the C99 standard.
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Logarithm”. https://en.wikipedia.org/wiki/Logarithm
>>> np.log10([1e-15, -3.])
array([-15., NaN])

log1p
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the natural logarithm of one plus the input array, elementwise.
Calculates log(1 + x).
 x : array_like
 Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 Natural logarithm of 1 + x, elementwise. This is a scalar if x is a scalar.
expm1 : exp(x) - 1, the inverse of log1p.
For real-valued input, log1p is accurate also for x so small that 1 + x == 1 in floating-point accuracy.
Logarithm is a multivalued function: for each x there is an infinite number of z such that exp(z) = 1 + x. The convention is to return the z whose imaginary part lies in [-pi, pi].
For real-valued input data types, log1p always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, log1p is a complex analytical function that has a branch cut [-inf, -1] and is continuous from above on it. log1p handles the floating-point negative zero as an infinitesimal negative number, conforming to the C99 standard.
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Logarithm”. https://en.wikipedia.org/wiki/Logarithm
>>> np.log1p(1e-99)
1e-99
>>> np.log(1 + 1e-99)
0.0
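The inverse relationship between log1p and expm1 holds even where 1 + x is indistinguishable from 1 in double precision. A short sketch with Python's standard math module, used here as a stand-in for the NumPy ufuncs:

```python
import math

x = 1e-99
tiny_kept = math.log1p(x)        # stays close to x
tiny_lost = math.log(1.0 + x)    # 1.0 + x rounds to exactly 1.0, so this is 0.0

y = 1e-5
round_trip = math.expm1(math.log1p(y))   # should recover y

print(tiny_kept, tiny_lost, round_trip)
```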

map
(mapper, nan_value=None, null_value=None, default_value=None, allow_missing=False)[source]¶ Map values of an expression or in-memory column according to an input dictionary or a custom callable function.
Example:
>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'red', 'blue', 'red', 'green'])
>>> mapper = {'red': 1, 'blue': 2, 'green': 3}
>>> df['color_mapped'] = df.color.map(mapper)
>>> df
#  color    color_mapped
0  red                 1
1  red                 1
2  blue                2
3  red                 1
4  green               3
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, np.nan])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user', np.nan: 'unknown'})
>>> df
#  type  role
0     0  admin
1     1  maintainer
2     2  user
3     2  user
4     2  user
5   nan  unknown
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, 4])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user'}, default_value='unknown')
>>> df
#  type  role
0     0  admin
1     1  maintainer
2     2  user
3     2  user
4     2  user
5     4  unknown
:param mapper: dict like object used to map the values from keys to values
:param nan_value: value to be used when a nan is present (and not in the mapper)
:param null_value: value to be used when there is a missing value
:param default_value: value to be used when a value is not in the mapper (like dict.get(key, default))
:param allow_missing: used to signal that values in the mapper should map to a masked array with missing values, assumed True when default_value is not None.
:return: A vaex expression
:rtype: vaex.expression.Expression
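The lookup rules for default_value and nan_value can be pictured with an ordinary dictionary. The following is a simplified plain-Python stand-in, not vaex code: the function map_values is hypothetical, it works eagerly on a list, and it does not model null_value, allow_missing, or callable mappers.

```python
import math

def map_values(values, mapper, nan_value=None, default_value=None):
    """Simplified model of the lookup rules described above (not vaex code)."""
    mapped = []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            # NaN never compares equal to itself, so it gets its own rule
            mapped.append(nan_value)
        else:
            # same behaviour as dict.get(key, default)
            mapped.append(mapper.get(v, default_value))
    return mapped

roles = map_values([0, 1, 2, 2, 2, 4],
                   {0: 'admin', 1: 'maintainer', 2: 'user'},
                   default_value='unknown')
print(roles)
```

Here the value 4 has no entry in the mapper, so it falls back to 'unknown', mirroring the last example above.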

masked
¶ Alias to df.is_masked(expression)

max
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.max(expression, …), see Dataset.max

maximum
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise maximum of array elements.
Compares two arrays and returns a new array containing the elementwise maxima. If one of the elements being compared is a NaN, then that element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are propagated.
 x1, x2 : array_like
 The arrays holding the elements to be compared. They must have the same shape, or shapes that can be broadcast to a single shape.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray or scalar
 The maximum of x1 and x2, elementwise. This is a scalar if both x1 and x2 are scalars.
 minimum :
 Elementwise minimum of two arrays, propagates NaNs.
 fmax :
 Elementwise maximum of two arrays, ignores NaNs.
 amax :
 The maximum value of an array along a given axis, propagates NaNs.
 nanmax :
 The maximum value of an array along a given axis, ignores NaNs.
fmin, amin, nanmin
The maximum is equivalent to np.where(x1 >= x2, x1, x2) when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.
>>> np.maximum([2, 3, 4], [1, 5, 2])
array([2, 5, 4])
>>> np.maximum(np.eye(2), [0.5, 2]) # broadcasting
array([[ 1. , 2. ],
       [ 0.5, 2. ]])
>>> np.maximum([np.nan, 0, np.nan], [0, np.nan, np.nan])
array([ NaN, NaN, NaN])
>>> np.maximum(np.Inf, 1)
inf
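The NaN-propagation rule is what separates maximum from a plain comparison. The sketch below spells the rule out for two real scalars in plain Python; the function name nan_propagating_max is hypothetical, and broadcasting and complex NaNs are not modelled.

```python
import math

def nan_propagating_max(x1, x2):
    """Return the larger of two floats, or NaN if either input is NaN."""
    if math.isnan(x1):
        return x1              # the first NaN wins, also when both are NaN
    if math.isnan(x2):
        return x2
    return x1 if x1 >= x2 else x2

pairs = zip([math.nan, 0.0, 2.0], [0.0, math.nan, 5.0])
result = [nan_propagating_max(a, b) for a, b in pairs]
print(result)   # NaN in the first two positions, 5.0 in the last
```

A comparison-only version such as `x1 if x1 >= x2 else x2` would return 0.0 for the first pair, because any comparison with NaN is false.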

mean
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.mean(expression, …), see Dataset.mean

min
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.min(expression, …), see Dataset.min

minimum
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise minimum of array elements.
Compares two arrays and returns a new array containing the elementwise minima. If one of the elements being compared is a NaN, then that element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are propagated.
 x1, x2 : array_like
 The arrays holding the elements to be compared. They must have the same shape, or shapes that can be broadcast to a single shape.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray or scalar
 The minimum of x1 and x2, elementwise. This is a scalar if both x1 and x2 are scalars.
 maximum :
 Elementwise maximum of two arrays, propagates NaNs.
 fmin :
 Elementwise minimum of two arrays, ignores NaNs.
 amin :
 The minimum value of an array along a given axis, propagates NaNs.
 nanmin :
 The minimum value of an array along a given axis, ignores NaNs.
fmax, amax, nanmax
The minimum is equivalent to np.where(x1 <= x2, x1, x2) when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.
>>> np.minimum([2, 3, 4], [1, 5, 2])
array([1, 3, 2])
>>> np.minimum(np.eye(2), [0.5, 2]) # broadcasting
array([[ 0.5, 0. ],
       [ 0. , 1. ]])
>>> np.minimum([np.nan, 0, np.nan],[0, np.nan, np.nan])
array([ NaN, NaN, NaN])
>>> np.minimum(-np.Inf, 1)
-inf

minmax
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.minmax(expression, …), see Dataset.minmax

nop
()[source]¶ Evaluates the expression and drops the result; useful for benchmarking, since vaex is usually lazy.

rad2deg
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Convert angles from radians to degrees.
 x : array_like
 Angle in radians.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding angle in degrees. This is a scalar if x is a scalar.
deg2rad : Convert angles from degrees to radians.
unwrap : Remove large jumps in angle by wrapping.
New in version 1.3.0.
rad2deg(x) is 180 * x / pi.
>>> np.rad2deg(np.pi/2)
90.0

searchsorted
(v, side='left', sorter=None)¶ Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
Assuming that a is sorted:
side    returned index i satisfies
left    a[i-1] < v <= a[i]
right   a[i-1] <= v < a[i]
 a : 1-D array_like
 Input array. If sorter is None, then it must be sorted in ascending order, otherwise sorter must be an array of indices that sort it.
 v : array_like
 Values to insert into a.
 side : {‘left’, ‘right’}, optional
 If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of a).
 sorter : 1-D array_like, optional
Optional array of integer indices that sort array a into ascending order. They are typically the result of argsort.
New in version 1.7.0.
 indices : array of ints
 Array of insertion points with the same shape as v.
sort : Return a sorted copy of an array.
histogram : Produce histogram from 1-D data.
Binary search is used to find the required insertion points.
As of NumPy 1.4.0 searchsorted works with real/complex arrays containing nan values. The enhanced sort order is documented in sort.
This function is a faster version of the builtin python bisect.bisect_left (side='left') and bisect.bisect_right (side='right') functions, which is also vectorized in the v argument.
>>> np.searchsorted([1,2,3,4,5], 3)
2
>>> np.searchsorted([1,2,3,4,5], 3, side='right')
3
>>> np.searchsorted([1,2,3,4,5], [-10, 10, 2, 3])
array([0, 5, 1, 2])
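The relationship to the bisect module mentioned above can be sketched in plain Python. The wrapper below is a hypothetical, unvectorized model that loops over v; it is meant only to illustrate the side semantics.

```python
import bisect

a = [1, 2, 3, 4, 5]  # must already be sorted in ascending order

def searchsorted(a, values, side="left"):
    """Return the insertion index for each value, like a loop over bisect."""
    find = bisect.bisect_left if side == "left" else bisect.bisect_right
    return [find(a, v) for v in values]

left = searchsorted(a, [3])
right = searchsorted(a, [3], side="right")
many = searchsorted(a, [-10, 10, 2, 3])
```

Values below the smallest element map to 0 and values above the largest map to len(a).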

sin
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Trigonometric sine, element-wise.
 x : array_like
 Angle, in radians (\(2 \pi\) rad equals 360 degrees).
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : array_like
 The sine of each element of x. This is a scalar if x is a scalar.
arcsin, sinh, cos
The sine is one of the fundamental functions of trigonometry (the mathematical study of triangles). Consider a circle of radius 1 centered on the origin. A ray comes in from the \(+x\) axis, makes an angle at the origin (measured counterclockwise from that axis), and departs from the origin. The \(y\) coordinate of the outgoing ray’s intersection with the unit circle is the sine of that angle. It ranges from -1 for \(x=3\pi / 2\) to +1 for \(\pi / 2.\) The function has zeroes where the angle is a multiple of \(\pi\). Sines of angles between \(\pi\) and \(2\pi\) are negative. The numerous properties of the sine and related functions are included in any standard trigonometry text.
Print sine of one angle:
>>> np.sin(np.pi/2.) 1.0
Print sines of an array of angles given in degrees:
>>> np.sin(np.array((0., 30., 45., 60., 90.)) * np.pi / 180. ) array([ 0. , 0.5 , 0.70710678, 0.8660254 , 1. ])
Plot the sine function:
>>> import matplotlib.pylab as plt
>>> x = np.linspace(-np.pi, np.pi, 201)
>>> plt.plot(x, np.sin(x))
>>> plt.xlabel('Angle [rad]')
>>> plt.ylabel('sin(x)')
>>> plt.axis('tight')
>>> plt.show()

sinc
()¶ Return the sinc function.
The sinc function is \(\sin(\pi x)/(\pi x)\).
 x : ndarray
 Array (possibly multidimensional) of values for which to calculate sinc(x).
 out : ndarray
 sinc(x), which has the same shape as the input.
sinc(0) is the limit value 1.
The name sinc is short for “sine cardinal” or “sinus cardinalis”.
The sinc function is used in various signal processing applications, including in anti-aliasing, in the construction of a Lanczos resampling filter, and in interpolation.
For bandlimited interpolation of discrete-time signals, the ideal interpolation kernel is proportional to the sinc function.
[1] Weisstein, Eric W. “Sinc Function.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/SincFunction.html
[2] Wikipedia, “Sinc function”, https://en.wikipedia.org/wiki/Sinc_function
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-4, 4, 41)
>>> np.sinc(x)
array([-3.89804309e-17, -4.92362781e-02, -8.40918587e-02, -8.90384387e-02,
       -5.84680802e-02,  3.89804309e-17,  6.68206631e-02,  1.16434881e-01,
        1.26137788e-01,  8.50444803e-02, -3.89804309e-17, -1.03943254e-01,
       -1.89206682e-01, -2.16236208e-01, -1.55914881e-01,  3.89804309e-17,
        2.33872321e-01,  5.04551152e-01,  7.56826729e-01,  9.35489284e-01,
        1.00000000e+00,  9.35489284e-01,  7.56826729e-01,  5.04551152e-01,
        2.33872321e-01,  3.89804309e-17, -1.55914881e-01, -2.16236208e-01,
       -1.89206682e-01, -1.03943254e-01, -3.89804309e-17,  8.50444803e-02,
        1.26137788e-01,  1.16434881e-01,  6.68206631e-02,  3.89804309e-17,
       -5.84680802e-02, -8.90384387e-02, -8.40918587e-02, -4.92362781e-02,
       -3.89804309e-17])
>>> plt.plot(x, np.sinc(x))
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.title("Sinc Function")
<matplotlib.text.Text object at 0x...>
>>> plt.ylabel("Amplitude")
<matplotlib.text.Text object at 0x...>
>>> plt.xlabel("X")
<matplotlib.text.Text object at 0x...>
>>> plt.show()
It works in 2-D as well:
>>> x = np.linspace(-4, 4, 401)
>>> xx = np.outer(x, x)
>>> plt.imshow(np.sinc(xx))
<matplotlib.image.AxesImage object at 0x...>
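The definition and its limit value at zero can be written out for a single scalar. This is a sketch for illustration, assuming the normalized form \(\sin(\pi x)/(\pi x)\) stated above; the function is a stand-in, not the NumPy implementation.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x), with the limit value 1 at x == 0."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

at_zero = sinc(0.0)
at_half = sinc(0.5)   # equals 2 / pi
at_one = sinc(1.0)    # zero up to floating-point rounding
```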

sinh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Hyperbolic sine, element-wise.
Equivalent to 1/2 * (np.exp(x) - np.exp(-x)) or -1j * np.sin(1j*x).
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding hyperbolic sine values. This is a scalar if x is a scalar.
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972, pg. 83.
>>> np.sinh(0)
0.0
>>> np.sinh(np.pi*1j/2)
1j
>>> np.sinh(np.pi*1j) # (exact value is 0)
1.2246063538223773e-016j
>>> # Discrepancy due to vagaries of floating point arithmetic.
>>> # Example of providing the optional output parameter
>>> out1 = np.array([0], dtype='d')
>>> out2 = np.sinh([0.1], out1)
>>> out2 is out1
True
>>> # Example of ValueError due to provision of shape mismatched `out`
>>> np.sinh(np.zeros((3,3)),np.zeros((2,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,3) (2,2)
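The two equivalent forms of the hyperbolic sine (one via exponentials, one via the sine of an imaginary argument) can be verified numerically for a scalar. The helper names below are hypothetical and use only the standard math and cmath modules.

```python
import cmath
import math

def sinh_via_exp(x):
    """Half the difference of exp(x) and exp(-x)."""
    return 0.5 * (math.exp(x) - math.exp(-x))

def sinh_via_sin(x):
    """-1j * sin(1j * x); the result is complex with a zero imaginary part."""
    return -1j * cmath.sin(1j * x)

reference = math.sinh(1.5)
from_exp = sinh_via_exp(1.5)
from_sin = sinh_via_sin(1.5)
```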

sqrt
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the non-negative square-root of an array, element-wise.
 x : array_like
 The values whose square-roots are required.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 An array of the same shape as x, containing the positive square-root of each element in x. If any element in x is complex, a complex array is returned (and the square-roots of negative reals are calculated). If all of the elements in x are real, so is y, with negative elements returning nan. If out was provided, y is a reference to it. This is a scalar if x is a scalar.
 lib.scimath.sqrt
 A version which returns complex numbers when given negative reals.
sqrt has, consistent with common convention, as its branch cut the real “interval” [-inf, 0), and is continuous from above on it. A branch cut is a curve in the complex plane across which a given complex function fails to be continuous.
>>> np.sqrt([1,4,9])
array([ 1., 2., 3.])
>>> np.sqrt([4, -1, -3+4J])
array([ 2.+0.j, 0.+1.j, 1.+2.j])
>>> np.sqrt([4, -1, np.inf])
array([ 2., NaN, Inf])
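The branch-cut statement can be illustrated with the standard cmath module, which documents the same convention. This is only an analogy for scalars: real_sqrt is a hypothetical helper that mimics the nan result for negative real input, and the signed-zero behaviour shown is that of cmath.

```python
import cmath
import math

def real_sqrt(x):
    """Model of the real-input rule: negative values give nan."""
    return math.sqrt(x) if x >= 0 else math.nan

negative_real = real_sqrt(-1.0)               # nan for real input
complex_root = cmath.sqrt(-3 + 4j)            # principal root, 1+2j
from_above = cmath.sqrt(complex(-4.0, 0.0))   # approaching the cut from above
from_below = cmath.sqrt(complex(-4.0, -0.0))  # approaching the cut from below
```

The two results on the cut differ in sign, which is what "fails to be continuous" means in practice.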

std
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.std(expression, …), see Dataset.std

str
¶ Gives access to string operations

str_pandas
¶ Gives access to string operations (using Pandas Series)

sum
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.sum(expression, …), see Dataset.sum

tan
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute tangent element-wise.
Equivalent to np.sin(x)/np.cos(x) element-wise.
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding tangent values. This is a scalar if x is a scalar.
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972.
>>> from math import pi
>>> np.tan(np.array([-pi,pi/2,pi]))
array([ 1.22460635e-16, 1.63317787e+16, -1.22460635e-16])
>>>
>>> # Example of providing the optional output parameter illustrating
>>> # that what is returned is a reference to said parameter
>>> out1 = np.array([0], dtype='d')
>>> out2 = np.tan([0.1], out1)
>>> out2 is out1
True
>>>
>>> # Example of ValueError due to provision of shape mismatched `out`
>>> np.tan(np.zeros((3,3)),np.zeros((2,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,3) (2,2)
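The very large value at pi/2 in the example is expected: the floating-point number closest to pi/2 is not exactly the pole, so the tangent is huge but finite. A short check with the standard math module, given as an illustration rather than a statement about NumPy internals:

```python
import math

near_pole = math.tan(math.pi / 2)  # huge but finite, not inf
at_pi = math.tan(math.pi)          # tiny but non-zero, not exactly 0
```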

tanh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute hyperbolic tangent element-wise.
Equivalent to np.sinh(x)/np.cosh(x) or -1j * np.tan(1j*x).
 x : array_like
 Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
 A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
 Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
 For other keyword-only arguments, see the ufunc docs.
 y : ndarray
 The corresponding hyperbolic tangent values. This is a scalar if x is a scalar.
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972, pg. 83. http://www.math.sfu.ca/~cbm/aands/
[2] Wikipedia, “Hyperbolic function”, https://en.wikipedia.org/wiki/Hyperbolic_function
>>> np.tanh((0, np.pi*1j, np.pi*1j/2))
array([ 0. +0.00000000e+00j, 0. -1.22460635e-16j, 0. +1.63317787e+16j])
>>> # Example of providing the optional output parameter illustrating
>>> # that what is returned is a reference to said parameter
>>> out1 = np.array([0], dtype='d')
>>> out2 = np.tanh([0.1], out1)
>>> out2 is out1
True
>>> # Example of ValueError due to provision of shape mismatched `out`
>>> np.tanh(np.zeros((3,3)),np.zeros((2,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,3) (2,2)

transient
¶ If this expression is not transient (e.g. on disk) optimizations can be made

value_counts
(dropna=False, dropnull=True, ascending=False, progress=False)[source]¶ Computes counts of unique values.
 WARNING:
 If the expression/column is not categorical, it will be converted on the fly
 dropna is False by default, it is True by default in pandas
Parameters:  dropna – when True, it will not report the missing values
 ascending – when False (default) it will report the most frequently occurring item first
Returns: Pandas series containing the counts
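The effect of dropna and ascending can be modelled in plain Python. This is a rough sketch under stated assumptions: None stands in for a missing value, the result is a list of (value, count) pairs rather than a Pandas series, and the function is a hypothetical stand-in, not vaex code.

```python
from collections import Counter

def value_counts(values, dropna=False, ascending=False):
    """Count unique values; None plays the role of a missing value here."""
    kept = [v for v in values if not (dropna and v is None)]
    counts = Counter(kept)
    # Sort by count; most frequent first unless ascending is requested.
    return sorted(counts.items(), key=lambda item: item[1], reverse=not ascending)

sample = ["a", "b", "a", None, "a", "b"]
default_counts = value_counts(sample)
without_missing = value_counts(sample, dropna=True)
rarest_first = value_counts(sample, ascending=True)
```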

var
(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶ Shortcut for ds.var(expression, …), see Dataset.var

variables
()[source]¶ Return a set of variables this expression depends on.
Example:
>>> df = vaex.example() >>> r = np.sqrt(df.data.x**2 + df.data.y**2) >>> r.variables() {'x', 'y'}

where
(condition[, x, y])¶ Return elements chosen from x or y depending on condition.
Note
When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
 condition : array_like, bool
 Where True, yield x, otherwise yield y.
 x, y : array_like
 Values from which to choose. x, y and condition need to be broadcastable to some shape.
 out : ndarray
 An array with elements from x where condition is True, and elements from y elsewhere.
choose
nonzero : The function that is called when x and y are omitted
If all the arrays are 1-D, where is equivalent to:
[xv if c else yv for c, xv, yv in zip(condition, x, y)]
>>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> np.where(a < 5, a, 10*a) array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
This can be used on multidimensional arrays too:
>>> np.where([[True, False], [True, True]], ... [[1, 2], [3, 4]], ... [[9, 8], [7, 6]]) array([[1, 8], [3, 4]])
The shapes of x, y, and the condition are broadcast together:
>>> x, y = np.ogrid[:3, :4] >>> np.where(x < y, x, 10 + y) # both x and 10+y are broadcast array([[10, 0, 0, 0], [10, 11, 1, 1], [10, 11, 12, 2]])
>>> a = np.array([[0, 1, 2],
...               [0, 2, 4],
...               [0, 3, 6]])
>>> np.where(a < 4, a, -1) # -1 is broadcast
array([[ 0, 1, 2], [ 0, 2, -1], [ 0, 3, -1]])
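The 1-D equivalence stated above can be run directly on Python lists. The function below is a sketch of that list comprehension only; it does no broadcasting and is not the NumPy implementation.

```python
def where(condition, x, y):
    """Pick xv where the condition holds, otherwise yv (1-D inputs only)."""
    return [xv if c else yv for c, xv, yv in zip(condition, x, y)]

a = list(range(10))
result = where([v < 5 for v in a], a, [10 * v for v in a])
```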

String operations¶

class
vaex.expression.
StringOperations
(expression)[source]¶ Bases:
object
String operations

__weakref__
¶ list of weak references to the object (if defined)

byte_length
()¶ Returns the number of bytes in a string sample.
Returns: an expression containing the number of bytes in each sample of a string column. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.byte_length() Expression = str_byte_length(text) Length: 5 dtype: int64 (expression)  0 9 1 11 2 9 3 3 4 4
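For ASCII text the byte length and the character length coincide, as in the example above. They differ once non-ASCII characters appear. The sketch below uses plain Python strings and assumes UTF-8 encoding; that encoding is an assumption made for illustration and is not stated in this section.

```python
samples = ["way.", "café", "日本"]

# Number of characters per string.
char_lengths = [len(s) for s in samples]

# Number of bytes per string when encoded as UTF-8 (assumed encoding).
byte_lengths = [len(s.encode("utf-8")) for s in samples]
```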

capitalize
()¶ Capitalize the first letter of a string sample.
Returns: an expression containing the capitalized strings. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.capitalize() Expression = str_capitalize(text) Length: 5 dtype: str (expression)  0 Something 1 Very pretty 2 Is coming 3 Our 4 Way.

cat
(other)¶ Concatenate two string columns on a row-by-row basis.
Parameters: other (expression) – The expression of the other column to be concatenated. Returns: an expression containing the concatenated columns. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.cat(df.text) Expression = str_cat(text, text) Length: 5 dtype: str (expression)  0 SomethingSomething 1 very prettyvery pretty 2 is comingis coming 3 ourour 4 way.way.

center
(width, fillchar=' ')¶ Fills the left and right side of the strings with additional characters, such that the sample has a total of width characters.
Parameters:  width (int) – The total number of characters of the resulting string sample.
 fillchar (str) – The character used for filling.
Returns: an expression containing the filled strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
# text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.center(width=11, fillchar='!')
Expression = str_center(text, width=11, fillchar='!') Length: 5 dtype: str (expression)  0 !Something! 1 very pretty 2 !is coming! 3 !!!!our!!!! 4 !!!!way.!!!

contains
(pattern, regex=True)¶ Check if a string pattern or regex is contained within a sample of a string column.
Parameters:  pattern (str) – A string or regex pattern
 regex (bool) – If True,
Returns: an expression which is evaluated to True if the pattern is found in a given sample, and it is False otherwise.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.contains('very') Expression = str_contains(text, 'very') Length: 5 dtype: bool (expression)  0 False 1 True 2 False 3 False 4 False

count
(pat, regex=False)¶ Count the occurrences of a pattern in each sample of a string column.
Parameters:  pat (str) – A string or regex pattern
 regex (bool) – If True,
Returns: an expression containing the number of times a pattern is found in each sample.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.count(pat="et", regex=False) Expression = str_count(text, pat='et', regex=False) Length: 5 dtype: int64 (expression)  0 1 1 1 2 0 3 0 4 0

endswith
(pat)¶ Check if the end of each string sample matches the specified pattern.
Parameters: pat (str) – A string pattern or a regex Returns: an expression evaluated to True if the pattern is found at the end of a given sample, False otherwise. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.endswith(pat="ing") Expression = str_endswith(text, pat='ing') Length: 5 dtype: bool (expression)  0 True 1 False 2 True 3 False 4 False

find
(sub, start=0, end=None)¶ Returns the lowest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned.
Parameters:  sub (str) – A substring to be found in the samples
 start (int) –
 end (int) –
Returns: an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.find(sub="et")
Expression = str_find(text, sub='et') Length: 5 dtype: int64 (expression)  0 3 1 7 2 -1 3 -1 4 -1

get
(i)¶ Extract a character from each sample at the specified position from a string column. Note that if the specified position is out of bounds of the string sample, this method returns ‘’, while pandas returns nan.
Parameters: i (int) – The index location, at which to extract the character. Returns: an expression containing the extracted characters. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.get(5) Expression = str_get(text, 5) Length: 5 dtype: str (expression)  0 h 1 p 2 m 3 4

index
(sub, start=0, end=None)¶ Returns the lowest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned. It is the same as str.find.
Parameters:  sub (str) – A substring to be found in the samples
 start (int) –
 end (int) –
Returns: an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.index(sub="et")
Expression = str_index(text, sub='et') Length: 5 dtype: int64 (expression)  0 3 1 7 2 -1 3 -1 4 -1
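The documented behaviour differs from Python's own str.index, which raises ValueError when the substring is absent. The plain-Python sketch below models the documented result; index_or_minus_one is a hypothetical helper, not a vaex function.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

def index_or_minus_one(s, sub):
    """Like str.index, but return -1 instead of raising ValueError."""
    try:
        return s.index(sub)
    except ValueError:
        return -1

find_result = [s.find("et") for s in text]
index_result = [index_or_minus_one(s, "et") for s in text]
```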

isalnum
()¶ Check if all characters in a string sample are alphanumeric.
Returns: an expression evaluated to True if a sample contains only alphanumeric characters, otherwise False. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isalnum() Expression = str_isalnum(text) Length: 5 dtype: bool (expression)  0 True 1 False 2 False 3 True 4 False

isalpha
()¶ Check if all characters in a string sample are alphabetic.
Returns: an expression evaluated to True if a sample contains only alphabetic characters, otherwise False. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isalpha() Expression = str_isalpha(text) Length: 5 dtype: bool (expression)  0 True 1 False 2 False 3 True 4 False

isdigit
()¶ Check if all characters in a string sample are digits.
Returns: an expression evaluated to True if a sample contains only digits, otherwise False. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', '6'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 6
>>> df.text.str.isdigit() Expression = str_isdigit(text) Length: 5 dtype: bool (expression)  0 False 1 False 2 False 3 False 4 True

islower
()¶ Check if all characters in a string sample are lowercase characters.
Returns: an expression evaluated to True if a sample contains only lowercase characters, otherwise False. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.islower() Expression = str_islower(text) Length: 5 dtype: bool (expression)  0 False 1 True 2 True 3 True 4 True

isspace
()¶ Check if all characters in a string sample are whitespaces.
Returns: an expression evaluated to True if a sample contains only whitespaces, otherwise False. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', ' ', ' '] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 4
>>> df.text.str.isspace() Expression = str_isspace(text) Length: 5 dtype: bool (expression)  0 False 1 False 2 False 3 True 4 True

isupper
()¶ Check if all characters in a string sample are uppercase characters.
Returns: an expression evaluated to True if a sample contains only uppercase characters, otherwise False. Example:
>>> import vaex >>> text = ['SOMETHING', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 SOMETHING 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isupper() Expression = str_isupper(text) Length: 5 dtype: bool (expression)  0 True 1 False 2 False 3 False 4 False

join
(sep)¶ Same as find (difference with pandas is that it does not raise a ValueError)

len
()¶ Returns the length of a string sample.
Returns: an expression containing the length of each sample of a string column. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.len() Expression = str_len(text) Length: 5 dtype: int64 (expression)  0 9 1 11 2 9 3 3 4 4

ljust
(width, fillchar=' ')¶ Fills the right side of string samples with a specified character such that the strings are left-hand justified.
Parameters:  width (int) – The minimal width of the strings.
 fillchar (str) – The character used for filling.
Returns: an expression containing the filled strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.ljust(width=10, fillchar='!') Expression = str_ljust(text, width=10, fillchar='!') Length: 5 dtype: str (expression)  0 Something! 1 very pretty 2 is coming! 3 our!!!!!!! 4 way.!!!!!!

lower
()¶ Converts string samples to lower case.
Returns: an expression containing the converted strings. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.lower() Expression = str_lower(text) Length: 5 dtype: str (expression)  0 something 1 very pretty 2 is coming 3 our 4 way.

lstrip
(to_strip=None)¶ Remove leading characters from a string sample.
Parameters: to_strip (str) – The string to be removed Returns: an expression containing the modified string column. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.lstrip(to_strip='very ') Expression = str_lstrip(text, to_strip='very ') Length: 5 dtype: str (expression)  0 Something 1 pretty 2 is coming 3 our 4 way.

match
(pattern)¶ Check if a string sample matches a given regular expression.
Parameters: pattern (str) – a string or regex to match to a string sample. Returns: an expression which is evaluated to True if a match is found, False otherwise. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.match(pattern='our') Expression = str_match(text, pattern='our') Length: 5 dtype: bool (expression)  0 False 1 False 2 False 3 True 4 False
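The difference between contains and match can be illustrated with Python's re module, where search looks anywhere in the string and match only at its start. Whether vaex anchors match in exactly this way is an assumption here; the sketch shows the Python behaviour only.

```python
import re

text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Pattern found anywhere in the sample.
contains_com = [re.search("com", s) is not None for s in text]

# Pattern must match at the beginning of the sample.
match_com = [re.match("com", s) is not None for s in text]
match_our = [re.match("our", s) is not None for s in text]
```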

pad
(width, side='left', fillchar=' ')¶ Pad strings in a given column.
Parameters:  width (int) – The total width of the string
 side (str) – If ‘left’ then pad on the left, if ‘right’ then pad on the right side of the string.
 fillchar (str) – The character used for padding.
Returns: an expression containing the padded strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.pad(width=10, side='left', fillchar='!') Expression = str_pad(text, width=10, side='left', fillchar='!') Length: 5 dtype: str (expression)  0 !Something 1 very pretty 2 !is coming 3 !!!!!!!our 4 !!!!!!way.
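The padding outputs shown for pad, ljust, rjust and center agree with Python's built-in string methods on the same data. The sketch below reproduces them with plain strings; treating pad(side='left') as equivalent to rjust is inferred from the matching example outputs, not from the source code.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Fill on the right: the text stays at the left edge.
left_justified = [s.ljust(10, '!') for s in text]

# Fill on the left: the text moves to the right edge.
right_justified = [s.rjust(10, '!') for s in text]

# Fill on both sides.
centered = [s.center(11, '!') for s in text]
```

Strings already longer than the requested width are returned unchanged.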

repeat
(repeats)¶ Duplicate each string in a column.
Parameters: repeats (int) – number of times each string sample is to be duplicated. Returns: an expression containing the duplicated strings Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.repeat(3) Expression = str_repeat(text, 3) Length: 5 dtype: str (expression)  0 SomethingSomethingSomething 1 very prettyvery prettyvery pretty 2 is comingis comingis coming 3 ourourour 4 way.way.way.

replace
(pat, repl, n=-1, flags=0, regex=False)¶ Replace occurrences of a pattern/regex in a column with some other string.
Parameters:  pat (str) – string or a regex pattern
 repl (str) – a replacement string
 n (int) – number of replacements to be made from the start. If -1 make all replacements.
 flags (int) –
??
 regex (bool) – If True, …?
Returns: an expression containing the string replacements.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.replace(pat='et', repl='__') Expression = str_replace(text, pat='et', repl='__') Length: 5 dtype: str (expression)  0 Som__hing 1 very pr__ty 2 is coming 3 our 4 way.
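The role of n can be shown with Python's str.replace, whose count argument follows the same convention of -1 meaning "replace everything". This is an analogy on a plain string; the sample value is made up for illustration.

```python
sample = "pretty letter"

# count=-1 replaces every occurrence.
all_replaced = sample.replace("et", "__", -1)

# count=1 replaces only the first occurrence from the start.
first_only = sample.replace("et", "__", 1)
```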

rfind
(sub, start=0, end=None)¶ Returns the highest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned.
Parameters:  sub (str) – A substring to be found in the samples
 start (int) –
 end (int) –
Returns: an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rfind(sub="et")
Expression = str_rfind(text, sub='et') Length: 5 dtype: int64 (expression)  0 3 1 7 2 -1 3 -1 4 -1

rindex
(sub, start=0, end=None)¶ Returns the highest indices in each string in a column, where the provided substring is fully contained within a sample. If the substring is not found, -1 is returned. Same as str.rfind.
Parameters:  sub (str) – A substring to be found in the samples
 start (int) –
 end (int) –
Returns: an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rindex(sub="et")
Expression = str_rindex(text, sub='et') Length: 5 dtype: int64 (expression)  0 3 1 7 2 -1 3 -1 4 -1
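With the substring "et" the lowest and highest indices coincide, so the examples for find and rfind look identical. A substring that occurs more than once in a sample makes the difference visible. The sketch uses Python's str.find and str.rfind on the same data.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

lowest = [s.find("t") for s in text]    # first occurrence, or -1
highest = [s.rfind("t") for s in text]  # last occurrence, or -1
```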

rjust
(width, fillchar=' ')¶ Fills the left side of string samples with a specified character such that the strings are right-hand justified.
Parameters:  width (int) – The minimal width of the strings.
 fillchar (str) – The character used for filling.
Returns: an expression containing the filled strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rjust(width=10, fillchar='!') Expression = str_rjust(text, width=10, fillchar='!') Length: 5 dtype: str (expression)  0 !Something 1 very pretty 2 !is coming 3 !!!!!!!our 4 !!!!!!way.

rstrip
(to_strip=None)¶ Remove trailing characters from a string sample.
Parameters: to_strip (str) – The string to be removed Returns: an expression containing the modified string column. Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rstrip(to_strip='ing') Expression = str_rstrip(text, to_strip='ing') Length: 5 dtype: str (expression)  0 Someth 1 very pretty 2 is com 3 our 4 way.

slice
(start=0, stop=None)¶ Slice substrings from each string element in a column.
Parameters:  start (int) – The start position for the slice operation.
 stop (int) – The stop position for the slice operation.
Returns: an expression containing the sliced substrings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.slice(start=2, stop=5)
Expression = str_pandas_slice(text, start=2, stop=5)
Length: 5 dtype: str (expression)
0  met
1  ry
2   co
3    r
4   y.
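The start/stop semantics appear to follow ordinary Python slicing, where start is inclusive and stop is exclusive. The plain-Python sketch below (not vaex code) applies the same bounds to the sample strings; repr is used because two of the results contain a space that is hard to see in tabular output.

```python
# Plain-Python illustration (no vaex required).
# Assumption: slice(start, stop) behaves like s[start:stop] on each string.
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

sliced = [s[2:5] for s in text]

for original, result in zip(text, sliced):
    print(repr(original), '->', repr(result))
```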

startswith
(pat)¶ Check if the start of each string matches a pattern.
Parameters: pat (str) – A string pattern. Regular expressions are not supported.
Returns: an expression which is evaluated to True if the pattern is found at the start of a string sample, False otherwise.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.startswith(pat='is')
Expression = str_startswith(text, pat='is')
Length: 5 dtype: bool (expression)
0  False
1  False
2   True
3  False
4  False

strip
(to_strip=None)¶ Removes leading and trailing characters.
Strips whitespace (including newlines), or a set of specified characters, from each string sample in a column, from both the left and right sides.
Parameters:  to_strip (str) – The characters to be removed. All combinations of the characters will be removed. If None, whitespace is removed.
Returns: an expression containing the modified string samples.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.strip(to_strip='very')
Expression = str_strip(text, to_strip='very')
Length: 5 dtype: str (expression)
0  Something
1      prett
2  is coming
3         ou
4       way.
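Because to_strip is treated as a set of characters rather than a literal prefix or suffix, the results can be surprising: 'our' loses its trailing 'r', and 'very pretty' keeps a leading space because the space is not in the set. The plain-Python sketch below (not vaex code) uses the built-in str.strip, which follows the same character-set rule and reproduces the example output.

```python
# Plain-Python illustration (no vaex required).
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Every leading/trailing character that is one of 'v', 'e', 'r', 'y' is
# removed; stripping stops at the first character outside that set.
stripped = [s.strip('very') for s in text]

for original, result in zip(text, stripped):
    print(repr(original), '->', repr(result))
```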

title
()¶ Converts all string samples to titlecase.
Returns: an expression containing the converted strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.title()
Expression = str_title(text)
Length: 5 dtype: str (expression)
0    Something
1  Very Pretty
2    Is Coming
3          Our
4         Way.

upper
()¶ Converts all strings in a column to uppercase.
Returns: an expression containing the converted strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.upper()
Expression = str_upper(text)
Length: 5 dtype: str (expression)
0    SOMETHING
1  VERY PRETTY
2    IS COMING
3          OUR
4         WAY.

zfill
(width)¶ Pad strings in a column by prepending “0” characters.
Parameters: width (int) – The minimum length of the resulting string. Strings shorter than width will be prepended with zeros.
Returns: an expression containing the modified strings.
Example:
>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.zfill(width=12)
Expression = str_zfill(text, width=12)
Length: 5 dtype: str (expression)
0  000Something
1  0very pretty
2  000is coming
3  000000000our
4  00000000way.
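The zero-padding in this example matches what Python's built-in str.zfill produces for the same strings, which gives a quick way to reason about the result length. This is a plain-Python sketch, not vaex code.

```python
# Plain-Python illustration (no vaex required).
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Each string shorter than 12 characters is left-padded with '0'.
filled = [s.zfill(12) for s in text]
print(filled)
```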


class
vaex.expression.
StringOperationsPandas
(expression)[source]¶ Bases:
object
String operations using Pandas Series

__weakref__
¶ list of weak references to the object (if defined)

byte_length
(**kwargs)¶ Wrapper around pandas.Series.byte_length

capitalize
(**kwargs)¶ Wrapper around pandas.Series.capitalize

cat
(**kwargs)¶ Wrapper around pandas.Series.cat

center
(**kwargs)¶ Wrapper around pandas.Series.center

contains
(**kwargs)¶ Wrapper around pandas.Series.contains

count
(**kwargs)¶ Wrapper around pandas.Series.count

endswith
(**kwargs)¶ Wrapper around pandas.Series.endswith

find
(**kwargs)¶ Wrapper around pandas.Series.find

get
(**kwargs)¶ Wrapper around pandas.Series.get

index
(**kwargs)¶ Wrapper around pandas.Series.index

isalnum
(**kwargs)¶ Wrapper around pandas.Series.isalnum

isalpha
(**kwargs)¶ Wrapper around pandas.Series.isalpha

isdigit
(**kwargs)¶ Wrapper around pandas.Series.isdigit

islower
(**kwargs)¶ Wrapper around pandas.Series.islower

isspace
(**kwargs)¶ Wrapper around pandas.Series.isspace

isupper
(**kwargs)¶ Wrapper around pandas.Series.isupper

join
(**kwargs)¶ Wrapper around pandas.Series.join

len
(**kwargs)¶ Wrapper around pandas.Series.len

ljust
(**kwargs)¶ Wrapper around pandas.Series.ljust

lower
(**kwargs)¶ Wrapper around pandas.Series.lower

lstrip
(**kwargs)¶ Wrapper around pandas.Series.lstrip

match
(**kwargs)¶ Wrapper around pandas.Series.match

pad
(**kwargs)¶ Wrapper around pandas.Series.pad

repeat
(**kwargs)¶ Wrapper around pandas.Series.repeat

replace
(**kwargs)¶ Wrapper around pandas.Series.replace

rfind
(**kwargs)¶ Wrapper around pandas.Series.rfind

rindex
(**kwargs)¶ Wrapper around pandas.Series.rindex

rjust
(**kwargs)¶ Wrapper around pandas.Series.rjust

rstrip
(**kwargs)¶ Wrapper around pandas.Series.rstrip

slice
(**kwargs)¶ Wrapper around pandas.Series.slice

split
(**kwargs)¶ Wrapper around pandas.Series.split

startswith
(**kwargs)¶ Wrapper around pandas.Series.startswith

strip
(**kwargs)¶ Wrapper around pandas.Series.strip

title
(**kwargs)¶ Wrapper around pandas.Series.title

upper
(**kwargs)¶ Wrapper around pandas.Series.upper

zfill
(**kwargs)¶ Wrapper around pandas.Series.zfill

Machine learning with vaex.ml¶
Note that vaex.ml does not fall under the MIT license, but under the CC BY-CC-ND LICENSE, which means it’s ok for personal or academic use. You can install vaex-ml using pip install vaex-ml.
Clustering¶

class
vaex.ml.cluster.
KMeans
(cluster_centers=traitlets.Undefined, features=traitlets.Undefined, inertia=None, init='random', max_iter=300, n_clusters=2, n_init=1, prediction_label='prediction_kmeans', random_state=None, verbose=False)[source]¶ Bases:
vaex.ml.state.HasState
The KMeans clustering algorithm.
>>> import vaex.ml
>>> import vaex.ml.cluster
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> cls = vaex.ml.cluster.KMeans(n_clusters=3, features=features, init='random', max_iter=10)
>>> df_train = cls.fit_transform(df_train)
>>> df_test = cls.transform(df_test)
Parameters:  cluster_centers – Coordinates of cluster centers.
 features – List of features to cluster.
 inertia – Sum of squared distances of samples to their closest cluster center.
 init – Method for initializing the centroids.
 max_iter – Maximum number of iterations of the KMeans algorithm for a single run.
 n_clusters – Number of clusters to form.
 n_init – Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.
 prediction_label – The name of the virtual column that houses the cluster labels for each point.
 random_state – Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.
 verbose – If True, enable verbosity mode.
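To make the roles of n_clusters, max_iter, random_state and inertia concrete, here is a minimal pure-Python sketch of Lloyd's algorithm with random initialization. It is not the vaex.ml implementation; the function name kmeans and the toy data are hypothetical, and details such as tie-breaking and empty-cluster handling are assumptions of this sketch.

```python
import random


def kmeans(points, n_clusters=2, max_iter=300, random_state=None):
    """Minimal Lloyd's algorithm; returns (centers, labels, inertia)."""
    rng = random.Random(random_state)
    # Random initialization: pick distinct data points as initial centroids.
    centers = [list(p) for p in rng.sample(points, n_clusters)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def assign(current_centers):
        # Label each point with the index of its closest centroid.
        return [
            min(range(n_clusters), key=lambda k: sq_dist(p, current_centers[k]))
            for p in points
        ]

    labels = assign(centers)
    for _ in range(max_iter):
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
        new_labels = assign(centers)
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels

    # Inertia: sum of squared distances of points to their closest centroid.
    inertia = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return centers, labels, inertia


# Two well-separated toy blobs (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels, inertia = kmeans(points, n_clusters=2, max_iter=10, random_state=42)
print(labels, inertia)
```

Running the algorithm n_init times with different seeds and keeping the run with the lowest inertia corresponds to the selection rule described for n_init above.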
PCA¶

class
vaex.ml.transformations.
PCA
(features=traitlets.Undefined, n_components=2, prefix='PCA_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Transform a set of features using a Principal Component Analysis.
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> pca = vaex.ml.PCA(features=features, n_components=3)
>>> df_train = pca.fit_transform(df_train)
>>> df_test = pca.transform(df_test)
Parameters:  features – List of features to transform.
 n_components – Number of components to retain.
 prefix – Prefix for the names of the transformed features.
Encoders¶

class
vaex.ml.transformations.
LabelEncoder
(features=traitlets.Undefined, prefix='Prefix for the names of the transformed features.')[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical columns with integer values between 0 and num_classes-1.
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_titanic()
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> encoder = vaex.ml.LabelEncoder(features=['sex', 'embarked'])
>>> df_train = encoder.fit_transform(df_train)
>>> df_test = encoder.transform(df_test)
Parameters:  features – List of features to transform.
 prefix – Prefix for the names of the transformed features.
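The idea of label encoding can be sketched in a few lines of plain Python: each distinct category is mapped to an integer in the range 0 to num_classes-1, and the mapping learned during fitting is reused when transforming other data. The helper fit_label_encoder is hypothetical, and assigning codes in sorted order is an assumption of this sketch; the order used by vaex.ml may differ.

```python
def fit_label_encoder(values):
    """Learn a category -> integer mapping (hypothetical helper)."""
    # Assumption: codes are assigned in sorted order of the distinct labels.
    return {label: code for code, label in enumerate(sorted(set(values)))}


sex_train = ['male', 'female', 'female', 'male']
mapping = fit_label_encoder(sex_train)

# The mapping learned on the training data is reused for the test data.
sex_test = ['female', 'male']
encoded_train = [mapping[v] for v in sex_train]
encoded_test = [mapping[v] for v in sex_test]
print(mapping, encoded_train, encoded_test)
```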

class
vaex.ml.transformations.
OneHotEncoder
(features=traitlets.Undefined, one=1, prefix='', zero=0)[source]¶ Bases:
vaex.ml.transformations.Transformer
Encode categorical columns according to the one-hot scheme.
>>> import vaex.ml
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
  #  color
  0  red
  1  green
  2  green
  3  blue
  4  red
>>> encoder = vaex.ml.OneHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
  #  color      color_blue    color_green    color_red
  0  red                 0              0            1
  1  green               0              1            0
  2  green               0              1            0
  3  blue                1              0            0
  4  red                 0              0            1
Parameters:  features – List of features to transform.
 one – Value to encode when a category is present.
 prefix – Prefix for the names of the transformed features.
 zero – Value to encode when category is absent.
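The effect of the one, zero and prefix parameters can be illustrated with a small pure-Python sketch that rebuilds the encoded columns from the example above. The function one_hot is hypothetical and not part of vaex.ml; the column naming (prefix, feature name, underscore, category) is inferred from the example output.

```python
def one_hot(feature, values, prefix='', one=1, zero=0):
    """Return {column_name: encoded_values} for one categorical feature."""
    columns = {}
    for category in sorted(set(values)):
        # One output column per category, e.g. 'color_blue'.
        name = f'{prefix}{feature}_{category}'
        columns[name] = [one if v == category else zero for v in values]
    return columns


color = ['red', 'green', 'green', 'blue', 'red']
columns = one_hot('color', color)
for name, encoded in columns.items():
    print(name, encoded)

# Custom markers, for example +1/-1 instead of 1/0.
signed = one_hot('color', color, one=1, zero=-1)
```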

class
vaex.ml.transformations.
StandardScaler
(features=traitlets.Undefined, prefix='standard_scaled_', with_mean=True, with_std=True)[source]¶ Bases:
vaex.ml.transformations.Transformer
Standardize features by removing their mean and scaling them to unit variance.
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.StandardScaler(features=features, with_mean=True, with_std=True)
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:  features – List of features to transform.
 prefix – Prefix for the names of the transformed features.
 with_mean – If True, remove the mean from each feature.
 with_std – If True, scale each feature to unit variance.
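The transformation amounts to (x - mean) / std per feature, with each half switched by with_mean and with_std. A minimal pure-Python sketch follows; the function standard_scale is hypothetical, and using the population standard deviation is an assumption about the convention.

```python
import statistics


def standard_scale(values, with_mean=True, with_std=True):
    """Standardize one feature: subtract the mean, divide by the std."""
    mean = statistics.mean(values) if with_mean else 0.0
    # Assumption: population standard deviation (divides by N, not N - 1).
    std = statistics.pstdev(values) if with_std else 1.0
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standard_scale(values)
centered_only = standard_scale(values, with_std=False)
print(scaled)
```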

class
vaex.ml.transformations.
MinMaxScaler
(feature_range=traitlets.Undefined, features=traitlets.Undefined, prefix='minmax_scaled_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Scale a set of features to a given range.
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.MinMaxScaler(features=features, feature_range=(0, 1))
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:  feature_range – The range the features are scaled to.
 features – List of features to transform.
 prefix – Prefix for the names of the transformed features.
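Min-max scaling maps the smallest observed value of a feature to the lower bound of feature_range and the largest to the upper bound, interpolating linearly in between. The sketch below is plain Python with a hypothetical function minmax_scale; it illustrates the formula and is not the vaex.ml code.

```python
def minmax_scale(values, feature_range=(0, 1)):
    """Linearly rescale one feature to feature_range."""
    low, high = min(values), max(values)
    a, b = feature_range
    # a + (x - min) * (b - a) / (max - min)
    return [a + (v - low) * (b - a) / (high - low) for v in values]


values = [2.0, 4.0, 6.0, 10.0]
unit = minmax_scale(values)
symmetric = minmax_scale(values, feature_range=(-1, 1))
print(unit, symmetric)
```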

class
vaex.ml.transformations.
MaxAbsScaler
(features=traitlets.Undefined, prefix='absmax_scaled_')[source]¶ Bases:
vaex.ml.transformations.Transformer
Scale features by their maximum absolute value.
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.MaxAbsScaler(features=features)
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:  features – List of features to transform.
 prefix – Prefix for the names of the transformed features.
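Scaling by the maximum absolute value divides each feature by max(|x|), so the result lies in [-1, 1] and zeros stay zero. A minimal plain-Python sketch, with a hypothetical function maxabs_scale that is not part of vaex.ml:

```python
def maxabs_scale(values):
    """Divide one feature by its maximum absolute value."""
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]


values = [-4.0, 0.0, 1.0, 2.0]
scaled = maxabs_scale(values)
print(scaled)
```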

class
vaex.ml.transformations.
RobustScaler
(features=traitlets.Undefined, percentile_range=traitlets.Undefined, prefix='robust_scaled_', with_centering=True, with_scaling=True)[source]¶ Bases:
vaex.ml.transformations.Transformer
The RobustScaler removes the median and scales the data according to a given percentile range. By default, the scaling is done between the 25th and the 75th percentile. Centering and scaling happens independently for each feature (column).
>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.RobustScaler(features=features, percentile_range=(25, 75))
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:  features – List of features to transform.
 percentile_range – The percentile range used to scale each feature.
 prefix – Prefix for the names of the transformed features.
 with_centering – If True, remove the median.
 with_scaling – If True, scale each feature between the specified percentile range.
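The robust scaling described above can be sketched as (x - median) / (p_high - p_low), where p_low and p_high are the percentiles given by percentile_range. The plain-Python example below uses hypothetical helpers percentile and robust_scale; the linear-interpolation percentile is an assumption, and vaex.ml may estimate percentiles differently.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) by linear interpolation between closest ranks."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def robust_scale(values, percentile_range=(25, 75),
                 with_centering=True, with_scaling=True):
    """Remove the median and divide by the inter-percentile range."""
    ordered = sorted(values)
    center = statistics.median(ordered) if with_centering else 0.0
    if with_scaling:
        low, high = percentile_range
        scale = percentile(ordered, high) - percentile(ordered, low)
    else:
        scale = 1.0
    return [(v - center) / scale for v in values]


# The outlier (100.0) barely influences the median or the percentile range.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(values)
print(scaled)
```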
Boosted trees¶

class
vaex.ml.lightgbm.
LightGBMModel
(features=traitlets.Undefined, num_round=0, param=traitlets.Undefined, prediction_name='lightgbm_prediction')[source]¶ Bases:
vaex.ml.state.HasState
The LightGBM algorithm.
This class provides an interface to the LightGBM algorithm, with some optimizations for better memory efficiency when training large datasets. The algorithm itself is not modified at all.
LightGBM is a fast gradient boosting algorithm based on decision trees and is mainly used for classification, regression and ranking tasks. It is under the umbrella of the Distributed Machine Learning Toolkit (DMTK) project of Microsoft. For more information, please visit https://github.com/Microsoft/LightGBM/.
>>> import vaex.ml
>>> import vaex.ml.lightgbm
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> params = {
...     'boosting': 'gbdt',
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'application': 'multiclass',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80}
>>> booster = vaex.ml.lightgbm.LightGBMModel(features=features, num_round=100, param=params)
>>> booster.fit(df_train, 'class_')
>>> df_train = booster.transform(df_train)
>>> df_test = booster.transform(df_test)
Parameters:  features – List of features to use when fitting the LightGBMModel.
 num_round – Number of boosting iterations.
 param – Parameters to be passed on to the LightGBM model.
 prediction_name – The name of the virtual column housing the predictions.

fit
(dataset, label, copy=False)[source]¶ Fit the LightGBMModel to the dataset.
Parameters:  dataset – A vaex dataset.
 label – The name of the column containing the target variable.
 copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.
Returns: self

predict
(dataset, copy=False)[source]¶ Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex dataset.
Parameters:  dataset – A vaex dataset.
 copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.
Returns: an in-memory numpy array containing the LightGBMModel predictions.

class
vaex.ml.lightgbm.
LightGBMClassifier
(features=traitlets.Undefined, num_round=0, param=traitlets.Undefined, prediction_name='lightgbm_prediction')[source]¶ Bases:
vaex.ml.lightgbm.LightGBMModel
Parameters:  features – List of features to use when fitting the LightGBMModel.
 num_round – Number of boosting iterations.
 param – Parameters to be passed on to the LightGBM model.
 prediction_name – The name of the virtual column housing the predictions.

predict
(dataset, copy=False)[source]¶ Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex dataset.
Parameters:  dataset – A vaex dataset.
 copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.
Returns: an in-memory numpy array containing the LightGBMModel predictions.
Nearest neighbour¶
Annoy support is in the incubator phase, which means support may disappear in future versions.

class
vaex.ml.incubator.annoy.
ANNOYModel
(features=traitlets.Undefined, metric='euclidean', n_neighbours=10, n_trees=10, predcition_name='annoy_prediction', prediction_name='annoy_prediction', search_k=1)[source]¶ Bases:
vaex.ml.state.HasState
Parameters:  features – List of features to use.
 metric – Metric to use for distance calculations.
 n_neighbours – How many neighbours to find.
 n_trees – Number of trees to build.
 predcition_name – Output column name for the neighbours when transforming a dataset
 prediction_name – Output column name for the neighbours when transforming a dataset
 search_k – Jovan?