API documentation for vaex library

Vaex is a library for dealing with big tabular data.

The most important class (data structure) in vaex is the Dataset. A Dataset is obtained by opening the example dataset:

>>> import vaex as vx
>>> t = vx.example()

Or opening a file:

>>> t1 = vx.open("somedata.hdf5")
>>> t2 = vx.open("somedata.fits")
>>> t3 = vx.open("somedata.csv")

Or connecting to a remote server:

>>> tbig = vx.open("http://bla.com/bigtable")

The main purpose of vaex is to provide statistics, such as mean, count, sum, and standard deviation, per column, possibly with a selection, and on a regular grid.

To count the number of rows:

>>> t = vx.example()
>>> t.count()
330000.0

Or the number of valid values, which for this dataset is the same:

>>> t.count("x")
330000.0

Count them on a regular grid:

>>> t.count("x", binby=["x", "y"], shape=(4,4))
array([[   902.,   5893.,   5780.,   1193.],
       [  4097.,  71445.,  75916.,   4560.],
       [  4743.,  71131.,  65560.,   4108.],
       [  1115.,   6578.,   4382.,    821.]])
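
Statistics can also be restricted to a selection; a minimal sketch using the default selection:

>>> t.select("x > 0")
>>> t.count("x", selection=True)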

Visualise it using matplotlib:

>>> t.plot("x", "y", show=True)
<matplotlib.image.AxesImage at 0x1165a5090>
vaex.open(path, *args, **kwargs)[source]

Open a dataset from file given by path

Parameters:
  • path (str) – local or absolute path to file
  • args – extra arguments for file readers that need it
  • kwargs – extra keyword arguments
Returns:

a Dataset if the file is supported, otherwise None

Return type:

Dataset

Example:
>>> import vaex as vx
>>> vx.open('myfile.hdf5')
<vaex.dataset.Hdf5MemoryMapped at 0x1136ee3d0>
>>> vx.open('gadget_file.hdf5', 3) # this will read only particle type 3
<vaex.dataset.Hdf5MemoryMappedGadget at 0x1136ef3d0>
vaex.server(url, **kwargs)[source]

Connect to hostname supporting the vaex web api

Parameters:
  • hostname (str) – hostname or IP address of the server
Returns:

a server object; note that it does not connect to the server yet, so this will always succeed

Return type:

vaex.dataset.ServerRest
vaex.example(download=True)[source]

Returns an example dataset which comes with vaex for testing/learning purposes

Return type:vaex.dataset.Dataset
vaex.from_arrays(**arrays)[source]

Create an in memory dataset from numpy arrays

Parameters:
  • arrays – keyword arguments with arrays
Example:
>>> x = np.arange(10)
>>> y = x ** 2
>>> dataset = vx.from_arrays(x=x, y=y)
vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')[source]

Create an in memory dataset from a pandas dataframe

Parameters:
  • df (pandas.DataFrame) – Pandas dataframe
  • name – unique name for the dataset
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> ds = vx.from_pandas(df, name="test")
vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory dataset from an ascii file (whitespace separated by default).

>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters:
  • path – file path
  • seperator – value separator, by default whitespace; use "," for comma separated values
  • names – If True, the first line is used for the column names, otherwise provide a list of strings with names
  • skip_lines – skip lines at the start of the file
  • skip_after – skip lines at the end of the file
  • kwargs
Returns:

vaex.from_samp(username=None, password=None)[source]

Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the dataset

Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook
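
A minimal usage sketch; this blocks until a table is broadcast over SAMP (e.g. from TOPCAT):

>>> ds = vx.from_samp()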

vaex.open_many(filenames)[source]

Open a list of filenames, and return a dataset with all datasets concatenated

Parameters:
  • filenames (list[str]) – list of filenames/paths
Return type:

Dataset
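
Example, with hypothetical file names:

>>> ds = vx.open_many(["part1.hdf5", "part2.hdf5"])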
vaex.app(*args, **kwargs)[source]

Create a vaex app, the QApplication mainloop must be started.

In the IPython notebook/Jupyter do the following:

>>> import vaex.ui.main # this causes the qt api level to be set properly
>>> import vaex as vx

Next cell:

>>> %gui qt

Next cell:

>>> app = vx.app()

From now on, you can run the app along with jupyter

vaex.zeldovich(dim=2, N=256, n=-2.5, t=None, scale=1, seed=None)[source]

Creates a zeldovich dataset

vaex.set_log_level_debug()[source]

set log level to debug

vaex.set_log_level_info()[source]

set log level to info

vaex.set_log_level_warning()[source]

set log level to warning

vaex.set_log_level_exception()[source]

set log level to exception

vaex.set_log_level_off()[source]

Disable logging

Subpackages

Submodules

vaex.dataset module

class vaex.dataset.Dataset(name, column_names, executor=None)[source]

Bases: object

All datasets are encapsulated in this class, local or remote datasets

Each dataset has a number of columns, and a number of rows, the length of the dataset.

The most common operations are Dataset.plot and the statistics methods such as Dataset.count and Dataset.mean.

All Datasets have one ‘selection’, and all calculations by Subspace are done on the whole dataset (default) or for the selection. The following example shows how to use the selection.

>>> some_dataset.select("x < 0")
>>> subspace_xy = some_dataset("x", "y")
>>> subspace_xy_selected = subspace_xy.selected()

TODO: active fraction, length and shuffled

add_column(name, f_or_array)[source]

Add an in memory array as a column
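
A minimal sketch, assuming the array length matches the full length of the dataset:

>>> import numpy as np
>>> ds.add_column("ones", np.ones(ds.full_length()))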

add_column_healpix(name='healpix', longitude='ra', latitude='dec', degrees=True, healpix_order=12, nest=True)[source]

Add a healpix (in memory) column based on a longitude and latitude

Parameters:
  • name – Name of column
  • longitude – longitude expression
  • latitude – latitude expression (astronomical convention latitude=90 is north pole)
  • degrees – If lon/lat are in degrees (default) or radians.
  • healpix_order – healpix order, >= 0
  • nest – Nested healpix (default) or ring.
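
A sketch, assuming the dataset has ra and dec columns in degrees:

>>> ds.add_column_healpix("healpix", longitude="ra", latitude="dec")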
add_variable(name, expression, overwrite=True)[source]

Add a variable column to the dataset

Parameters:
  • name (str) – name of the variable
  • expression – expression for the variable

Variables may refer to other variables, and virtual columns and expressions may refer to variables

Example:
>>> dataset.add_variable("center", 0.)
>>> dataset.add_virtual_column("x_prime", "x-center")
>>> dataset.select("x_prime < 0")
add_virtual_column(name, expression)[source]

Add a virtual column to the dataset

Example:
>>> dataset.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> dataset.select("r < 10")

Parameters:
  • name (str) – name of virtual column
  • expression – expression for the column
add_virtual_column_bearing(name, lon1, lat1, lon2, lat2)[source]
add_virtual_columns_aitoff(alpha, delta, x, y, radians=True)[source]

Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection

Parameters:
  • alpha – azimuth angle
  • delta – polar angle
  • x – output name for x coordinate
  • y – output name for y coordinate
  • radians – input and output in radians (True), or degrees (False)
Returns:

add_virtual_columns_cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', cov_matrix_x_y=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', radians=False)[source]

Convert cartesian to polar coordinates

Parameters:
  • x – expression for x
  • y – expression for y
  • radius_out – name for the virtual column for the radius
  • azimuth_out – name for the virtual column for the azimuth angle
  • cov_matrix_x_y – List all covariance values as a double list of expressions, or "full" to guess all entries (which gives an error when values are not found), or "auto" to guess, but allow for missing values
  • covariance_postfix
  • uncertainty_postfix
  • radians – if True, azimuth is in radians, defaults to degrees
Returns:
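
A minimal sketch using the default output names documented above:

>>> ds.add_virtual_columns_cartesian_to_polar(x="x", y="y")
>>> ds.mean("r_polar")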

add_virtual_columns_cartesian_to_spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position')[source]

Convert cartesian to spherical coordinates.

Parameters:
  • x
  • y
  • z
  • alpha
  • delta – name for polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True).
  • distance
  • radians
  • center
  • center_name
Returns:

add_virtual_columns_cartesian_velocities_to_pmvr(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', pm_long='pm_long', pm_lat='pm_lat', distance=None)[source]

Convert velocities from a cartesian system to proper motions and radial velocities

TODO: errors

Parameters:
  • x – name of x column (input)
  • y – y
  • z – z
  • vx – vx
  • vy – vy
  • vz – vz
  • vr – name of the column for the radial velocity in the r direction (output)
  • pm_long – name of the column for the proper motion component in the longitude direction (output)
  • pm_lat – name of the column for the proper motion component in the latitude direction, positive points to the north pole (output)
  • distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
Returns:

add_virtual_columns_cartesian_velocities_to_polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', cov_matrix_x_y_vx_vy=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty')[source]

Convert cartesian to polar velocities.

Parameters:
  • x
  • y
  • vx
  • radius_polar – Optional expression for the radius, may lead to a better performance when given.
  • vy
  • vr_out
  • vazimuth_out
  • cov_matrix_x_y_vx_vy
  • covariance_postfix
  • uncertainty_postfix
Returns:

add_virtual_columns_cartesian_velocities_to_spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None)[source]

Convert velocities from a cartesian to a spherical coordinate system

TODO: errors

Parameters:
  • x – name of x column (input)
  • y – y
  • z – z
  • vx – vx
  • vy – vy
  • vz – vz
  • vr – name of the column for the radial velocity in the r direction (output)
  • vlong – name of the column for the velocity component in the longitude direction (output)
  • vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)
  • distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
Returns:

add_virtual_columns_celestial(long_in, lat_in, long_out, lat_out, input=None, output=None, name_prefix='__celestial', radians=False)[source]
add_virtual_columns_distance_from_parallax(parallax='parallax', distance_name='distance', parallax_uncertainty=None, uncertainty_postfix='_uncertainty')[source]

Convert parallax to distance (i.e. 1/parallax)

Parameters:
  • parallax – expression for the parallax, e.g. “parallax”
  • distance_name – name for the virtual column of the distance, e.g. “distance”
  • parallax_uncertainty – expression for the uncertainty on the parallax, e.g. “parallax_error”
  • uncertainty_postfix – distance_name + uncertainty_postfix is the name for the virtual column, e.g. “distance_uncertainty” by default
Returns:
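
A sketch, assuming the dataset has a parallax column:

>>> ds.add_virtual_columns_distance_from_parallax(parallax="parallax", distance_name="distance")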

add_virtual_columns_eq2ecl(long_in='ra', lat_in='dec', long_out='lambda_', lat_out='beta', input=None, output=None, name_prefix='__celestial_eq2ecl', radians=False)[source]

Add ecliptic coordinates (long_out, lat_out) from equatorial coordinates.

Parameters:
  • long_in – Name/expression for right ascension
  • lat_in – Name/expression for declination
  • long_out – Output name for lambda coordinate
  • lat_out – Output name for beta coordinate
  • input
  • output
  • name_prefix
  • radians – input and output in radians (True), or degrees (False)
Returns:

add_virtual_columns_eq2gal(long_in='ra', lat_in='dec', long_out='l', lat_out='b', input=None, output=None, name_prefix='__celestial_eq2gal', radians=False)[source]

Add galactic coordinates (long_out, lat_out) from equatorial coordinates.

Parameters:
  • long_in – Name/expression for right ascension
  • lat_in – Name/expression for declination
  • long_out – Output name for galactic longitude
  • lat_out – Output name for galactic latitude
  • input
  • output
  • name_prefix
  • radians – input and output in radians (True), or degrees (False)
Returns:
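
A sketch, assuming ra and dec columns in degrees:

>>> ds.add_virtual_columns_eq2gal(long_in="ra", lat_in="dec", long_out="l", lat_out="b")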

add_virtual_columns_equatorial_to_galactic_cartesian(alpha, delta, distance, xname, yname, zname, radians=True, alpha_gp=3.3660329196841534, delta_gp=0.47347728280415174, l_omega=0.57477043300337094)[source]

From http://arxiv.org/pdf/1306.2945v2.pdf

add_virtual_columns_lbrvr_proper_motion2vcartesian(long_in='l', lat_in='b', distance='distance', pm_long='pm_l', pm_lat='pm_b', vr='vr', vx='vx', vy='vy', vz='vz', cov_matrix_vr_distance_pm_long_pm_lat=None, uncertainty_postfix='_uncertainty', covariance_postfix='_covariance', name_prefix='__lbvr_proper_motion2vcartesian', center_v=(0, 0, 0), center_v_name='solar_motion', radians=False)[source]

Convert radial velocity and galactic proper motions (and positions) to cartesian velocities wrt the center_v

Based on http://adsabs.harvard.edu/abs/1987AJ.....93..864J

Parameters:
  • long_in – Name/expression for galactic longitude
  • lat_in – Name/expression for galactic latitude
  • distance – Name/expression for heliocentric distance
  • pm_long – Name/expression for the galactic proper motion in the longitude direction (pm_l*, so the cosine(b) term should be included)
  • pm_lat – Name/expression for the galactic proper motion in the latitude direction
  • vr – Name/expression for the radial velocity
  • vx – Output name for the cartesian velocity x-component
  • vy – Output name for the cartesian velocity y-component
  • vz – Output name for the cartesian velocity z-component
  • name_prefix
  • center_v – Extra motion that should be added, for instance lsr + motion of the sun wrt the galactic restframe
  • center_v_name
  • radians – input and output in radians (True), or degrees (False)
Returns:

add_virtual_columns_matrix3d(x, y, z, xnew, ynew, znew, matrix, matrix_name, matrix_is_expression=False)[source]
Parameters:
  • x (str) – name of x column
  • y (str) –
  • z (str) –
  • xnew (str) – name of transformed x column
  • ynew (str) –
  • znew (str) –
  • matrix (list[list]) – 2d array or list, with [row,column] order
  • matrix_name (str) –
Returns:

add_virtual_columns_projection_gnomic(alpha, delta, alpha0=0, delta0=0, x='x', y='y', radians=False, postfix='')[source]
add_virtual_columns_proper_motion2vperpendicular(distance='distance', pm_long='pm_l', pm_lat='pm_b', vl='vl', vb='vb', cov_matrix_distance_pm_long_pm_lat=None, uncertainty_postfix='_uncertainty', covariance_postfix='_covariance', radians=False)[source]

Convert proper motion to perpendicular velocities.

Parameters:
  • distance
  • pm_long
  • pm_lat
  • vl
  • vb
  • cov_matrix_distance_pm_long_pm_lat
  • uncertainty_postfix
  • covariance_postfix
  • radians
Returns:

add_virtual_columns_proper_motion_eq2gal(long_in='ra', lat_in='dec', pm_long='pm_ra', pm_lat='pm_dec', pm_long_out='pm_l', pm_lat_out='pm_b', cov_matrix_alpha_delta_pma_pmd=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', name_prefix='__proper_motion_eq2gal', radians=False)[source]

Transform/rotate proper motions from equatorial to galactic coordinates

Taken from http://arxiv.org/abs/1306.2945

Parameters:
  • long_in – Name/expression for right ascension
  • lat_in – Name/expression for declination
  • pm_long – Proper motion for ra
  • pm_lat – Proper motion for dec
  • pm_long_out – Output name for output proper motion on l direction
  • pm_lat_out – Output name for output proper motion on b direction
  • name_prefix
  • radians – input and output in radians (True), or degrees (False)
Returns:

add_virtual_columns_rotation(x, y, xnew, ynew, angle_degrees)[source]

Rotation in 2d

Parameters:
  • x (str) – Name/expression of x column
  • y (str) – idem for y
  • xnew (str) – name of transformed x column
  • ynew (str) –
  • angle_degrees (float) – rotation in degrees, anticlockwise
Returns:
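
For example, a 90 degree anticlockwise rotation (hypothetical output names):

>>> ds.add_virtual_columns_rotation("x", "y", "x_rot", "y_rot", angle_degrees=90)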

add_virtual_columns_spherical_to_cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', cov_matrix_alpha_delta_distance=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', center=None, center_name='solar_position', radians=False)[source]

Convert spherical to cartesian coordinates.

Parameters:
  • alpha
  • delta – polar angle, ranging from -90 (south pole) to 90 (north pole)
  • distance – radial distance, determines the units of x, y and z
  • xname
  • yname
  • zname
  • cov_matrix_alpha_delta_distance – List all covariance values as a double list of expressions, or "full" to guess all entries (which gives an error when values are not found), or "auto" to guess, but allow for missing values
  • covariance_postfix
  • uncertainty_postfix
  • center
  • center_name
  • radians
Returns:

bins(expression, limits, shape=128, edges=True)[source]
byte_size(selection=False)[source]

Return the size in bytes the whole dataset requires (or the selection), respecting the active_fraction

classmethod can_open(path, *args, **kwargs)[source]

Tests if this class can open the file given by path

cat(i1, i2)[source]
close_files()[source]

Close any open file handles; the dataset will not be in a usable state afterwards

col

Gives direct access to the data as numpy-like arrays.

Convenient when working with ipython in combination with small datasets, since this gives tab-completion

Columns can be accessed by their names, which are attributes. The attributes are currently strings, so you cannot do computations with them

Example:
>>> ds = vx.example()
>>> ds.plot(ds.col.x, ds.col.y)
column_count()[source]

Returns the number of columns, not counting virtual ones

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace
  • dimension – if given, generates a subspace with all possible combinations for that dimension
  • exclude – list of
copy_metadata(other)[source]
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, async=False, progress=None)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby

Examples:

>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

count(expression=None, binby=[], limits=None, shape=128, selection=False, async=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”)

Examples:

>>> ds.count()
330000.0
>>> ds.count("*")
330000.0
>>> ds.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
  • edges – Currently for internal use only (it includes nan's and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby

Either x and y are expressions, e.g.:

>>> ds.cov("x", "y")

Or only the x argument is given with a list of expressions, e.g.:

>>> ds.cov(["x", "y", "z"])

Examples:

>>> ds.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> ds.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> ds.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])

Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – if previous argument is not a list, this argument should be given
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimensions are of shape (2,2)
covar(x, y, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby

Examples:

>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")/(ds.std("x**2+y**2+z**2") * ds.std("-log(-E+1)"))
0.63666373822156686
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

delete_variable(name)[source]

Deletes a variable from a dataset

delete_virtual_column(name)[source]

Deletes a virtual column from a dataset

dtype(expression)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2.

Parameters:
  • expression (str) – Name/expression to evaluate
  • i1 (int) – Start row index, default is the start (0)
  • i2 (int) – End row index, default is the length of the dataset
  • out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
  • selection – selection to apply
Returns:
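
For example, to evaluate only the first part of a column (a sketch):

>>> x_part = ds.evaluate("x", i1=0, i2=100)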

evaluate_selection_mask(name='default', i1=None, i2=None, selection=None)[source]
evaluate_variable(name)[source]

Evaluates the variable given by name

full_length()[source]

The full length of the dataset, independent of what active_fraction is

get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows

get_active_range()[source]
get_auto_fraction()[source]
get_column_names(virtual=False, hidden=False, strings=False)[source]

Return a list of column names

Parameters:
  • virtual – If True, also return virtual columns
  • hidden – If True, also return hidden columns
Return type:

list of str

get_current_row()[source]

Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked

classmethod get_options(path)[source]
get_private_dir(create=False)[source]

Each dataset has a directory where files are stored for metadata etc.

Example:
>>> import vaex as vx
>>> ds = vx.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/datasets/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters:
  • create (bool) – if True, it will create the directory if it does not exist
get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm)

get_variable(name)[source]

Returns the variable given by name, it will not evaluate it.

For evaluation, see Dataset.evaluate_variable(), see also Dataset.set_variable()

has_current_row()[source]

Returns True/False whether there currently is a picked row

has_selection(name='default')[source]

Returns True if there is a selection

head(n=10)[source]
head_and_tail(n=10)[source]
healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, async=False, progress=None, selection=None)[source]

Count non-missing values for expression on an array which represents healpix data.

Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
  • binby – {binby}, these dimension follow the first healpix dimension.
  • limits – {limits}
  • shape – {shape}
  • selection – {selection}
  • async – {async}
  • progress – {progress}
Returns:

healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0))[source]
Parameters:
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
  • what – {what}
  • selection – {selection}
  • grid – {grid}
  • healpix_input – Specify whether the healpix index is in “equatorial”, “galactic” or “ecliptic”.
  • healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
  • f – function to apply to the data
  • colormap – matplotlib colormap
  • grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max). Default is [min(f(grid)), max(f(grid))].
  • image_size – size for the image that healpy uses for rendering
  • nest – If the healpix data is in nested (True) or ring (False)
  • figsize – If given, modify the matplotlib figure size. Example (14,9)
  • interactive – (Experimental, uses healpy.mollzoom if True)
  • title – Title of figure
  • smooth – apply gaussian smoothing, in degrees
  • show – Call matplotlib’s show (True) or not (False, default)
  • rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are degrees.
Returns:

info(description=True)[source]
is_local()[source]

Returns True if the dataset is a local dataset, False when a remote dataset

is_masked(column)[source]
label(expression, unit=None, output_unit=None, format='latex_inline')[source]
limits(expression, value=None, square=False, selection=None, async=False)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> ds.limits("x")
array([-28.86381927,  28.9261226 ])
>>> ds.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> ds.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> ds.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> ds.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • value – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], .... ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage(expression, percentage=99.73, square=False, async=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

>>> ds.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> ds.percentile_approx("x", 5), ds.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))

NOTE: this value is approximated by calculating the cumulative distribution on a grid. NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code

Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • percentage (float) – Value between 0 and 100
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], .... ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

map_reduce(map, reduce, arguments, async=False)[source]
max(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the maximum for the given expressions, possibly on a grid defined by binby

Example:

>>> ds.max("x")
array(271.365997)
>>> ds.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> ds.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

mean(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Examples:

>>> ds.mean("x")
-0.067131491264005971
>>> ds.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, async=False)[source]

Calculate the median, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

min(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the minimum for the given expressions, possibly on a grid defined by binby

Example:

>>> ds.min("x")
array(-128.293991)
>>> ds.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> ds.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

minmax(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby

Example:

>>> ds.minmax("x")
array([-128.293991,  271.365997])
>>> ds.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
       [ -71.5523682,  146.465836 ]])
>>> ds.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
       [-5.99972439, -2.00002384],
       [-1.99991322,  1.99998057],
       [ 2.0000093 ,  5.99983597],
       [ 6.0004878 ,  9.99984646]])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]
mutual_information(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, async=False)[source]

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order

Examples:

>>> ds.mutual_information("x", "y")
array(0.1511814526380327)
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • mi_limits – description for the min and max values for the expressions on the mutual information grid, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • mi_shape – shape for the mutual information grid, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic,

nearest_bin(value, limits, shape)[source]
classmethod option_to_args(option)[source]
percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, async=False)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

>>> ds.percentile_approx("x", 10), ds.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> ds.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
       [-3.61036641],
       [-0.01296306],
       [ 3.56697863],
       [ 7.45838367]])

Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

plot(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'y': 'y', 'layer': 'z', 'fade': 'selection', 'column': 'what', 'x': 'x', 'row': 'subspace'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)[source]

Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers

Instead of passing x and y, pass a list as the x argument for multiple panels. Give what a list of options to have multiple panels. When both are present they will be organized in a column/row order.

This method creates a 6 dimensional ‘grid’, where each dimension can map to a visual dimension. The grid dimensions are:

  • x: shape determined by shape, content by x argument or the first dimension of each space
  • y: ,,
  • z: related to the z argument
  • selection: shape equals length of selection argument
  • what: shape equals length of what argument
  • space: shape equals length of x argument if multiple values are given

By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)

The visual dimensions are

  • x: x coordinate on a plot / image (default maps to grid’s x)
  • y: y ,, (default maps to grid’s y)
  • layer: each image in this dimension is blended together into one image (default maps to z)
  • fade: each image is shown faded after the next image (default maps to selection)
  • row: rows of subplots (default maps to space)
  • columns: columns of subplot (default maps to what)

All these mappings can be changed by the visual argument, some examples:

>>> ds.plot('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])

Will plot each ‘what’ as a column

>>> ds.plot('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))

Will plot each selection as a column, instead of faded on top of each other.

Parameters:
  • x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [[‘x’, ‘y’], [‘x’, ‘z’]], if multiple pairs are given, this dimension maps to rows by default
  • y – y (by default maps to y)
  • z – Expression to bin in the z direction, followed by a :start,end,shape signature, like ‘FeH:-3,1:5’ will produce 5 layers between -3 and 1 (by default maps to layer)
  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like ['count(x)', 'std(vx)'] (by default maps to column)
  • reduce
  • f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
  • normalize – normalization function, currently only ‘normalize’ is supported
  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.
  • vmin – instead of automatic normalization, (using normalize and normalization_axis) scale the data between vmin and vmax to [0, 1]
  • vmax – see vmin
  • shape – shape/size of the n-D histogram grid
  • limits – list of [[xmin, xmax], [ymin, ymax]], or a description such as ‘minmax’, ‘99%’
  • grid – if you have already done the binning yourself, you can pass it
  • colormap – matplotlib colormap to use
  • figsize – (x, y) tuple passed to pylab.figure for setting the figure size
  • xlabel
  • ylabel
  • aspect
  • tight_layout – call pylab.tight_layout or not
  • colorbar – plot a colorbar or not
  • interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more
  • return_extra
Returns:

plot1d(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, **kwargs)[source]
Parameters:
  • x – Expression to bin in the x direction
  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
  • grid – if you have already done the binning yourself, you can pass it
  • facet – Expression to produce facetted plots (facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
  • limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’
  • figsize – (x, y) tuple passed to pylab.figure for setting the figure size
  • f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
  • n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.
  • xlabel – String for label on x axis (may contain latex)
  • ylabel – Same for y axis
  • tight_layout – call pylab.tight_layout or not
  • kwargs – extra argument passed to pylab.plot
Returns:
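
A minimal sketch:

>>> ds.plot1d("x", limits='99.7%', show=True)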

plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at own risk, requires ipyvolume

plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]
plot_widget(x, y, z=None, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, backend='bqplot', **kwargs)[source]
remove_virtual_meta()[source]

Removes the file with the virtual column etc, it does not change the current virtual columns etc

rename_column(name, new_name)[source]

Renames a column; note this is only the in-memory name, it will not be reflected on disk

scatter(x, y, xerr=None, yerr=None, s_expr=None, c_expr=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, **kwargs)[source]

Convenience wrapper around pylab.scatter for working with small datasets or selections

Parameters:
  • x – Expression for x axis
  • y – Idem for y
  • s_expr – When given, use it for the s (size) argument of pylab.scatter
  • c_expr – When given, use it for the c (color) argument of pylab.scatter
  • selection – Single selection expression, or None
  • length_limit – maximum number of rows it will plot
  • length_check – should we do the maximum row check or not?
  • xlabel – label for x axis, if None .label(x) is used
  • ylabel – label for y axis, if None .label(y) is used
  • errorbar_kwargs – extra dict with arguments passed to plt.errorbar
  • kwargs – extra arguments passed to pylab.scatter
Returns:

select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode

Selections are recorded in a history tree, per name; undo/redo can be done for them separately

Parameters:
  • boolean_expression (str) – Any valid column expression, with comparison operators
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) – history tree or selection ‘slot’ to use
  • executor
Returns:
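
A sketch combining two selections:

>>> ds.select("x > 0")
>>> ds.select("y > 0", mode="and")
>>> ds.selected_length()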

select_box(spaces, limits, mode='replace')[source]

Select a n-dimensional rectangular box bounded by limits

The following examples are equivalent:

>>> ds.select_box(["x", "y"], [(0, 10), (0, 1)])
>>> ds.select_rectangle("x", "y", [(0, 10), (0, 1)])

Parameters:
  • spaces – list of expressions
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
Returns:

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters:
  • name (str) –
  • executor
Returns:

select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

For performance reasons, a lasso selection is handled differently.

Parameters:
  • expression_x (str) – Name/expression for the x coordinate
  • expression_y (str) – Name/expression for the y coordinate
  • xsequence – list of x numbers defining the lasso, together with y
  • ysequence
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) –
  • executor
Returns:

select_nothing(name='default')[source]

Select nothing

select_rectangle(x, y, limits, mode='replace')[source]

Select a 2d rectangular box in the space given by x and y, bounded by limits

Example:
>>> ds.select_rectangle("x", "y", [(0, 10), (0, 1)])

Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
Returns:

selected_length()[source]

Returns the number of rows that are selected

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_favorite_add(name, selection_name='default')[source]
selection_favorite_apply(name, selection_name='default', executor=None)[source]
selection_favorite_remove(name)[source]
selection_redo(name='default', executor=None)[source]

Redo selection, for the name

selection_undo(name='default', executor=None)[source]

Undo selection, for the name

selections_favorite_load()[source]
selections_favorite_store()[source]
set_active_fraction(value)[source]

Sets the active_fraction, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active range (i1, i2), sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_auto_fraction(enabled)[source]
set_current_row(value)[source]

Set the current row, and emit the signal signal_pick

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters:
  • selection – Selection object
  • name – selection ‘slot’
  • executor
Returns:

set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value

Example:
>>> ds.set_variable("a", 2.)
>>> ds.set_variable("b", "a**2")
>>> ds.get_variable("b")
'a**2'
>>> ds.evaluate_variable("b")
4.0
Parameters:
  • name – Name of the variable
  • write – write variable to meta file
  • expression_or_value – value or expression
std(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> ds.std("vz")
110.31773397535071
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

subspace(*expressions, **kwargs)[source]

Return a Subspace for this dataset with the given expressions:

Example:

>>> subspace_xy = some_dataset("x", "y")
Parameters:
  • expressions (list[str]) – list of expressions
  • kwargs
Return type:

Subspace

subspaces(expressions_list=None, dimensions=None, exclude=None, **kwargs)[source]

Generate a Subspaces object, based on a custom list of expressions or on all possible combinations for a given dimension

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace
  • dimensions – if given, generates a subspace with all possible combinations for that dimension
  • exclude – list of expressions to exclude
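
For example, a sketch of both call styles (the column names are illustrative):

>>> subspaces = ds.subspaces([["x", "y"], ["x", "z"]])  # explicit list of subspaces
>>> pairs = ds.subspaces(dimensions=2)                  # all 2d combinations of columns
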
sum(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the sum for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.sum("L")
304054882.49378014
>>> ds.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
         1.40008776e+08])
Parameters:
  • expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape of the grid on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

tail(n=10)[source]
to_astropy_table(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]

Returns an astropy Table object containing the ndarrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
  • index – if given, this column is used for the index of the table
Returns:

astropy.table.Table object
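
A minimal usage sketch (the column names are illustrative):

>>> table = ds.to_astropy_table(column_names=["x", "y"])
>>> table.colnames
['x', 'y']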

to_copy(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a copy of the Dataset; if selection is None, it does not copy the data, it just keeps a reference to it

Parameters:
  • column_names – list of column names to copy; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
Returns:

Dataset

to_dict(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
  • column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
Returns:

dict
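
A minimal usage sketch (the column names are illustrative):

>>> data = ds.to_dict(column_names=["x", "y"])
>>> sorted(data.keys())
['x', 'y']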

to_items(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a list of [(column_name, ndarray), ...] pairs, where the ndarray corresponds to the evaluated data

Parameters:
  • column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
Returns:

list of (name, ndarray) pairs

to_pandas_df(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index_name is given, that column is used for the index of the dataframe.

Example:
>>> df = ds.to_pandas_df(["x", "y", "z"])
>>> ds_copy = vx.from_pandas(df)
Parameters:
  • column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
  • index_name – if this column is given it is used for the index of the DataFrame
Returns:

pandas.DataFrame object

ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd

Prefixed with a ^, it will only match the first part of the ucd

Example:
>>> dataset.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> dataset.ucd_find('pos.eq.ra', 'doesnotexist')
>>> dataset.ucds[dataset.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> dataset.ucd_find('meta.main')
'dec'
>>> dataset.ucd_find('^meta.main')
unit(expression, default=None)[source]

Returns the unit (an astropy.units.Unit object) for the expression

Example:
>>> import vaex as vx
>>> ds = vx.example()
>>> ds.unit("x")
Unit("kpc")
>>> ds.unit("x*L")
Unit("km kpc2 / s")
Parameters:
  • expression – Expression, which can be a column name
  • default – if no unit is known, it will return this
Returns:

The resulting unit of the expression

Return type:

astropy.units.Unit

update_meta()[source]

Will read back the ucds, descriptions, units, etc. written by Dataset.write_meta(). This is done when opening a dataset.

update_virtual_meta()[source]

Will read back the virtual columns, etc. written by Dataset.write_virtual_meta(). This is done when opening a dataset.

validate_expression(expression)[source]

Validate an expression (may raise an Exception)
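
For example (the failing column name is hypothetical):

>>> ds.validate_expression("x**2 + y**2")       # ok, no exception
>>> ds.validate_expression("x + doesnotexist")  # raises an exception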

var(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.var("vz")
12170.002429456246
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape of the grid on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
  • selection – Name of the selection to use (or True for the 'default' selection), all the data (when selection is None or False), or a list of selections
  • async – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

write_meta()[source]

Writes all meta data: ucds, descriptions and units

The default implementation is to write this to a file called meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself (for instance, the vaex hdf5 implementation does this).

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

write_virtual_meta()[source]

Writes virtual columns, variables and their ucd,description and units

The default implementation is to write this to a file called virtual_meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself.

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_virtual_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

class vaex.dataset.DatasetLocal(name, path, column_names)[source]

Bases: vaex.dataset.Dataset

Base class for datasets that work with local file/data

compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]

Compare two datasets and report their differences; use with care for large datasets

concat(other)[source]

Concatenates two datasets, adding the rows of the other dataset to the current one; the result is returned as a new dataset.

No copy of the data is made.

Parameters:other – The other dataset that is concatenated with this dataset
Returns:New dataset with the rows concatenated
Return type:DatasetConcatenated
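
A sketch of concatenation, assuming ds1 and ds2 are two local datasets with matching columns:

>>> ds_combined = ds1.concat(ds2)
>>> len(ds_combined) == len(ds1) + len(ds2)
True
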
data

Gives direct access to the data as numpy arrays.

Convenient when working with IPython in combination with small datasets, since this gives tab completion. Only real (i.e. non-virtual) columns can be accessed; to get the data from virtual columns, use Dataset.evaluate(...)

Columns can be accessed by their names, which are attributes of type numpy.ndarray.

Example:
>>> import numpy as np
>>> ds = vx.example()
>>> r = np.sqrt(ds.data.x**2 + ds.data.y**2)
echo(arg)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

The local implementation of Dataset.evaluate()

export_fits(path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the dataset to a FITS file compatible with the TOPCAT colfits format

Parameters:
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export, or None for all columns
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue; use progress=True for a default progress bar
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending (False)
  • virtual (bool) – when True, export virtual columns

Returns:

export_hdf5(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the dataset to a vaex hdf5 file

Parameters:
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export, or None for all columns
  • byteorder (str) – '=' for native, '<' for little endian and '>' for big endian
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue; use progress=True for a default progress bar
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending (False)
  • virtual (bool) – when True, export virtual columns

Returns:
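
A minimal usage sketch (the output path and options are illustrative):

>>> ds.export_hdf5("/tmp/output.hdf5", shuffle=True, progress=True)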

is_local()[source]

The local implementation of Dataset.is_local(); always returns True

length(selection=False)[source]

Get the length of the dataset, for the selection or for the whole dataset.

If selection is False, it returns len(dataset).

TODO: Implement this in DatasetRemote, and move the method up in Dataset.length()

Parameters:selection – When True, will return the number of selected rows
Returns:
selected_length(selection='default')[source]

The local implementation of Dataset.selected_length()

shallow_copy(virtual=True, variables=True)[source]

Creates a (shallow) copy of the dataset

It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc
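
A sketch showing that the copy's state is independent (the virtual column is illustrative):

>>> ds2 = ds.shallow_copy()
>>> ds2.add_virtual_column("r", "sqrt(x**2+y**2)")  # ds itself is unaffected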

class vaex.dataset.DatasetConcatenated(datasets, name=None)[source]

Bases: vaex.dataset.DatasetLocal

Represents a set of datasets all concatenated. See DatasetLocal.concat() for usage.

is_masked(column)[source]
class vaex.dataset.DatasetArrays(name='arrays')[source]

Bases: vaex.dataset.DatasetLocal

Represent an in-memory dataset of numpy arrays, see from_arrays() for usage.

add_column(name, data)[source]

Add a column to the dataset

Parameters:
  • name (str) – name of column
  • data – numpy array with the data
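
A minimal usage sketch (assuming import vaex as vx, as elsewhere in this document):

>>> import numpy as np
>>> ds = vx.dataset.DatasetArrays("mydata")
>>> ds.add_column("x", np.arange(10))
>>> ds.add_column("y", np.arange(10) ** 2)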

vaex.events module

class vaex.events.Signal(name=None)[source]

Bases: object

connect(callback, prepend=False, *args, **kwargs)[source]
disconnect(callback)[source]
emit(*args, **kwargs)[source]
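
A sketch of typical usage, assuming connect() registers the callback and emit() forwards its arguments to it:

>>> signal = vaex.events.Signal("demo")
>>> def on_event(value):
...     print("received %s" % value)
...
>>> handler = signal.connect(on_event)
>>> signal.emit(42)
received 42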

vaex.execution module

class vaex.execution.Column[source]

Bases: vaex.execution.Column

needs_copy()[source]
class vaex.execution.Executor(thread_pool=None, buffer_size=None, thread_mover=None)[source]

Bases: object

execute()[source]
execute_threaded()[source]
run(task)[source]
schedule(task)[source]
class vaex.execution.Job(task, order)[source]

Bases: object

exception vaex.execution.UserAbort(reason)[source]

Bases: exceptions.Exception

vaex.grids module

class vaex.grids.GridScope(locals=None, globals=None)[source]

Bases: object

add_lazy(key, f)[source]
cumulative(array, normalize=True)[source]
disjoined()[source]
evaluate(expression)[source]
marginal2d(i, j)[source]
normalize(array)[source]
setter(key)[source]
slice(slice)[source]
vaex.grids.add_mem(bytes, *info)[source]
vaex.grids.dog(grid, sigma1, sigma2)[source]
vaex.grids.gf(grid, sigma, **kwargs)[source]
vaex.grids.grid_average(scope, counts_name='counts', weighted_name='weighted')[source]

vaex.kld module

class vaex.kld.KlDivergenceShuffle(dataset, pairs, gridsize=128)[source]

Bases: object

get_jobs()[source]
vaex.kld.kl_divergence(P, Q, axis=None)[source]
vaex.kld.kld_shuffled(columns, Ngrid=128, datamins=None, datamaxes=None, offset=1)[source]
vaex.kld.kld_shuffled_grouped(dataset, range_map, pairs, feedback=None, size_grid=32, use_mask=True, bytes_max=536870912)[source]
vaex.kld.mutual_information(data)[source]
vaex.kld.to_disjoined(counts)[source]

vaex.multithreading module

class vaex.multithreading.MiniJob(callable, queue_out, args)[source]

Bases: object

cancel()[source]
class vaex.multithreading.ThreadPool(nthreads=4)[source]

Bases: object

close()[source]
execute(index)[source]
run_blocks(callable, total_length)[source]
run_parallel(callable, args_list=[])[source]
class vaex.multithreading.ThreadPoolIndex(nthreads=None)[source]

Bases: object

close()[source]
execute(index)[source]
map(callable, iterator, on_error=None, progress=None, cancel=None)[source]
run_blocks(callable, total_length, parts=10, on_error=None)[source]
vaex.multithreading.get_main_pool()[source]

vaex.quick module

vaex.remote module

class vaex.remote.DatasetRemote(name, server, column_names)[source]

Bases: vaex.dataset.Dataset

class vaex.remote.DatasetRest(server, name, column_names, dtypes, ucds, descriptions, units, description, full_length, virtual_columns=None)[source]

Bases: vaex.remote.DatasetRemote

correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, async=False, progress=None)[source]
count(expression=None, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
cov(x, y=None, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
dtype(expression)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None, async=False)[source]

Basic support for evaluating at the server, at least to run some unit tests; do not expect this to work from strings

is_local()[source]
mean(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
minmax(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
sum(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
var(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
class vaex.remote.ServerExecutor[source]

Bases: object

execute()[source]
class vaex.remote.ServerRest(hostname, port=5000, base_path='/', background=False, thread_mover=None, websocket=True)[source]

Bases: object

close()[source]
datasets(as_dict=False, async=False)[source]
submit_http(path, arguments, post_process, async, progress=None, **kwargs)[source]
submit_websocket(path, arguments, async=False, progress=None, post_process=<function <lambda>>)[source]
wait()[source]
class vaex.remote.SubspaceRemote(dataset, expressions, executor, async, masked=False)[source]

Bases: vaex.legacy.Subspace

correlation(means=None, vars=None)[source]
dimension
histogram(limits, size=256, weight=None)[source]
limits_sigma(sigmas=3, square=False)[source]
mean()[source]
minmax()[source]
mutual_information(limits=None, size=256)[source]
nearest(point, metric=None)[source]
sleep(seconds, async=False)[source]
sum()[source]
toarray(list)[source]
var(means=None)[source]
class vaex.remote.TaskServer(post_process, async)[source]

Bases: vaex.dataset.Task

execute()[source]
schedule(task)[source]
vaex.remote.listify(value)[source]
vaex.remote.wrap_future_with_promise(future)[source]

vaex.samp module

class vaex.samp.Samp(daemon=True, name=None)[source]

Bases: object

class vaex.samp.SampSingle(name='vaex - single table load')[source]

Bases: object

wait_for_table()[source]
vaex.samp.ask_cmd_line(username, password)[source]
vaex.samp.fetch_votable(url, username=None, password=None, ask=<function ask_cmd_line>)[source]
vaex.samp.single_table(username=None, password=None)[source]

vaex.settings module

class vaex.settings.AutoStoreDict(settings, store)[source]

Bases: _abcoll.MutableMapping

class vaex.settings.Files(open, recent)[source]

Bases: object

class vaex.settings.Settings(filename)[source]

Bases: object

auto_store_dict(key)[source]
dump()[source]
get(key, default=None)[source]
store(key, value)[source]

vaex.utils module

class vaex.utils.AttrDict(*args, **kwargs)[source]

Bases: dict

class vaex.utils.CpuUsage(format='CPU Usage: %(cpu_usage)s%%', usage_format='% 5d')[source]

Bases: progressbar.widgets.FormatWidgetMixin, progressbar.widgets.TimeSensitiveWidgetBase

class vaex.utils.Timer(name=None, logger=None)[source]

Bases: object

vaex.utils.check_memory_usage(bytes_needed, confirm)[source]
vaex.utils.confirm_on_console(topic, msg)[source]
vaex.utils.dict_constructor(loader, node)[source]
vaex.utils.dict_representer(dumper, data)[source]
vaex.utils.disjoined(data)[source]
vaex.utils.ensure_string(string_or_bytes, encoding='utf-8', cast=False)[source]
vaex.utils.filename_shorten(path, max_length=150)[source]
vaex.utils.filesize_format(value)[source]
vaex.utils.get_data_file(filename)[source]
vaex.utils.get_private_dir(subdir=None)[source]
vaex.utils.get_root_path()[source]
vaex.utils.linspace_centers(start, stop, N)[source]
vaex.utils.listify(*args)[source]
vaex.utils.make_list(sequence)[source]
vaex.utils.multisum(a, axes)[source]
vaex.utils.os_open(document)[source]

Open a document with the default handler of the OS; e.g. a url is opened by a browser, a text file by an editor, etc.

vaex.utils.progressbar(name='processing', max_value=1)[source]
vaex.utils.progressbar_callable(name='processing', max_value=1)[source]
vaex.utils.progressbars(f=True, next=None, name=None)[source]
vaex.utils.read_json_or_yaml(filename)[source]
vaex.utils.subdivide(length, parts=None, max_length=None)[source]

Generates a list with start and stop indices of length parts, [(0, length/parts), ..., (.., length)]
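
For example (output not shown; the exact boundaries depend on the rounding used):

>>> list(vaex.utils.subdivide(100, parts=4))  # four consecutive (start, stop) pairs covering 0..100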

vaex.utils.submit_subdivide(thread_count, f, length, max_length)[source]
vaex.utils.unlistify(waslist, *args)[source]
vaex.utils.write_json_or_yaml(filename, data)[source]
vaex.utils.yaml_dump(f, data)[source]
vaex.utils.yaml_load(f)[source]