Contents

Installing

Warning

It is recommended not to install directly into your operating system’s Python using sudo since it may break your system. Instead, you should install Anaconda, which is a Python distribution that makes installing Python packages much easier or use virtualenv or venv.

Short version

  • Anaconda users: conda install -c conda-forge vaex

  • Regular Python users using virtualenv: pip install vaex

  • Regular Python users (not recommended): pip install --user vaex

  • System install (not recommended): sudo pip install vaex

Longer version

If you don’t want all packages installed, do not install the vaex package. The vaex package is a meta packages that depends on all other vaex packages so it will instal them all, but if you don’t need astronomy related parts (vaex-astro), or don’t care about graphql (vaex-graphql), you can leave out those packages. Copy paste the following lines and remove what you do not need:

  • Regular Python users: pip install vaex-core vaex-viz vaex-jupyter vaex-server vaex-hdf5 vaex-astro vaex-ml

  • Anaconda users: conda install -c conda-forge vaex-core vaex-viz vaex-jupyter vaex-server vaex-hdf5 vaex-astro vaex-ml

For developers

If you want to work on vaex for a Pull Request from the source, use the following recipe:

  • git clone --recursive https://github.com/vaexio/vaex # make sure you get the submodules

  • cd vaex

  • make sure the dev versions of pcre are installed (e.g. conda install -c conda-forge pcre)

  • install using (note: if you’re on Windows, make sure that your command line/terminal has administrator privileges):

  • make init or pip install -e ".[dev]" (again, use (ana)conda or virtualenv/venv)

  • If you want to do a PR

  • git remote rename origin upstream

  • (now fork on github)

  • git remote add origin https://github.com/yourusername/vaex/

  • … edit code … (or do this after the next step)

  • git checkout -b feature_X

  • git commit -a -m "new: some feature X"

  • git push origin feature_X

  • git checkout master

  • Get your code in sync with upstream

  • git checkout master

  • git fetch upstream

  • git merge upstream/master

Tutorials

Vaex introduction in 11 minutes

Because vaex goes up to 11

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

DataFrame

Central to Vaex is the DataFrame (similar, but more efficient than a Pandas DataFrame), and we often use the variable df to represent it. A DataFrame is an efficient representation for a large tabular dataset, and has:

  • A number of columns, say x, y and z, which are:

  • Backed by a Numpy array;

  • Wrapped by an expression system e.g. df.x, df['x'] or df.col.x is an Expression;

  • Columns/expression can perform lazy computations, e.g. df.x * np.sin(df.y) does nothing, until the result is needed.

  • A set of virtual columns, columns that are backed by a (lazy) computation, e.g. df['r'] = df.x/df.y

  • A set of selections, that can be used to explore the dataset, e.g. df.select(df.x < 0)

  • Filtered DataFrames, that does not copy the data, df_negative = df[df.x < 0]

Lets start with an example dataset, which is included in Vaex.

[1]:
import vaex
df = vaex.example()
df  # Since this is the last statement in a cell, it will print the DataFrame in a nice HTML format.
[1]:
# id x y z vx vy vz E L Lz FeH
0 0 1.2318683862686157 -0.39692866802215576-0.598057746887207 301.1552734375 174.05947875976562 27.42754554748535 -149431.40625 407.38897705078125333.9555358886719 -1.0053852796554565
1 23 -0.163700610399246223.654221296310425 -0.25490644574165344-195.00022888183594170.47216796875 142.5302276611328 -124247.953125890.2411499023438 684.6676025390625 -1.7086670398712158
2 32 -2.120255947113037 3.326052665710449 1.7078403234481812 -48.63423156738281 171.6472930908203 -2.079437255859375 -138500.546875372.2410888671875 -202.17617797851562-1.8336141109466553
3 8 4.7155890464782715 4.5852508544921875 2.2515437602996826 -232.42083740234375-294.850830078125 62.85865020751953 -60037.03906251297.63037109375 -324.6875 -1.4786882400512695
4 16 7.21718692779541 11.99471664428711 -1.064562201499939 -1.6891745328903198181.329345703125 -11.333610534667969-83206.84375 1332.79895019531251328.948974609375 -1.8570483922958374
... ... ... ... ... ... ... ... ... ... ... ...
329,99521 1.9938701391220093 0.789276123046875 0.22205990552902222 -216.9299011230468816.124420166015625 -211.244384765625 -146457.4375 457.72247314453125203.36758422851562 -1.7451677322387695
329,99625 3.7180912494659424 0.721337616443634 1.6415337324142456 -185.92160034179688-117.25082397460938-105.4986572265625 -126627.109375335.0025634765625 -301.8370056152344 -0.9822322130203247
329,99714 0.3688507676124573 13.029608726501465 -3.633934736251831 -53.677146911621094-145.15771484375 76.70909881591797 -84912.2578125817.1375732421875 645.8507080078125 -1.7645612955093384
329,99818 -0.112592644989490511.4529125690460205 2.168952703475952 179.30865478515625 205.79710388183594 -68.75872802734375 -133498.46875 724.000244140625 -283.6910400390625 -1.8808952569961548
329,9994 20.796220779418945 -3.331387758255005 12.18841552734375 42.69000244140625 69.20479583740234 29.54275131225586 -65519.328125 1843.07470703125 1581.4151611328125 -1.1231083869934082

Columns

The above preview shows that this dataset contains \(> 300,000\) rows, and columns named x ,y, z (positions), vx, vy, vz (velocities), E (energy), L (angular momentum), and an id (subgroup of samples). When we print out a column, we can see that it is not a Numpy array, but an Expression.

[2]:
df.x  # df.col.x or df['x'] are equivalent, but df.x may be preferred because it is more tab completion friendly or programming friendly respectively
[2]:
Expression = x
Length: 330,000 dtype: float32 (column)
---------------------------------------
     0    1.23187
     1  -0.163701
     2   -2.12026
     3    4.71559
     4    7.21719
       ...
329995    1.99387
329996    3.71809
329997   0.368851
329998  -0.112593
329999    20.7962

One can use the .values method to get an in-memory representation of an expression. The same method can be applied to a DataFrame as well.

[3]:
df.x.values
[3]:
array([ 1.2318684 , -0.16370061, -2.120256  , ...,  0.36885077,
       -0.11259264, 20.79622   ], dtype=float32)

Most Numpy functions (ufuncs) can be performed on expressions, and will not result in a direct result, but in a new expression.

[4]:
import numpy as np
np.sqrt(df.x**2 + df.y**2 + df.z**2)
[4]:
Expression = sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
     0  1.42574
     1  3.66676
     2  4.29824
     3  6.95203
     4   14.039
      ...
329995  2.15587
329996  4.12785
329997  13.5319
329998  2.61304
329999  24.3339

Virtual columns

Sometimes it is convenient to store an expression as a column. We call this a virtual column since it does not take up any memory, and is computed on the fly when needed. A virtual column is treated just as a normal column.

[5]:
df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
df[['x', 'y', 'z', 'r']]
[5]:
# x y z r
0 1.2318683862686157 -0.39692866802215576-0.598057746887207 1.425736665725708
1 -0.163700610399246223.654221296310425 -0.254906445741653443.666757345199585
2 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512
3 4.7155890464782715 4.5852508544921875 2.2515437602996826 6.952032566070557
4 7.21718692779541 11.99471664428711 -1.064562201499939 14.03902816772461
... ... ... ... ...
329,9951.9938701391220093 0.789276123046875 0.22205990552902222 2.155872344970703
329,9963.7180912494659424 0.721337616443634 1.6415337324142456 4.127851963043213
329,9970.3688507676124573 13.029608726501465 -3.633934736251831 13.531896591186523
329,998-0.112592644989490511.4529125690460205 2.168952703475952 2.613041877746582
329,99920.796220779418945 -3.331387758255005 12.18841552734375 24.333894729614258

Selections and filtering

Vaex can be efficient when exploring subsets of the data, for instance to remove outliers or to inspect only a part of the data. Instead of making copies, Vaex internally keeps track which rows are selected.

[6]:
df.select(df.x < 0)
df.evaluate(df.x, selection=True)
[6]:
array([-0.16370061, -2.120256  , -7.7843747 , ..., -8.126636  ,
       -3.9477386 , -0.11259264], dtype=float32)

Selections are useful when you frequently modify the portion of the data you want to visualize, or when you want to efficiently compute statistics on several portions of the data effectively.

Alternatively, you can also create filtered datasets. This is similar to using Pandas, except that Vaex does not copy the data.

[7]:
df_negative = df[df.x < 0]
df_negative[['x', 'y', 'z', 'r']]
[7]:
# x y z r
0 -0.163700610399246223.654221296310425 -0.254906445741653443.666757345199585
1 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512
2 -7.784374713897705 5.989774703979492 -0.682695209980011 9.845809936523438
3 -3.5571861267089844 5.413629055023193 0.09171556681394577 6.478376865386963
4 -20.813940048217773 -3.294677495956421 13.486607551574707 25.019264221191406
... ... ... ... ...
166,274-2.5926425457000732 -2.871671676635742 -0.180483341217041023.8730955123901367
166,275-0.7566012144088745 2.9830434322357178 -6.940553188323975 7.592250823974609
166,276-8.126635551452637 1.1619765758514404 -1.6459038257598877 8.372657775878906
166,277-3.9477386474609375 -3.0684902667999268-1.5822702646255493 5.244411468505859
166,278-0.112592644989490511.4529125690460205 2.168952703475952 2.613041877746582

Statistics on N-d grids

A core feature of Vaex is the extremely efficient calculation of statistics on N-dimensional grids. This is rather useful for making visualisations of large datasets.

[8]:
df.count(), df.mean(df.x), df.mean(df.x, selection=True)
[8]:
(array(330000), array(-0.0632868), array(-5.18457762))

Similar to SQL’s groupby, Vaex uses the binby concept, which tells Vaex that a statistic should be calculated on a regular grid (for performance reasons)

[9]:
counts_x = df.count(binby=df.x, limits=[-10, 10], shape=64)
counts_x
[9]:
array([1374, 1350, 1459, 1618, 1706, 1762, 1852, 2007, 2240, 2340, 2610,
       2840, 3126, 3337, 3570, 3812, 4216, 4434, 4730, 4975, 5332, 5800,
       6162, 6540, 6805, 7261, 7478, 7642, 7839, 8336, 8736, 8279, 8269,
       8824, 8217, 7978, 7541, 7383, 7116, 6836, 6447, 6220, 5864, 5408,
       4881, 4681, 4337, 4015, 3799, 3531, 3320, 3040, 2866, 2629, 2488,
       2244, 1981, 1905, 1734, 1540, 1437, 1378, 1233, 1186])

This results in a Numpy array with the number counts in 64 bins distributed between x = -10, and x = 10. We can quickly visualize this using Matplotlib.

[10]:
import matplotlib.pyplot as plt
plt.plot(np.linspace(-10, 10, 64), counts_x)
plt.show()
_images/tutorial_20_0.png

We can do the same in 2D as well (this can be generalized to N-D actually!), and display it with Matplotlib.

[11]:
xycounts = df.count(binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xycounts
[11]:
array([[ 5,  2,  3, ...,  3,  3,  0],
       [ 8,  4,  2, ...,  5,  3,  2],
       [ 5, 11,  7, ...,  3,  3,  1],
       ...,
       [ 4,  8,  5, ...,  2,  0,  2],
       [10,  6,  7, ...,  1,  1,  2],
       [ 6,  7,  9, ...,  2,  2,  2]])
[12]:
plt.imshow(xycounts.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
_images/tutorial_23_0.png
[13]:
v = np.sqrt(df.vx**2 + df.vy**2 + df.vz**2)
xy_mean_v = df.mean(v, binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xy_mean_v
[13]:
array([[156.15283203, 226.0004425 , 206.95940653, ...,  90.0340627 ,
        152.08784485,          nan],
       [203.81366634, 133.01436043, 146.95962524, ..., 137.54756927,
         98.68717448, 141.06020737],
       [150.59178772, 188.38820371, 137.46753802, ..., 155.96900177,
        148.91660563, 138.48191833],
       ...,
       [168.93819809, 187.75943136, 137.318647  , ..., 144.83927917,
                 nan, 107.7273407 ],
       [154.80492783, 140.55182203, 180.30700166, ..., 184.01670837,
         95.10913086, 131.18122864],
       [166.06868235, 150.54079764, 125.84606828, ..., 130.56007385,
        121.04217911, 113.34659195]])
[14]:
plt.imshow(xy_mean_v.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
_images/tutorial_25_0.png

Other statistics can be computed, such as:

Or see the full list at the API docs.

Getting your data in

Before continuing with this tutorial, you may want to read in your own data. Ultimately, a Vaex DataFrame just wraps a set of Numpy arrays. If you can access your data as a set of Numpy arrays, you can easily construct a DataFrame using from_arrays.

[15]:
import vaex
import numpy as np
x = np.arange(5)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df
[15]:
# x y
0 0 0
1 1 1
2 2 4
3 3 9
4 4 16

Other quick ways to get your data in are:

Exporting, or converting a DataFrame to a different datastructure is also quite easy:

Nowadays, it is common to put data, especially larger dataset, on the cloud. Vaex can read data straight from S3, in a lazy manner, meaning that only that data that is needed will be downloaded, and cached on disk.

[16]:
# Read in the NYC Taxi dataset straight from S3
nyctaxi = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
nyctaxi.head(5)
[16]:
# vendor_id pickup_datetime dropoff_datetime passenger_countpayment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0VTS 2009-01-04 02:52:00.0000000002009-01-04 03:02:00.000000000 1CASH 2.63 -73.992 40.7216 nan nan -73.9938 40.6959 8.9 0.5 nan 0 0 9.4
1VTS 2009-01-04 03:31:00.0000000002009-01-04 03:38:00.000000000 3Credit 4.55 -73.9821 40.7363 nan nan -73.9558 40.768 12.1 0.5 nan 2 0 14.6
2VTS 2009-01-03 15:43:00.0000000002009-01-03 15:57:00.000000000 5Credit 10.35 -74.0026 40.7397 nan nan -73.87 40.7702 23.7 0 nan 4.74 0 28.44
3DDS 2009-01-01 20:52:58.0000000002009-01-01 21:14:00.000000000 1CREDIT 5 -73.9743 40.791 nan nan -73.9966 40.7318 14.9 0.5 nan 3.05 0 18.45
4DDS 2009-01-24 16:18:23.0000000002009-01-24 16:24:56.000000000 1CASH 0.4 -74.0016 40.7194 nan nan -74.0084 40.7203 3.7 0 nan 0 0 3.7

Plotting

1-D and 2-D

Most visualizations are done in 1 or 2 dimensions, and Vaex nicely wraps Matplotlib to satisfy a variety of frequent use cases.

[17]:
import vaex
import numpy as np
df = vaex.example()

The simplest visualization is a 1-D plot using DataFrame.viz.histogram. Here, we only show 99.7% of the data.

[1]:
df.viz.histogram(df.x, limits='99.7%')
_images/tutorial_35_0.png

A slighly more complication visualization, is to plot not the counts, but a different statistic for that bin. In most cases, passing the what='<statistic>(<expression>) argument will do, where <statistic> is any of the statistics mentioned in the list above, or in the API docs.

[19]:
df.viz.histogram(df.x, what='mean(E)', limits='99.7%');
_images/tutorial_37_0.png

An equivalent method is to use the vaex.stat.<statistic> functions, e.g. vaex.stat.mean.

[20]:
df.viz.histogram(df.x, what=vaex.stat.mean(df.E), limits='99.7%');
_images/tutorial_39_0.png

The vaex.stat.<statistic> objects are very similar to Vaex expressions, in that they represent an underlying calculation. Typical arithmetic and Numpy functions can be applied to these calulations. However, these objects compute a single statistic, and do not return a column or expression.

[21]:
np.log(vaex.stat.mean(df.x)/vaex.stat.std(df.x))
[21]:
log((mean(x) / std(x)))

These statistical objects can be passed to the what argument. The advantage being that the data will only have to be passed over once.

[22]:
df.viz.histogram(df.x, what=np.clip(np.log(-vaex.stat.mean(df.E)), 11, 11.4), limits='99.7%');
_images/tutorial_43_0.png

A similar result can be obtained by calculating the statistic ourselves, and passing it to plot1d’s grid argument. Care has to be taken that the limits used for calculating the statistics and the plot are the same, otherwise the x axis may not correspond to the real data.

[3]:
limits = [-30, 30]
shape  = 64
meanE  = df.mean(df.E, binby=df.x, limits=limits, shape=shape)
grid   = np.clip(np.log(-meanE), 11, 11.4)
df.viz.histogram(df.x, grid=grid, limits=limits, ylabel='clipped E');
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb Cell 46' in <cell line: 3>()
      <a href='vscode-notebook-cell:/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb#ch0000045?line=0'>1</a> limits = [-30, 30]
      <a href='vscode-notebook-cell:/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb#ch0000045?line=1'>2</a> shape  = 64
----> <a href='vscode-notebook-cell:/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb#ch0000045?line=2'>3</a> meanE  = df.mean(df.E, binby=df.x, limits=limits, shape=shape)
      <a href='vscode-notebook-cell:/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb#ch0000045?line=3'>4</a> grid   = np.clip(np.log(-meanE), 11, 11.4)
      <a href='vscode-notebook-cell:/Users/nickcrews/Documents/projects/vaex/docs/source/tutorial.ipynb#ch0000045?line=4'>5</a> df.viz.histogram(df.x, grid=grid, limits=limits, ylabel='clipped E')

NameError: name 'df' is not defined

Instead of plotting the density across one dimension (a histogram), we can also plot the density across two dimensions. This is done with the DataFrame.viz.heatmap function. It shares many arguments and is very similar to the histogram.

[24]:
df.viz.heatmap(df.x, df.y, what=vaex.stat.mean(df.E)**2, limits='99.7%');
_images/tutorial_47_0.png

Selections for plotting

While filtering is useful for narrowing down the contents of a DataFrame (e.g. df_negative = df[df.x < 0]) there are a few downsides to this. First, a practical issue is that when you filter 4 different ways, you will need to have 4 different DataFrames polluting your namespace. More importantly, when Vaex executes a bunch of statistical computations, it will do that per DataFrame, meaning that 4 passes over the data will be made, and even though all 4 of those DataFrames point to the same underlying data.

If instead we have 4 (named) selections in our DataFrame, we can calculate statistics in one single pass over the data, which can be significantly faster especially in the cases when your dataset is larger than your memory.

In the plot below we show three selection, which by default are blended together, requiring just one pass over the data.

[25]:
df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), limits='99.7%',
        selection=[None, df.x < df.y, df.x < -10]);
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/image.py:113: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rgba_dest[:, :, c][[mask]] = np.clip(result[[mask]], 0, 1)
_images/tutorial_49_1.png

Advanced Plotting

Lets say we would like to see two plots next to eachother. To achieve this we can pass a list of expression pairs.

[26]:
df.viz.heatmap([["x", "y"], ["x", "z"]], limits='99.7%',
        title="Face on and edge on", figsize=(10,4));
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/viz/mpl.py:779: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  ax = plt.subplot(gs[row_offset + row * row_scale:row_offset + (row + 1) * row_scale, column * column_scale:(column + 1) * column_scale])
_images/tutorial_51_1.png

By default, if you have multiple plots, they are shown as columns, multiple selections are overplotted, and multiple ‘whats’ (statistics) are shown as rows.

[27]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=[np.log(vaex.stat.count()+1), vaex.stat.mean(df.E)],
        selection=[None, df.x < df.y],
        title="Face on and edge on", figsize=(10,10));
_images/tutorial_53_0.png

Note that the selection has no effect in the bottom rows.

However, this behaviour can be changed using the visual argument.

[28]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=vaex.stat.mean(df.E),
        selection=[None, df.Lz < 0],
        visual=dict(column='selection'),
        title="Face on and edge on", figsize=(10,10));
_images/tutorial_55_0.png

Slices in a 3rd dimension

If a 3rd axis (z) is given, you can ‘slice’ through the data, displaying the z slices as rows. Note that here the rows are wrapped, which can be changed using the wrap_columns argument.

[29]:
df.viz.heatmap("Lz", "E",
        limits='99.7%',
        z="FeH:-2.5,-1,8", show=True, visual=dict(row="z"),
        figsize=(12,8), f="log", wrap_columns=3);
_images/tutorial_57_0.png

Visualization of smaller datasets

Although Vaex focuses on large datasets, sometimes you end up with a fraction of the data (e.g. due to a selection) and you want to make a scatter plot. You can do so with the following approach:

[30]:
import vaex
df = vaex.example()
[31]:
import matplotlib.pyplot as plt
x = df.evaluate("x", selection=df.Lz < -2500)
y = df.evaluate("y", selection=df.Lz < -2500)
plt.scatter(x, y, c="red", alpha=0.5, s=4);
_images/tutorial_60_0.png

Using DataFrame.viz.scatter:

[32]:
df.viz.scatter(df.x, df.y, selection=df.Lz < -2500, c="red", alpha=0.5, s=4)
df.viz.scatter(df.x, df.y, selection=df.Lz > 1500, c="green", alpha=0.5, s=4);
_images/tutorial_62_0.png

In control

While Vaex provides a wrapper for Matplotlib, there are situations where you want to use the DataFrame.viz methods, but want to be in control of the plot. Vaex simply uses the current figure and axes objects, so that it is easy to do.

[33]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
plt.sca(ax1)
selection = df.Lz < -2500
x = df[selection].x.evaluate()#selection=selection)
y = df[selection].y.evaluate()#selection=selection)
df.viz.heatmap(df.x, df.y)
plt.scatter(x, y)
plt.xlabel('my own label $\gamma$')
plt.xlim(-20, 20)
plt.ylim(-20, 20)

plt.sca(ax2)
df.viz.histogram(df.x, label='counts', n=True)
x = np.linspace(-30, 30, 100)
std = df.std(df.x.expression)
y = np.exp(-(x**2/std**2/2)) / np.sqrt(2*np.pi) / std
plt.plot(x, y, label='gaussian fit')
plt.legend()
plt.show()
_images/tutorial_64_0.png

Healpix (Plotting)

Healpix plotting is supported via the healpy package. Vaex does not need special support for healpix, only for plotting, but some helper functions are introduced to make working with healpix easier.

In the following example we will use the TGAS astronomy dataset.

To understand healpix better, we will start from the beginning. If we want to make a density sky plot, we would like to pass healpy a 1D Numpy array where each value represents the density at a location of the sphere, where the location is determined by the array size (the healpix level) and the offset (the location). The TGAS (and Gaia) data includes the healpix index encoded in the source_id. By diving the source_id by 34359738368 you get a healpix index level 12, and diving it further will take you to lower levels.

[34]:
import vaex
import healpy as hp
tgas = vaex.datasets.tgas(full=True)

We will start showing how you could manually do statistics on healpix bins using vaex.count. We will do a really course healpix scheme (level 2).

[35]:
level = 2
factor = 34359738368 * (4**(12-level))
nmax = hp.nside2npix(2**level)
epsilon = 1e-16
counts = tgas.count(binby=tgas.source_id/factor, limits=[-epsilon, nmax-epsilon], shape=nmax)
counts
[35]:
array([ 4021,  6171,  5318,  7114,  5755, 13420, 12711, 10193,  7782,
       14187, 12578, 22038, 17313, 13064, 17298, 11887,  3859,  3488,
        9036,  5533,  4007,  3899,  4884,  5664, 10741,  7678, 12092,
       10182,  6652,  6793, 10117,  9614,  3727,  5849,  4028,  5505,
        8462, 10059,  6581,  8282,  4757,  5116,  4578,  5452,  6023,
        8340,  6440,  8623,  7308,  6197, 21271, 23176, 12975, 17138,
       26783, 30575, 31931, 29697, 17986, 16987, 19802, 15632, 14273,
       10594,  4807,  4551,  4028,  4357,  4067,  4206,  3505,  4137,
        3311,  3582,  3586,  4218,  4529,  4360,  6767,  7579, 14462,
       24291, 10638, 11250, 29619,  9678, 23322, 18205,  7625,  9891,
        5423,  5808, 14438, 17251,  7833, 15226,  7123,  3708,  6135,
        4110,  3587,  3222,  3074,  3941,  3846,  3402,  3564,  3425,
        4125,  4026,  3689,  4084, 16617, 13577,  6911,  4837, 13553,
       10074,  9534, 20824,  4976,  6707,  5396,  8366, 13494, 19766,
       11012, 16130,  8521,  8245,  6871,  5977,  8789, 10016,  6517,
        8019,  6122,  5465,  5414,  4934,  5788,  6139,  4310,  4144,
       11437, 30731, 13741, 27285, 40227, 16320, 23039, 10812, 14686,
       27690, 15155, 32701, 18780,  5895, 23348,  6081, 17050, 28498,
       35232, 26223, 22341, 15867, 17688,  8580, 24895, 13027, 11223,
        7880,  8386,  6988,  5815,  4717,  9088,  8283, 12059,  9161,
        6952,  4914,  6652,  4666, 12014, 10703, 16518, 10270,  6724,
        4553,  9282,  4981])

And using healpy’s mollview we can visualize this.

[36]:
hp.mollview(counts, nest=True)
_images/tutorial_70_0.png

To simplify life, Vaex includes DataFrame.healpix_count to take care of this.

[37]:
counts = tgas.healpix_count(healpix_level=6)
hp.mollview(counts, nest=True)
_images/tutorial_72_0.png

Or even simpler, use DataFrame.viz.healpix_heatmap

[38]:
tgas.viz.healpix_heatmap(
    f="log1p",
    healpix_level=6,
    figsize=(10,8),
    healpix_output="ecliptic"
)
_images/tutorial_74_0.png

xarray suppport

The df.count method can also return an xarray data array instead of a numpy array. This is easily done via the array_type keyword. Building on top of numpy, xarray adds dimension labels, coordinates and attributes, that makes working with multi-dimensional arrays more convenient.

[39]:
xarr = df.count(binby=[df.x, df.y], limits=[-10, 10], shape=64, array_type='xarray')
xarr
[39]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
  • x: 64
  • y: 64
  • 6 3 7 9 10 13 6 13 17 7 12 15 7 14 ... 11 8 10 6 7 5 17 9 10 10 6 5 7
    array([[ 6,  3,  7, ..., 15, 10, 11],
           [10,  3,  7, ..., 10, 13, 11],
           [ 5, 15,  5, ..., 12, 18, 12],
           ...,
           [ 7,  8, 10, ...,  6,  7,  7],
           [12, 10, 17, ..., 11,  8,  2],
           [ 7, 10, 13, ...,  6,  5,  7]])
    • x
      (x)
      float64
      -9.844 -9.531 ... 9.531 9.844
      array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
             -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
             -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
             -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
             -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
              1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
              3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
              5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
              7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
              9.84375])
    • y
      (y)
      float64
      -9.844 -9.531 ... 9.531 9.844
      array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
             -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
             -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
             -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
             -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
              1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
              3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
              5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
              7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
              9.84375])

In addition, xarray also has a plotting method that can be quite convenient. Since the xarray object has information about the labels of each dimension, the plot axis will be automatially labeled.

[40]:
xarr.plot();
_images/tutorial_78_0.png

Having xarray as output helps us to explore the contents of our data faster. In the following example we show how easy it is to plot the 2D distribution of the positions of the samples (x, y), per id group.

Notice how xarray automatically adds the appropriate titles and axis labels to the figure.

[41]:
df.categorize('id', inplace=True)  # treat the id as a categorical column - automatically adjusts limits and shape
xarr = df.count(binby=['x', 'y', 'id'], limits='95%', array_type='xarray')
np.log1p(xarr).plot(col='id', col_wrap=7);
_images/tutorial_80_0.png

Interactive widgets

Note: The interactive widgets require a running Python kernel, if you are viewing this documentation online you can get a feeling for what the widgets can do, but computation will not be possible!

Using the vaex-jupyter package, we get access to interactive widgets (go see the Vaex Jupyter tutorial for a more in depth tutorial)

[42]:
import vaex
import vaex.jupyter
import numpy as np
import matplotlib.pyplot as plt
df = vaex.example()

The simplest way to get a more interactive visualization (or even print out statistics) is to use the vaex.jupyter.interactive_selection decorator, which will execute the decorated function each time the selection is changed.

[43]:
df.select(df.x > 0)
@vaex.jupyter.interactive_selection(df)
def plot(*args, **kwargs):
    print("Mean x for the selection is:", df.mean(df.x, selection=True))
    df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), selection=[None, True], limits='99.7%')
    plt.show()

After changing the selection programmatically, the visualization will update, as well as the print output.

[44]:
df.select(df.x > df.y)

However, to get truly interactive visualization, we need to use widgets, such as the bqplot library. Again, if we make a selection here, the above visualization will also update, so lets select a square region.

See more interactive widgets in the Vaex Jupyter tutorial

Joining

Joining in Vaex is similar to Pandas, except the data will no be copied. Internally an index array is kept for each row on the left DataFrame, pointing to the right DataFrame, requiring about 8GB for a billion row \(10^9\) dataset. Lets start with 2 small DataFrames, df1 and df2:

[47]:
a = np.array(['a', 'b', 'c'])
x = np.arange(1,4)
df1 = vaex.from_arrays(a=a, x=x)
df1
[47]:
# a x
0a 1
1b 2
2c 3
[48]:
b = np.array(['a', 'b', 'd'])
y = x**2
df2 = vaex.from_arrays(b=b, y=y)
df2
[48]:
# b y
0a 1
1b 4
2d 9

The default join, is a ‘left’ join, where all rows for the left DataFrame (df1) are kept, and matching rows of the right DataFrame (df2) are added. We see that for the columns b and y, some values are missing, as expected.

[49]:
df1.join(df2, left_on='a', right_on='b')
[49]:
# a xb y
0a 1a 1
1b 2b 4
2c 3-- --

A ‘right’ join, is basically the same, but now the roles of the left and right label swapped, so now we have some values from columns x and a missing.

[50]:
df1.join(df2, left_on='a', right_on='b', how='right')
[50]:
# b ya x
0a 1a 1
1b 4b 2
2d 9-- --

We can also do ‘inner’ join, in which the output DataFrame has only the rows common between df1 and df2.

[51]:
df1.join(df2, left_on='a', right_on='b', how='inner')
[51]:
# a xb y
0a 1a 1
1b 2b 4

Other joins (e.g. outer) are currently not supported. Feel free to open an issue on GitHub for this.

Group-by

With Vaex one can also do fast group-by aggregations. The output is Vaex DataFrame. Let us see few examples.

[52]:
import vaex
animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]
df_pets = vaex.from_arrays(animal=animal, age=age, cuteness=cuteness)
df_pets
[52]:
# animal age cuteness
0dog 2 9
1dog 1 10
2cat 5 5
3guinea pig 1 8
4guinea pig 3 4
5dog 7 8

The syntax for doing group-by operations is virtually identical to that of Pandas. Note that when multiple aggregations are passed to a single column or expression, the output colums are appropriately named.

[53]:
df_pets.groupby(by='animal').agg({'age': 'mean',
                                  'cuteness': ['mean', 'std']})
[53]:
# animal age cuteness_mean cuteness_std
0dog 3.33333 9 0.816497
1cat 5 5 0
2guinea pig2 6 2

Vaex supports a number of aggregation functions:

In addition, we can specify the aggregation operations inside the groupby-method. Also we can name the resulting aggregate columns as we wish.

[54]:
df_pets.groupby(by='animal',
                agg={'mean_age': vaex.agg.mean('age'),
                     'cuteness_unique_values': vaex.agg.nunique('cuteness'),
                     'cuteness_unique_min': vaex.agg.min('cuteness')})
[54]:
# animal mean_age cuteness_unique_values cuteness_unique_min
0dog 3.33333 3 8
1cat 5 1 5
2guinea pig 2 2 4

A powerful feature of the aggregation functions in Vaex is that they support selections. This gives us the flexibility to make selections while aggregating. For example, let’s calculate the mean cuteness of the pets in this example DataFrame, but separated by age.

[55]:
df_pets.groupby(by='animal',
                agg={'mean_cuteness_old': vaex.agg.mean('cuteness', selection='age>=5'),
                     'mean_cuteness_young': vaex.agg.mean('cuteness', selection='~(age>=5)')})
[55]:
# animal mean_cuteness_old mean_cuteness_young
0dog 8 9.5
1cat 5 nan
2guinea pig nan 6

Note that in the last example, the grouped DataFrame contains NaNs for the groups in which there are no samples.

String processing

String processing is similar to Pandas, except all operations are performed lazily, multithreaded, and faster (in C++). Check the API docs for more examples.

[56]:
import vaex
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
df = vaex.from_arrays(text=text)
df
[56]:
# text
0Something
1very pretty
2is coming
3our
4way.
[57]:
df.text.str.upper()
[57]:
Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0    SOMETHING
1  VERY PRETTY
2    IS COMING
3          OUR
4         WAY.
[58]:
df.text.str.title().str.replace('et', 'ET')
[58]:
Expression = str_replace(str_title(text), 'et', 'ET')
Length: 5 dtype: str (expression)
---------------------------------
0    SomEThing
1  Very PrETty
2    Is Coming
3          Our
4         Way.
[59]:
df.text.str.contains('e')
[59]:
Expression = str_contains(text, 'e')
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1   True
2  False
3  False
4  False
[60]:
df.text.str.count('e')
[60]:
Expression = str_count(text, 'e')
Length: 5 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  0
3  0
4  0

Propagation of uncertainties

In science one often deals with measurement uncertainties (sometimes refererred to as measurement errors). When transformations are made with quantities that have uncertainties associated with them, the uncertainties on these transformed quantities can be calculated automatically by Vaex. Note that propagation of uncertainties requires derivatives and matrix multiplications of lengthy equations, which is not complex, but tedious. Vaex can automatically calculate all dependencies, derivatives and compute the full covariance matrix.

As an example, let us use the TGAS astronomy dataset once again. Even though the TGAS dataset already contains galactic sky coordinates (l and b), let’s add them again by performing a coordinate system rotation from RA. and Dec. We can apply a similar transformation and convert from the Sperical galactic to Cartesian coordinates.

[61]:
# convert parallas to distance
tgas.add_virtual_columns_distance_from_parallax(tgas.parallax)
# 'overwrite' the real columns 'l' and 'b' with virtual columns
tgas.add_virtual_columns_eq2gal('ra', 'dec', 'l', 'b')
# and combined with the galactic sky coordinates gives galactic cartesian coordinates of the stars
tgas.add_virtual_columns_spherical_to_cartesian(tgas.l, tgas.b, tgas.distance, 'x', 'y', 'z')
[61]:
# astrometric_delta_q astrometric_excess_noise astrometric_excess_noise_sig astrometric_n_bad_obs_ac astrometric_n_bad_obs_al astrometric_n_good_obs_ac astrometric_n_good_obs_al astrometric_n_obs_ac astrometric_n_obs_al astrometric_primary_flag astrometric_priors_used astrometric_relegation_factor astrometric_weight_ac astrometric_weight_al b dec dec_error dec_parallax_corr dec_pmdec_corr dec_pmra_corr duplicated_source ecl_lat ecl_lon hip l matched_observations parallax parallax_error parallax_pmdec_corr parallax_pmra_corr phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_mag phot_g_n_obs phot_variable_flag pmdec pmdec_error pmra pmra_error pmra_pmdec_corr ra ra_dec_corr ra_error ra_parallax_corr ra_pmdec_corr ra_pmra_corr random_index ref_epoch scan_direction_mean_k1 scan_direction_mean_k2 scan_direction_mean_k3 scan_direction_mean_k4 scan_direction_strength_k1 scan_direction_strength_k2 scan_direction_strength_k3 scan_direction_strength_k4 solution_id source_id tycho2_id distance x y z
0 1.9190566539764404 0.7171010000916003 412.6059727233687 1 0 78 79 79 79 84 3 2.9360971450805664 1.2669624084082898e-05 1.818157434463501 -16.1210428281140140.23539164875137225 0.21880220693566088-0.4073381721973419 0.06065881997346878 -0.09945132583379745 70 -16.12105217335385342.64182504417002 13989 42.641804308626725 9 6.35295075173405 0.3079103606852086 -0.10195717215538025 -0.001576789305545389710312332.17299333210577.365273118843 7.991377829505826 77 b'NOT_AVAILABLE' -7.641989988351149 0.0874017933455474743.75231341609215 0.070542206426400810.21467718482017517 45.03433035439128 -0.41497212648391724 0.305989282002827270.17996619641780853-0.08575969189405441 0.15920649468898773 243619 2015.0 -113.76032257080078 21.39291763305664 -41.67839813232422 26.201841354370117 0.3823484778404236 0.5382660627365112 0.3923785090446472 0.9163063168525696 16353784107819335687627862074752 b'' 0.157407170160582170.111236040400056370.10243667003803988 -0.04370685490397632
1 nan 0.2534628812968044 47.316290890180255 2 0 55 57 57 57 84 5 2.6523141860961914 3.1600175134371966e-05 12.861557006835938 -16.19302376369384 0.2000676896877873 1.1977893944215496 0.8376259803771973 -0.9756439924240112 0.9725773334503174 70 -16.19303311057312 42.761180489478576-214748364842.76115974936648 8 3.90032893506844 0.3234880030045522 -0.8537789583206177 0.8397389650344849 949564.6488279914 1140.173576223928 10.58095871890025662 b'NOT_AVAILABLE' -55.10917285969142 2.522928801165149 10.03626300124532 4.611413518289133 -0.9963987469673157 45.1650067708984 -0.9959233403205872 2.583882288511597 -0.86091065406799320.9734798669815063 -0.9724165201187134 487238 2015.0 -156.432861328125 22.76607322692871 -36.23965835571289 22.890602111816406 0.7110026478767395 0.9659702777862549 0.6461148858070374 0.8671600818634033 16353784107819335689277129363072 b'55-28-1' 0.256388631996868450.1807701962996959 0.16716755815017084 -0.07150016957395491
2 nan 0.3989006354041912 221.18496561724646 4 1 57 60 61 61 84 5 3.9934017658233643 2.5633918994572014e-05 5.767529487609863 -16.12335382439265 0.24882543945301736 0.1803264123376257 -0.39189115166664124-0.193255528807640080.08942046016454697 70 -16.12336317040229642.69750168007008 -214748364842.69748094193635 7 3.15531322003673730.2734838183180671 -0.11855248361825943 -0.0418587327003479 817837.6000768564 1827.3836759985832 10.74310238043427360 b'NOT_AVAILABLE' -1.602867102186794 1.0352589283446592 2.9322836829569003 1.908644426623371 -0.9142706990242004 45.08615483797584 -0.1774432212114334 0.2138361631952843 0.30772241950035095-0.1848166137933731 0.04686680808663368 1948952 2015.0 -117.00751495361328 19.772153854370117 -43.108219146728516 26.7157039642334 0.4825277626514435 0.4287584722042084 0.5241528153419495 0.9030616879463196 163537841078193356813297218905216 b'55-1191-1' 0.316925747228465950.223761030194755460.2064625216744117 -0.08801225918215205
3 nan 0.4224923646481251 179.98201436339852 1 0 51 52 52 52 84 5 4.215157985687256 2.8672602638835087e-05 5.3608622550964355 -16.1182068792970340.24821079122833972 0.20095844850181172-0.33721715211868286-0.223501190543174740.13181143999099731 70 -16.11821622503516 42.67779093546686 -214748364842.67777019818556 7 2.292366835156796 0.2809724206784257 -0.10920235514640808 -0.049440864473581314 602053.4754362862 905.8772856344845 11.07568239443574561 b'NOT_AVAILABLE' -18.4149121148257321.1298513589995536 3.661982345981763 2.065051873379775 -0.9261773228645325 45.06654155758114 -0.36570677161216736 0.2760390513575931 0.2028782218694687 -0.058928851038217545 -0.050908856093883514102321 2015.0 -132.42112731933594 22.56928253173828 -38.95445251464844 25.878559112548828 0.4946548640727997 0.6384561061859131 0.5090736746788025 0.8989177942276001 163537841078193356813469017597184 b'55-624-1' 0.436230355745659160.308100140405318630.2840853806346911 -0.12110624783986161
4 nan 0.3175001122010629 119.74837853832186 2 3 85 84 87 87 84 5 3.2356362342834473 2.22787512029754e-05 8.080779075622559 -16.0554718307503740.33504360351532875 0.1701298562030361 -0.43870800733566284-0.279348850250244140.12179157137870789 70 -16.0554811777948 42.77336987816832 -214748364842.77334913546197 11 1.582076960273368 0.2615394689640736 -0.329196035861969 0.10031197965145111 1388122.242048847 2826.428866453177 10.16870078127108896 b'NOT_AVAILABLE' -2.379387386351838 0.7106320061478508 0.340802333695025161.2204755227890713 -0.8336043357849121 45.13603822322069 -0.0490525588393211360.170696952833767760.4714251756668091 -0.1563923954963684 -0.15207625925540924 409284 2015.0 -106.85968017578125 4.452099323272705 -47.8953971862793 26.755468368530273 0.5206537842750549 0.23930974304676056 0.653376579284668 0.8633849024772644 163537841078193356815736760328576 b'55-849-1' 0.6320805024726543 0.445878380954020440.41250283253756015 -0.17481316927621393
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2,057,04525.898868560791016 0.6508009723190962 172.3136755413185 0 0 54 54 54 54 84 3 6.386378765106201 1.8042501324089244e-05 2.2653496265411377 16.006806970347426 -0.423196860251580430.249741476396420750.00821441039443016 0.2133195698261261 -0.00080527918180450870 16.006807041815204 317.0782357688112 103561 -42.92178788756781 8 5.07430693974197760.2840892420661878 -0.0308084636926651 -0.03397708386182785 4114975.455725508 3447.5776608146016 8.988851940956916 69 b'NOT_AVAILABLE' -4.440524133201202 0.0474329790178223721.970772995655643 0.078468931186690470.3920176327228546 314.741700437929240.08548042178153992 0.2773321068969684 0.2473779171705246 -0.00060404307441785930.11652233451604843 1595738 2015.0 -18.078920364379883 -17.731922149658203 38.27400588989258 27.63787269592285 0.29217642545700073 0.11402469873428345 0.0404343381524086 0.937016487121582 16353784107819335686917488998546378368b'' 0.197071247733951380.13871698568448773-0.129002113090694430.054342703136315784
2,057,046nan 0.17407523451856974 28.886549102578012 0 2 54 52 54 54 84 5 1.9612410068511963 2.415467497485224e-05 24.774322509765625 16.12926993546893 -0.324975343682328940.148233655691999750.8842677474021912 -0.9121489524841309 -0.8994856476783752 70 16.129270018016896 317.0105462544942 -2147483648-42.98947742356782 7 1.69834808174399220.7410137777358506 -0.9793509840965271 -0.9959075450897217 1202425.5197785893871.2480333575235 10.32462460143572359 b'NOT_AVAILABLE' -10.4012251112689621.4016954983272711 -1.28356129908418742.7416807292293637 0.980453610420227 314.643817893111930.8981446623802185 0.3590974400544809 0.9818224906921387 -0.9802247881889343 -0.9827051162719727 2019553 2015.0 -87.07184600830078 -31.574886322021484 -36.37055206298828 29.130958557128906 0.22651544213294983 0.07730517536401749 0.2675701975822449 0.9523505568504333 16353784107819335686917493705830041600b'5179-753-1' 0.5888074481016426 0.4137467499267554 -0.385683048078504840.16357391078619246
2,057,047nan 0.47235246463190794 92.12190417660749 2 0 34 36 36 36 84 5 4.68601131439209 2.138371200999245e-05 3.9279115200042725 15.92496896432183 -0.343177320443203870.20902981533215972-0.2000708132982254 0.31042322516441345 -0.3574342727661133 70 15.924968943694909 317.6408327998631 -2147483648-42.3591908420944146 6.036938108863445 0.39688014089787665-0.7275367975234985 -0.25934046506881714 3268640.52536146954918.5087736624755 9.238852161621992 51 b'NOT_AVAILABLE' -27.8523447526722451.2778575351686428 15.713555906870294 0.9411842746983148 -0.1186852976679802 315.2828795933192 -0.47665935754776 0.4722647631556871 0.704002320766449 -0.77033931016922 0.12704335153102875 788948 2015.0 -21.23501205444336 20.132535934448242 33.55913162231445 26.732301712036133 0.41511622071266174 0.5105549693107605 0.15976844727993011 0.9333845376968384 16353784107819335686917504975824469248b'5192-877-1' 0.165646886214022630.11770477437507047-0.107325590749532430.045449912782963474
2,057,048nan 0.3086465263182493 76.66564461310193 1 2 52 51 53 53 84 5 3.154139280319214 1.9043474821955897e-05 9.627826690673828 16.193728871838935 -0.228113600435448820.131650037775767 0.3082593083381653 -0.5279345512390137 -0.4065483510494232 70 16.193728933791913 317.1363617703344 -2147483648-42.86366191921117 7 1.484142306295484 0.34860128377301614-0.7272516489028931 -0.9375584125518799 4009408.31726829061929.9834553649182 9.017069346445364 60 b'NOT_AVAILABLE' 1.8471079057572073 0.7307171627866237 11.352888915160555 1.219847308406543 0.7511345148086548 314.7406481637209 0.41397571563720703 0.192052966417785630.7539510726928711 -0.7239754796028137 -0.7911394238471985 868066 2015.0 -89.73970794677734 -25.196216583251953 -35.13546371459961 29.041872024536133 0.21430812776088715 0.06784655898809433 0.2636755108833313 0.9523414969444275 16353784107819335686917517998165066624b'5179-1401-1'0.6737898352187435 0.4742760432178817 -0.440164289459801350.18791055094922077
2,057,049nan 0.4329850465924866 60.789771079095715 0 0 26 26 26 26 84 5 4.3140177726745605 2.7940122890868224e-05 4.742301940917969 16.135962442685898 -0.221300816243519350.2686748166142929 -0.466053694486618040.30018869042396545 -0.3290684223175049 70 16.13596246842634 317.3575812619557 -2147483648-42.6424424173883245 2.680111343641743 0.4507741964825321 -0.689416229724884 -0.1735922396183014 2074338.153903563 4136.498086035368 9.732571175024953 31 b'NOT_AVAILABLE' 3.15173423618292 1.4388911228835037 2.897878776243949 1.0354817855168323 -0.21837876737117767314.960730599014 -0.4467950165271759 0.491820509447922160.7087226510047913 -0.8360105156898499 0.2156151533126831 1736132 2015.0 -63.01319885253906 18.303699493408203 -49.05630111694336 28.76698875427246 0.3929939866065979 0.32352808117866516 0.24211134016513824 0.9409775733947754 16353784107819335686917521537218608640b'5179-1719-1'0.3731188267130712 0.2636519673685346 -0.242801102164863340.10369630532457579

Since RA. and Dec. are in degrees, while ra_error and dec_error are in miliarcseconds, we need put them on the same scale

[62]:
tgas['ra_error'] = tgas.ra_error / 1000 / 3600
tgas['dec_error'] = tgas.dec_error / 1000 / 3600

We now let Vaex sort out what the covariance matrix is for the Cartesian coordinates x, y, and z. Then take 50 samples from the dataset for visualization.

[63]:
tgas.propagate_uncertainties([tgas.x, tgas.y, tgas.z])
tgas_50 = tgas.sample(50, random_state=42)

For this small subset of the dataset we can visualize the uncertainties, with and without the covariance.

[64]:
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty, cov=tgas_50.y_x_covariance)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
_images/tutorial_120_0.png
_images/tutorial_120_1.png

From the second plot, we see that showing error ellipses (so narrow that they appear as lines) instead of error bars reveal that the distance information dominates the uncertainty in this case.

Just-In-Time compilation

Let us start with a function that calculates the angular distance between two points on a surface of a sphere. The input of the function is a pair of 2 angular coordinates, in radians.

[65]:
import vaex
import numpy as np
# From http://pythonhosted.org/pythran/MANUAL.html
def arc_distance(theta_1, phi_1, theta_2, phi_2):
    """
    Calculates the pairwise arc distance
    between all points in vector a and b.
    """
    temp = (np.sin((theta_2-2-theta_1)/2)**2
           + np.cos(theta_1)*np.cos(theta_2) * np.sin((phi_2-phi_1)/2)**2)
    distance_matrix = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance_matrix

Let us use the New York Taxi dataset of 2015, as can be downloaded in hdf5 format

[66]:
# nytaxi = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
nytaxi = vaex.open('/Users/jovan/Work/vaex-work/vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')
# lets use just 20% of the data, since we want to make sure it fits
# into memory (so we don't measure just hdd/ssd speed)
nytaxi.set_active_fraction(0.2)

Although the function above expects Numpy arrays, Vaex can pass in columns or expression, which will delay the execution untill it is needed, and add the resulting expression as a virtual column.

[67]:
nytaxi['arc_distance'] = arc_distance(nytaxi.pickup_longitude * np.pi/180,
                                      nytaxi.pickup_latitude * np.pi/180,
                                      nytaxi.dropoff_longitude * np.pi/180,
                                      nytaxi.dropoff_latitude * np.pi/180)

When we calculate the mean angular distance of a taxi trip, we encounter some invalid data, that will give warnings, which we can safely ignore for this demonstration.

[68]:
%%time
nytaxi.mean(nytaxi.arc_distance)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sqrt
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sin
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in cos
  return function(*args, **kwargs)
CPU times: user 44.5 s, sys: 5.03 s, total: 49.5 s
Wall time: 6.14 s
[68]:
array(1.99993285)

This computation uses quite some heavy mathematical operations, and since it’s (internally) using Numpy arrays, also uses quite some temporary arrays. We can optimize this calculation by doing a Just-In-Time compilation, based on numba, pythran, or if you happen to have an NVIDIA graphics card cuda. Choose whichever gives the best performance or is easiest to install.

[69]:
nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_numba()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_pythran()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_cuda()
[70]:
%%time
nytaxi.mean(nytaxi.arc_distance_jit)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/expression.py:1038: RuntimeWarning: invalid value encountered in f
  return self.f(*args, **kwargs)
CPU times: user 25.7 s, sys: 330 ms, total: 26 s
Wall time: 2.31 s
[70]:
array(1.9999328)

We can get a significant speedup (\(\sim 3 x\)) in this case.

Parallel computations

As mentioned in the sections on selections, Vaex can do computations in parallel. Often this is taken care of, for instance, when passing multiple selections to a method, or multiple arguments to one of the statistical functions. However, sometimes it is difficult or impossible to express a computation in one expression, and we need to resort to doing so called ‘delayed’ computation, similar as in joblib and dask.

[71]:
import vaex
df = vaex.example()
limits = [-10, 10]
delayed_count = df.count(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)
delayed_count
[71]:
<vaex.promise.Promise at 0x7ffbd64072d0>

Note that now the returned value is now a promise (TODO: a more Pythonic way would be to return a Future). This may be subject to change, and the best way to work with this is to use the delayed decorator. And call DataFrame.execute when the result is needed.

In addition to the above delayed computation, we schedule more computation, such that both the count and mean are executed in parallel such that we only do a single pass over the data. We schedule the execution of two extra functions using the vaex.delayed decorator, and run the whole pipeline using df.execute().

[72]:
delayed_sum = df.sum(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)

@vaex.delayed
def calculate_mean(sums, counts):
    print('calculating mean')
    return sums/counts

print('before calling mean')
# since calculate_mean is decorated with vaex.delayed
# this now also returns a 'delayed' object (a promise)
delayed_mean = calculate_mean(delayed_sum, delayed_count)

# if we'd like to perform operations on that, we can again
# use the same decorator
@vaex.delayed
def print_mean(means):
    print('means', means)
print_mean(delayed_mean)

print('before calling execute')
df.execute()

# Using the .get on the promise will also return the result
# However, this will only work after execute, and may be
# subject to change
means = delayed_mean.get()
print('same means', means)

before calling mean
before calling execute
calculating mean
means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]
same means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]

Extending Vaex

Vaex can be extended using several mechanisms.

Adding functions

Use the vaex.register_function decorator API to add new functions.

[73]:
import vaex
import numpy as np
@vaex.register_function()
def add_one(ar):
    return ar+1

The function can be invoked using the df.func accessor, to return a new expression. Each argument that is an expresssion, will be replaced by a Numpy array on evaluations in any Vaex context.

[74]:
df = vaex.from_arrays(x=np.arange(4))
df.func.add_one(df.x)
[74]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4

By default (passing on_expression=True), the function is also available as a method on Expressions, where the expression itself is automatically set as the first argument (since this is a quite common use case).

[75]:
df.x.add_one()
[75]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4

In case the first argument is not an expression, pass on_expression=True, and use df.func.<funcname>, to build a new expression using the function:

[76]:
@vaex.register_function(on_expression=False)
def addmul(a, b, x, y):
    return a*x + b * y
[77]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df.func.addmul(2, 3, df.x, df.y)
[77]:
Expression = addmul(2, 3, x, y)
Length: 4 dtype: int64 (expression)
-----------------------------------
0   0
1   5
2  16
3  33

These expressions can be added as virtual columns, as expected.

[78]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df['z'] = df.func.addmul(2, 3, df.x, df.y)
df['w'] = df.x.add_one()
df
[78]:
# x y z w
0 0 0 0 1
1 1 1 5 2
2 2 4 16 3
3 3 9 33 4

Adding DataFrame accessors

When adding methods that operate on Dataframes, it makes sense to group them together in a single namespace.

[79]:
@vaex.register_dataframe_accessor('scale', override=True)
class ScalingOps(object):
    def __init__(self, df):
        self.df = df

    def mul(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] * a
        return df

    def add(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] + a
        return df
[80]:
df.scale.add(1)
[80]:
# x y z w
0 1 1 1 2
1 2 2 6 3
2 3 5 17 4
3 4 10 34 5
[81]:
df.scale.mul(2)
[81]:
# x y z w
0 0 0 0 2
1 2 2 10 4
2 4 8 32 6
3 6 18 66 8

Convenience methods

Get column names

We often want to work with a subset of columns in our DataFrame. With the get_column_names method, Vaex makes it quite easy and convenient to get the exact columns you need. By default, get_column_names returns all the columns:

[1]:
import vaex
df = vaex.datasets.titanic()
print(df.get_column_names())
['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

The same method has a few arguments that makes it easy to get right subset of columns you need. For example, one can pass a regex expression on how to select columns based on their names. In the cell below we select all columns whose names are 5 characters long:

[2]:
print(df.get_column_names(regex='^[a-zA-Z]{5}$'))
['sibsp', 'parch', 'cabin']

We can also select columns based on type. Below we select all columns that are integers or floats:

[3]:
df.get_column_names(dtype=['int', 'float'])
[3]:
['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']

The escape hatch: apply

In case a calculation cannot be expressed as a Vaex expression, one can use the apply method as a last resort. This can be useful if the function you want to apply is written in pure Python, a third party library, and is difficult or impossible to vectorize.

We think apply should only be used as a last resort, because it needs to use multiprocessing (which spawns new processes) to avoid the Python Global Interpreter Lock (GIL) to make use of multiple cores. This comes at a cost of having to transfer the data between the main and child processes.

Here is an example which uses the apply method:

[1]:
import vaex

def slow_is_prime(x):
    return x > 1 and all((x % i) != 0 for i in range(2, x))

df = vaex.from_arrays(x=vaex.vrange(0, 100_000, dtype='i4'))
# you need to explicitly specify which arguments you need
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
df.head(10)
[1]:
# xis_prime
0 0False
1 1False
2 2True
3 3True
4 4False
5 5True
6 6False
7 7True
8 8False
9 9False
[2]:
prime_count = df.is_prime.sum()
print(f'There are {prime_count} prime numbers between 0 and {len(df)}')
There are 9592 prime number between 0 and 100000
[3]:
# both of these are equivalent
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
# but this form only works for a single argument
df['is_prime'] = df.x.apply(slow_is_prime)

When not to use apply

You should not use apply when your function can be vectorized. When you use Vaex’ expression system, we know what you do, we see the expression, and can manipulate it in order to achieve optimal performance. An apply function is like a black box, we cannot do anything with it, like JIT-ting for instance.

[4]:
df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))
[5]:
# ideal case
df['y'] = df.x**2
[6]:
%%timeit
df.y.sum()
29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[7]:
# will transfer the data to child processes, and execute the ** operation in Python for each element
df['y_slow'] = df.x.apply(lambda x: x**2)
[8]:
%%timeit
df.y_slow.sum()
353 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[9]:
# bad idea: it will transfer the data to the child process, where it will be executed in vectorized form
df['y_slow_vectorized'] = df.x.apply(lambda x: x**2, vectorize=True)
[10]:
%%timeit
df.y_slow_vectorized.sum()
82.8 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[11]:
# bad idea: same performance as just dy['y'], but we lose the information about what was done
df['y_fast'] = df.x.apply(lambda x: x**2, vectorize=True, multiprocessing=False)
[12]:
%%timeit
df.y_fast.sum()
28.8 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Machine Learning with vaex.ml

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

The vaex.ml package brings some machine learning algorithms to vaex. If you installed the individual subpackages (vaex-core, vaex-hdf5, …) instead of the vaex metapackage, you may need to install it by running pip install vaex-ml, or conda install -c conda-forge vaex-ml.

The API of vaex.ml stays close to that of scikit-learn, while providing better performance and the ability to efficiently perform operations on data that is larger than the available RAM. This page is an overview and a brief introduction to the capabilities offered by vaex.ml.

[1]:
import vaex
vaex.multithreading.thread_count_default = 8
import vaex.ml

import numpy as np
import matplotlib.pyplot as plt

We will use the well known Iris flower and Titanic passenger list datasets, two classical datasets for machine learning demonstrations.

[2]:
df = vaex.datasets.iris()
df
[2]:
# sepal_length sepal_width petal_length petal_width class_
0 5.9 3.0 4.2 1.5 1
1 6.1 3.0 4.6 1.4 1
2 6.6 2.9 4.6 1.3 1
3 6.7 3.3 5.7 2.1 2
4 5.5 4.2 1.4 0.2 0
... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0
1465.1 3.8 1.6 0.2 0
1475.8 2.6 4.0 1.2 1
1485.7 3.8 1.7 0.3 0
1496.2 2.9 4.3 1.3 1
[3]:
df.scatter(df.petal_length, df.petal_width, c_expr=df.class_);
/home/jovan/vaex/packages/vaex-core/vaex/viz/mpl.py:205: UserWarning: `scatter` is deprecated and it will be removed in version 5.x. Please use `df.viz.scatter` instead.
  warnings.warn('`scatter` is deprecated and it will be removed in version 5.x. Please use `df.viz.scatter` instead.')
_images/tutorial_ml_5_1.png

Preprocessing

Scaling of numerical features

vaex.ml packs the common numerical scalers:

  • vaex.ml.StandardScaler - Scale features by removing their mean and dividing by their variance;

  • vaex.ml.MinMaxScaler - Scale features to a given range;

  • vaex.ml.RobustScaler - Scale features by removing their median and scaling them according to a given percentile range;

  • vaex.ml.MaxAbsScaler - Scale features by their maximum absolute value.

The usage is quite similar to that of scikit-learn, in the sense that each transformer implements the .fit and .transform methods.

[4]:
features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
scaler = vaex.ml.StandardScaler(features=features, prefix='scaled_')
scaler.fit(df)
df_trans = scaler.transform(df)
df_trans
[4]:
# sepal_length sepal_width petal_length petal_width class_ scaled_petal_length scaled_petal_width scaled_sepal_length scaled_sepal_width
0 5.9 3.0 4.2 1.5 1 0.25096730693923325 0.39617188299171285 0.06866179325140277 -0.12495760117130607
1 6.1 3.0 4.6 1.4 1 0.4784301228962429 0.26469891297233916 0.3109975341387059 -0.12495760117130607
2 6.6 2.9 4.6 1.3 1 0.4784301228962429 0.13322594295296575 0.9168368863569659 -0.3563605663033572
3 6.7 3.3 5.7 2.1 2 1.1039528667780207 1.1850097031079545 1.0380047568006185 0.5692512942248463
4 5.5 4.2 1.4 0.2 0 -1.341272404759837 -1.3129767272601438 -0.4160096885232057 2.6518779804133055
... ... ... ... ... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0 -1.341272404759837 -1.3129767272601438 -0.7795132998541615 0.8006542593568975
1465.1 3.8 1.6 0.2 0 -1.2275409967813318 -1.3129767272601438 -0.9006811702978141 1.726266119885101
1475.8 2.6 4.0 1.2 1 0.13723589896072813 0.0017529729335920385-0.052506077192249874-1.0505694616995096
1485.7 3.8 1.7 0.3 0 -1.1706752927920796 -1.18150375724077 -0.17367394763590144 1.726266119885101
1496.2 2.9 4.3 1.3 1 0.30783301092848553 0.13322594295296575 0.4321654045823586 -0.3563605663033572

The output of the .transform method of any vaex.ml transformer is a shallow copy of a DataFrame that contains the resulting features of the transformations in addition to the original columns. A shallow copy means that this new DataFrame just references the original one, and no extra memory is used. In addition, the resulting features, in this case the scaled numerical features are virtual columns, which do not take any memory but are computed on the fly when needed. This approach is ideal for working with very large datasets.

Encoding of categorical features

vaex.ml contains several categorical encoders:

  • vaex.ml.LabelEncoder - Encoding features with as many integers as categories, startinfg from 0;

  • vaex.ml.OneHotEncoder - Encoding features according to the one-hot scheme;

  • vaex.ml.MultiHotEncoder - Encoding features according to the multi-hot scheme (binary vector);

  • vaex.ml.FrequencyEncoder - Encode features by the frequency of their respective categories;

  • vaex.ml.BayesianTargetEncoder - Encode categories with the mean of their target value;

  • vaex.ml.WeightOfEvidenceEncoder - Encode categories their weight of evidence value.

The following is a quick example using the Titanic dataset.

[5]:
df =  vaex.datasets.titanic()
df.head(5)
[5]:
# pclasssurvived name sex age sibsp parch ticket farecabin embarked boat bodyhome_dest
0 1True Allen, Miss. Elisabeth Walton female29 0 0 24160211.338B5 S 2 nanSt Louis, MO
1 1True Allison, Master. Hudson Trevor male 0.9167 1 2 113781151.55 C22 C26S 11 nanMontreal, PQ / Chesterville, ON
2 1False Allison, Miss. Helen Loraine female 2 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON
3 1False Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781151.55 C22 C26S -- 135Montreal, PQ / Chesterville, ON
4 1False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON
[6]:
label_encoder = vaex.ml.LabelEncoder(features=['embarked'])
one_hot_encoder = vaex.ml.OneHotEncoder(features=['embarked'])
multi_hot_encoder = vaex.ml.MultiHotEncoder(features=['embarked'])
freq_encoder = vaex.ml.FrequencyEncoder(features=['embarked'])
bayes_encoder = vaex.ml.BayesianTargetEncoder(features=['embarked'], target='survived')
woe_encoder = vaex.ml.WeightOfEvidenceEncoder(features=['embarked'], target='survived')

df = label_encoder.fit_transform(df)
df = one_hot_encoder.fit_transform(df)
df = multi_hot_encoder.fit_transform(df)
df = freq_encoder.fit_transform(df)
df = bayes_encoder.fit_transform(df)
df = woe_encoder.fit_transform(df)

df.head(5)
[6]:
# pclasssurvived name sex age sibsp parch ticket farecabin embarked boat bodyhome_dest label_encoded_embarked embarked_missing embarked_C embarked_Q embarked_S embarked_0 embarked_1 embarked_2 frequency_encoded_embarked mean_encoded_embarked woe_encoded_embarked
0 1True Allen, Miss. Elisabeth Walton female29 0 0 24160211.338B5 S 2 nanSt Louis, MO 1 0 0 0 1 1 0 0 0.698243 0.337472 -0.696431
1 1True Allison, Master. Hudson Trevor male 0.9167 1 2 113781151.55 C22 C26S 11 nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 1 0 0 0.698243 0.337472 -0.696431
2 1False Allison, Miss. Helen Loraine female 2 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 1 0 0 0.698243 0.337472 -0.696431
3 1False Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781151.55 C22 C26S -- 135Montreal, PQ / Chesterville, ON 1 0 0 0 1 1 0 0 0.698243 0.337472 -0.696431
4 1False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 1 0 0 0.698243 0.337472 -0.696431

Notice that the transformed features are all included in the resulting DataFrame and are appropriately named. This is excellent for the construction of various diagnostic plots, and engineering of more complex features. The fact that the resulting (encoded) features take no memory, allows one to try out or combine a variety of preprocessing steps without spending any extra memory.

Feature Engineering

KBinsDiscretizer

With the KBinsDiscretizer you can convert a continous into a discrete feature by binning the data into specified intervals. You can specify the number of bins, the strategy on how to determine their size:

  • “uniform” - all bins have equal sizes;

  • “quantile” - all bins have (approximately) the same number of samples in them;

  • “kmeans” - values in each bin belong to the same 1D cluster as determined by the KMeans algorithm.

[7]:
kbdisc = vaex.ml.KBinsDiscretizer(features=['age'], n_bins=5, strategy='quantile')
df = kbdisc.fit_transform(df)
df.head(5)
/home/jovan/vaex/packages/vaex-core/vaex/ml/transformations.py:1089: UserWarning: Bins whose width are too small (i.e., <= 1e-8) in   age are removed.Consider decreasing the number of bins.
  warnings.warn(f'Bins whose width are too small (i.e., <= 1e-8) in   {feat} are removed.'
[7]:
# pclasssurvived name sex age sibsp parch ticket farecabin embarked boat bodyhome_dest label_encoded_embarked embarked_missing embarked_C embarked_Q embarked_S frequency_encoded_embarked mean_encoded_embarked woe_encoded_embarked binned_age
0 1True Allen, Miss. Elisabeth Walton female29 0 0 24160211.338B5 S 2 nanSt Louis, MO 1 0 0 0 1 0.698243 0.337472 -0.696431 0
1 1True Allison, Master. Hudson Trevor male 0.9167 1 2 113781151.55 C22 C26S 11 nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0
2 1False Allison, Miss. Helen Loraine female 2 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0
3 1False Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781151.55 C22 C26S -- 135Montreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0
4 1False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0

GroupBy Transformer

The GroupByTransformer is a handy feature in vaex-ml that lets you perform a groupby aggregations on the training data, and then use those aggregations as features in the training and test sets.

[8]:
gbt = vaex.ml.GroupByTransformer(by='pclass', agg={'age': ['mean', 'std'],
                                                   'fare': ['mean', 'std'],
                                                  })
df = gbt.fit_transform(df)
df.head(5)
[8]:
# pclasssurvived name sex age sibsp parch ticket farecabin embarked boat bodyhome_dest label_encoded_embarked embarked_missing embarked_C embarked_Q embarked_S frequency_encoded_embarked mean_encoded_embarked woe_encoded_embarked binned_age age_mean age_std fare_mean fare_std
0 1True Allen, Miss. Elisabeth Walton female29 0 0 24160211.338B5 S 2 nanSt Louis, MO 1 0 0 0 1 0.698243 0.337472 -0.696431 0 39.1599 14.5224 87.509 80.3226
1 1True Allison, Master. Hudson Trevor male 0.9167 1 2 113781151.55 C22 C26S 11 nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0 39.1599 14.5224 87.509 80.3226
2 1False Allison, Miss. Helen Loraine female 2 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0 39.1599 14.5224 87.509 80.3226
3 1False Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781151.55 C22 C26S -- 135Montreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0 39.1599 14.5224 87.509 80.3226
4 1False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25 1 2 113781151.55 C22 C26S -- nanMontreal, PQ / Chesterville, ON 1 0 0 0 1 0.698243 0.337472 -0.696431 0 39.1599 14.5224 87.509 80.3226

CycleTransformer

The CycleTransformer provides a strategy for transforming cyclical features, such as angles or time. This is done by considering each feature to be describing a polar coordinate system, and converting it to Cartesian coorindate system. This is shown to help certain ML models to achieve better performance.

[9]:
df = vaex.from_arrays(days=[0, 1, 2, 3, 4, 5, 6])
cyctrans = vaex.ml.CycleTransformer(n=7, features=['days'])
cyctrans.fit_transform(df)
[9]:
# days days_x days_y
0 0 1 0
1 1 0.62349 0.781831
2 2-0.222521 0.974928
3 3-0.900969 0.433884
4 4-0.900969-0.433884
5 5-0.222521-0.974928
6 6 0.62349 -0.781831

Dimensionality reduction

Principal Component Analysis

The PCA implemented in vaex.ml can scale to a very large number of samples, even if that data we want to transform does not fit into RAM. To demonstrate this, let us do a PCA transformation on the Iris dataset. For this example, we have replicated this dataset thousands of times, such that it contains over 1 billion samples.

[10]:
df = vaex.datasets.iris_1e9()
n_samples = len(df)
print(f'Number of samples in DataFrame: {n_samples:,}')
Number of samples in DataFrame: 1,005,000,000
[11]:
features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
pca = vaex.ml.PCA(features=features, n_components=4)
pca.fit(df, progress='widget')

The PCA transformer implemented in vaex.ml can be fit in well under a minute, even when the data comprises 4 columns and 1 billion rows.

[12]:
df_trans = pca.transform(df)
df_trans
[12]:
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3
0 5.9 3.0 4.2 1.5 1 -0.51109806050657190.10228410590320294 0.13232789125239366 -0.05010053260756789
1 6.1 3.0 4.6 1.4 1 -0.89016044564845710.03381244269907491 -0.0097680289049917950.1534482059864868
2 6.6 2.9 4.6 1.3 1 -1.0432977809309918-0.2289569106597803 -0.41481456509035997 0.03752354509774891
3 6.7 3.3 5.7 2.1 2 -2.275853649246034 -0.3333865237191275 0.28467815436304544 0.062230281630705805
4 5.5 4.2 1.4 0.2 0 2.5971594768136956 -1.1000219282272325 0.16358191524058419 0.09895807321522321
... ... ... ... ... ... ... ... ... ...
1,004,999,9955.2 3.4 1.4 0.2 0 2.6398212682948925 -0.3192900674870881 -0.1392533720548284 -0.06514104909063131
1,004,999,9965.1 3.8 1.6 0.2 0 2.537573370908207 -0.5103675457748862 0.17191840236558648 0.19216594960009262
1,004,999,9975.8 2.6 4.0 1.2 1 -0.22887904987726520.4022576190683287 -0.22736270650701024 -0.01862045442675292
1,004,999,9985.7 3.8 1.7 0.3 0 2.199077961161723 -0.8792440894091085 -0.11452146077196179 -0.025326942106218664
1,004,999,9996.2 2.9 4.3 1.3 1 -0.6416902782168139-0.019071177408365406-0.20417287674016232 0.02050967222367117

Recall that the transformed DataFrame, which includes the PCA components, takes no extra memory.

Incremental PCA

The PCA implementation in vaex is very fast, but more so for “tall” DataFrames, i.e. DataFrames that have many rows, but not many columns. For DataFrames that have hundreds of columns, it is more efficient to use an Incremental PCA method. vaex.ml provides a convenient method that essentialy wraps sklearn.decomposition.IncrementalPCA, the fitting of which is more efficient for “wide” DataFrames.

The usage is practically identical to the regular PCA method. Consider the following example:

[13]:
n_samples = 100_000
n_columns = 50
data_dict = {f'feat_{i}': np.random.normal(0, i+1, size=n_samples) for i in range(n_columns)}
df = vaex.from_dict(data_dict)


features = df.get_column_names()
pca = vaex.ml.PCAIncremental(n_components=10, features=features, batch_size=42_000)
pca.fit(df, progress='widget')
pca.transform(df)
[13]:
# feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 feat_11 feat_12 feat_13 feat_14 feat_15 feat_16 feat_17 feat_18 feat_19 feat_20 feat_21 feat_22 feat_23 feat_24 feat_25 feat_26 feat_27 feat_28 feat_29 feat_30 feat_31 feat_32 feat_33 feat_34 feat_35 feat_36 feat_37 feat_38 feat_39 feat_40 feat_41 feat_42 feat_43 feat_44 feat_45 feat_46 feat_47 feat_48 feat_49 PCA_0 PCA_1 PCA_2 PCA_3 PCA_4 PCA_5 PCA_6 PCA_7 PCA_8 PCA_9
0 0.21916619701436382-1.1435438188965208-2.236473242690611 -8.81728920352771 1.9931414225984159 0.8289809515418928 -7.847441537857684 -5.990636964340006 0.43889103534482576-6.4855757436955965-14.48532696768287113.825392548457543 -5.5661773929038185-3.1816868599382633 27.66565101972783650.541940500115366 16.001390451665785 32.510983357481614 8.342038455860216 -1.7293759207235855-6.451472523437187 22.55340570655327 -2.543125122041264528.75425936065127 -39.487762558467345 -6.871003398404642 11.198673922236354 -86.63832306461876 -7.32368079105989237.35407351193795 23.653897939827836 39.52047029873747 42.79143756690254 -33.3810495394693 33.05317072490505 14.818285601642208 -67.03187283353228 -19.01476952180615 22.4905763733386 35.33833686808974 11.79457050704157 -86.70070654092856 25.185781359852896 20.521240128349977 19.814114866123216 78.05531698592385 10.029892443326418 -97.39820288821723 -0.9603735180566161-64.45083314406774 -67.59977551168708 9.37969253153906 -96.6057651764448 11.206098841188833 74.90790318762694 17.531645576460654 21.26591694292548 27.215113714718253-85.31326664717933 10.507088586039371
1 -0.42076958781498162.3850692704428043 -1.3661921493141755-0.57464980721204832.2588675039630703 -5.100101894797036 -0.0005433423021984177-3.0055202143012365 5.749693220009271 11.379708067727588 10.119772822286162 0.15698369211085733-10.937595546203902-31.110839874678003 -5.593388174686233-17.48851742053923519.942127063793418 -0.6804349583522779-19.03708392463745428.74230527011865 12.40206875918237 -9.990549218761593 -5.733244330514869 3.171827795840886 -43.944372783025386 -25.8820588524763123.517534442545183 -25.10463172872150417.068162563601867-26.188188765123446-17.51765346352225 -5.803234686368941 23.37461204071744 85.58386322836444 -24.84250900935848 42.2583557612343 -34.83625774127584447.25447854289113 -5.903960946365425 47.891908734840925 -9.673715993876817 -17.5774774820285274.066254744412671 -51.377913297883865-11.51987006746566810.497653831847085 16.358701536495925 -18.3914825056028029.858101501060483 -39.819369217021595-38.74298336407881 12.412960580526423 -16.79176108824452714.714058887306741 8.607153125744537 -6.384705477156807 -52.8779915958480663.667728062420572 -19.219755720289232 -16.20164176309122
2 -0.50247974091959910.9897062935454243 -1.152229281759237 -1.682033038083704 -4.091345910790923 -4.52742403771885552.129578282936375 10.936320913755608 -1.5695520680947808-6.034199421988269 -28.46431144964817 -15.32129294377632 -8.194011820344523 -16.218630438043398 12.021916867709596-4.908477966578501 -29.56619559878632 7.772108300044394 7.680046493196698 13.815505542053483 3.9208120473170016 47.34661694033482 1.544881077052938 9.440027347582042 18.56198304730558 22.3336072648248 -21.578332510459486-48.93092663572265616.5701671385727 16.656088505245513 19.8406469884787 5.384567961213235 -16.73392428744861614.376438801233908-35.323974854495155-7.411178531711759 -12.19133679331107557.91740496088699 34.873491696833774 88.28464395597479 87.65337555912684 -2.4096431528212445-7.8171455961597385-4.016403896979926 -22.96261029782406 -75.8940296403038 -38.8951677113029 -89.75675908427556 -79.5994302281645 -44.45310265105787 -42.34987503786076 -74.13417710288375 -94.54423466637282 -40.877591489278196-73.38521818144409 -14.487330945685514-6.8530939766408885-10.84894017617582-0.0388656483260952478.63468911909872
3 0.12617606561304665-0.91728226378698231.8277090696240983 -1.8883963021695365-3.26085343817413436.94314682034098 -1.964291832580844 5.476441728997025 5.985807394356193 -4.152754646002149 15.497819324027216 1.9473222994398216 -11.1546653716116812.1502221820849754 7.402217623202724 -20.974198348221123-18.49611969411084 -11.197532751079477-4.167571500828548 -16.7492676033496866.873971547452746 -22.28958212850625421.69520422160094 10.732001896726413 -24.901621899667955 13.663451847361172 40.92498717076184 62.02571061444625 97.46935359691241 1.3197202988059933 -13.355307678605655-59.98623606960067 -15.3460319107594843.85479178918432068.451030763844253 -37.3610034378942059.316605927851759 -15.936791503025487-14.200047091850191-96.04376311885646 6.793212237372706 -89.28406931570937 -6.342536181747704 9.84276729692308 -44.15480258178421 -19.716315609075178-8.963766643638541 13.328160220454095 -81.91979053839731 -58.49057458242536 -63.82740201878286 -78.04284003367316 6.898497938656784 -9.975022259994258 -24.581867540712196-43.13228076360685 5.384602201485904 -5.104240140134616-88.56822933573116 18.63888133757838
4 -1.5391949931048126-0.84243862338608713.808044749153777 -1.15040861016063344.975092670034785 -4.03814322037485956.475255733889277 -8.492789285986634 -0.71070840841147211.9868439665217876 -6.335098977847596 18.156422121050845 -3.9319838484429286-0.303888675665301 -18.038103704497153.6137256391127717 12.72102405166281 6.1797872895139765 -17.965746423694828-6.457595529218324 -11.1195782584740362.124546751440085 2.074247115486158 48.526431477044895 -47.7501423866134 -13.2189838629703170.7076755883915242 21.272708498626173 20.218314701800175-4.052289437744317 -28.29098298558251744.10471192261346 27.505033879695844 28.4585973718932739.564898635025768 -6.2001475733889375-33.28464087248315 13.562356933449957 72.47202649403566 -17.63088820680735222.257347577113283 19.793786901529828 -0.888840951088124115.45297619768772 80.01687713977846 -33.02953241445338 47.36388577265113 47.96488983389095 30.47783230830538 52.702201767487 56.4647664098084 27.388702583308334 47.716980722531005 48.86243093017444 -29.47766470897874 76.66863902366097 23.114022602360667 -3.03590434662457820.751371509793366 25.70018487608435
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99,995-1.160081518789358 -1.5967802399231468-2.15232040817518 -5.152880656063202 -2.81607683456671464.528707893808043 -9.219048918475725 -4.1152783877843895 15.434762333635224 -8.352240079142867 3.2341379115026694 7.679896402408659 19.99465474797146 -15.987822176846745 17.610005841221454-2.9940634500799996-36.9849615488119246.455731448290355 0.8700910607593357 -4.458798902046075 -8.573291238859795 1.7866347197434056 -5.748202862095839 -78.73536930217278 0.8664468950376607 -31.185290130437014-33.40360643789874548.79496517134476 4.273021608667145 -14.76645480929473223.034033698309216 47.916505903411704 22.82356373157275 58.17074570864146 13.075446180847607 5.357406097709567 19.301741918502767 30.91481630395726 -18.99658045583839429.068050048521297 -11.50032407194181 -94.16793562743486 10.247859328520715 -23.33364253340996864.88951899816107 -5.970342533069689 22.724974186922207 -46.358784230253264-76.06357310802707 36.34299568143191 34.5263251515797 -74.93722963856585 -51.83676476605647 28.086594105181963 -1.148488347990102264.59414944331482 -19.3363913041026487.146369194433403 -94.50249266159257 -11.416642775370095
99,9960.133221661855605742.0608209742055763 2.1641428725239287 -2.450274442812819 0.5729664553821341 11.655164926233269 -9.864613671442203 -4.600216494861485 10.08600220223909 5.916293624542951 14.812935982731668 -6.453293834403917 -11.90549514770099 -3.26727352515574 1.8764801411441934-20.02012175801679 20.579289884690567 -7.95774658159159 -8.387038826710807 -18.0222209635527342.692329970764943 -14.30398788132729721.66822494391352 -15.938191880312708-35.29052532512791 -8.631818482611655 9.787860087044647 -53.67539155301477 -6.29070859522252334.35010506794386 6.565193250636609 -15.486170359730892-3.031599295669413 -1.80098865175289345.55563650252154 -37.38886935392985 68.02203785140463 69.71021558546443 67.33004345391464 38.09747878907309 -15.32336767996999276.84362563371494 -35.79579407415943 -32.88316495646942 -23.620694143487448-90.01728440515039 -24.77449621235016567.92281355721133 30.03415640434173 -29.32574935340052 -21.82606452589530525.41085028514592 70.39416642353444 -29.213531794756513-90.47462518115402 -14.585892147549302-36.17160238891088 -33.2209566185244976.76852716941656 -18.539072237418367
99,9971.011157114782744 -0.80040986269630711.2571486498281934 3.8492594702419245 0.7592605926849842 -4.098302780814329 -1.9485099180060705 16.684513355922583 10.087604365608211 3.7452922672933973 -16.33173839915188 19.92199866574765 6.5771681345498845 -0.3230579773623871714.72654802079624613.583443459677845 -4.952279711617992 17.030998980346084 4.201801219449127 -3.910793205671661441.77733885408281 7.96614686571076 -39.10848664323428 -33.69630280939279 -7.463352385087283 7.458696462843669 -5.883303405785125 6.6310954865277845 -6.552748916196248-9.325031603876797 -11.7337490011325093.627520914240156 18.155090307885395 33.4073875839576 45.52621736035822 -22.938060053594263-27.364572553649534-58.35071648799318 -62.86375816449011 19.272818436422003 47.61050132614527 -11.301762317420524-82.24660966605563 16.961463120018315 13.762199024990316 9.330554417908111 -96.02479832620445 -24.711048464719337-2.078012378653908 -10.604821752483073-11.558372427683931-3.6825332773046875-23.548620629546026-95.72823548883444 15.77594599796893 14.557196623771969 15.812183077424558 -82.30672442508799-8.68501822662248 44.23079310012721
99,9980.9852518578365336 0.8203281912686264 -3.884122502896842 -0.95908400432742780.16746213933285223-0.8886763063332375-16.842052417441188 0.0198139466128886246.1752951086966466 -18.13326524831207 -0.33033598775980267.829297546305325 -10.4252625074002822.7819145440653568 1.158097590630274 30.6780239575918 -23.9448164051634155.6018938249159245 -35.65399756657973 2.673171211427327 -2.90883222148649 -3.59167991497657157.002401397456594 14.353272681106485 -20.458739593063836 -47.09280369705129 25.90478920629466 1.8398979773599367 20.39037292398545 6.635600259567852 21.290136759712006 -30.6802383525156 -32.70023383447721 -28.294300515770139.030591834969087 41.28614556628407 -3.340280013558715 -6.387187312457969 -6.795058954505738 -29.239868647721906-84.84487823247701 21.53413969040578 -9.656174756794805 85.86389211836673 -54.80830511204367 -30.709179188326925-20.51621281362256680.1393974655775 -15.86831043391858 69.46209659371226 66.36652900849339 -25.10453716959171579.18237523289388 -25.577375106247562-30.87284219351464 -56.81179164164408 83.71581743144066 -9.27379265343866519.727630954137673 -85.96069547051928
99,9990.280172477999310550.8792488188373339 -2.611294241397942 -1.271843401381004 -5.583106681289557 2.0063535490559556 8.803561240522425 5.065652252075632 8.014785992140089 2.726435130640515 12.46703945978122 -0.87624409106155750.313008136552742734.259569516217728 -8.76361980315363527.42697941843017 -18.4957182932119153.2235230804059354 19.09973219172654 -21.25726264511826 -10.180990877752983-1.519950417648088522.71070295724785 29.616379288189506 -0.1316424396912179417.225907298944403 5.9791658138855075 11.74845639489894 -4.90066391424355351.065677623825266 -3.7948783924044243-32.70626521313637 -49.77902739808171 -38.9673863548757 4.223577391775786 -26.91850352108989666.81964173436637 76.24293014754961 -31.65153708363635622.893190015052674 -36.482595175686725-25.30090587669703 -10.0417262668186585.274361409552595 -34.88489743571424498.35907785706063 23.57152847224355 26.457155702616525 -86.30659590503936 12.050979659904716 3.057710144296827 -86.50100893855216 23.845662599505307 27.79510549576583 97.55955420927998 -40.44816836188145 2.789198094433643 -4.188993886405869-29.329836024823493 -40.232345894787784

Note that you need scikit-learn installed to only fit the PCAIncremental transformer. The the transform method does not rely on scikit-learn being installed.

Random projections

Random projections is another popular way of doing dimensionality reduction, especially when the dimensionality of the data is very high. vaex.ml conveniently wraps both scikit-learn.random_projection.GaussianRandomProjection and scikit-learn.random_projection.SparseRandomProjection in a single vaex.ml transformer.

[14]:
rand_proj = vaex.ml.RandomProjections(features=features, n_components=10)
rand_proj.fit(df)
rand_proj.transform(df)
[14]:
# feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 feat_11 feat_12 feat_13 feat_14 feat_15 feat_16 feat_17 feat_18 feat_19 feat_20 feat_21 feat_22 feat_23 feat_24 feat_25 feat_26 feat_27 feat_28 feat_29 feat_30 feat_31 feat_32 feat_33 feat_34 feat_35 feat_36 feat_37 feat_38 feat_39 feat_40 feat_41 feat_42 feat_43 feat_44 feat_45 feat_46 feat_47 feat_48 feat_49 random_projection_0 random_projection_1 random_projection_2 random_projection_3 random_projection_4 random_projection_5 random_projection_6 random_projection_7 random_projection_8 random_projection_9
0 0.21916619701436382-1.1435438188965208-2.236473242690611 -8.81728920352771 1.9931414225984159 0.8289809515418928 -7.847441537857684 -5.990636964340006 0.43889103534482576-6.4855757436955965-14.48532696768287113.825392548457543 -5.5661773929038185-3.1816868599382633 27.66565101972783650.541940500115366 16.001390451665785 32.510983357481614 8.342038455860216 -1.7293759207235855-6.451472523437187 22.55340570655327 -2.543125122041264528.75425936065127 -39.487762558467345 -6.871003398404642 11.198673922236354 -86.63832306461876 -7.32368079105989237.35407351193795 23.653897939827836 39.52047029873747 42.79143756690254 -33.3810495394693 33.05317072490505 14.818285601642208 -67.03187283353228 -19.01476952180615 22.4905763733386 35.33833686808974 11.79457050704157 -86.70070654092856 25.185781359852896 20.521240128349977 19.814114866123216 78.05531698592385 10.029892443326418 -97.39820288821723 -0.9603735180566161-64.45083314406774 -50.62485790513975 -8.969974902164104 -75.59787959901278 -32.23015488522056 -8.839635748773595 25.52280920491688 -67.81125847807398 20.625813141370337 -8.9492512335752 -38.397093148408445
1 -0.42076958781498162.3850692704428043 -1.3661921493141755-0.57464980721204832.2588675039630703 -5.100101894797036 -0.0005433423021984177-3.0055202143012365 5.749693220009271 11.379708067727588 10.119772822286162 0.15698369211085733-10.937595546203902-31.110839874678003 -5.593388174686233-17.48851742053923519.942127063793418 -0.6804349583522779-19.03708392463745428.74230527011865 12.40206875918237 -9.990549218761593 -5.733244330514869 3.171827795840886 -43.944372783025386 -25.8820588524763123.517534442545183 -25.10463172872150417.068162563601867-26.188188765123446-17.51765346352225 -5.803234686368941 23.37461204071744 85.58386322836444 -24.84250900935848 42.2583557612343 -34.83625774127584447.25447854289113 -5.903960946365425 47.891908734840925 -9.673715993876817 -17.5774774820285274.066254744412671 -51.377913297883865-11.51987006746566810.497653831847085 16.358701536495925 -18.3914825056028029.858101501060483 -39.819369217021595-24.167592671736728 -83.6194525409906 -31.474566122257382 -53.51874280599636 -9.295953556730474 12.065310248051029 21.935134361477004 -72.0479982398111 -66.96195351258001 76.22398276816658
2 -0.50247974091959910.9897062935454243 -1.152229281759237 -1.682033038083704 -4.091345910790923 -4.52742403771885552.129578282936375 10.936320913755608 -1.5695520680947808-6.034199421988269 -28.46431144964817 -15.32129294377632 -8.194011820344523 -16.218630438043398 12.021916867709596-4.908477966578501 -29.56619559878632 7.772108300044394 7.680046493196698 13.815505542053483 3.9208120473170016 47.34661694033482 1.544881077052938 9.440027347582042 18.56198304730558 22.3336072648248 -21.578332510459486-48.93092663572265616.5701671385727 16.656088505245513 19.8406469884787 5.384567961213235 -16.73392428744861614.376438801233908-35.323974854495155-7.411178531711759 -12.19133679331107557.91740496088699 34.873491696833774 88.28464395597479 87.65337555912684 -2.4096431528212445-7.8171455961597385-4.016403896979926 -22.96261029782406 -75.8940296403038 -38.8951677113029 -89.75675908427556 -79.5994302281645 -44.45310265105787 -30.370561351797924 -69.21024877654797 -131.21336032017504 -23.81397986098913 90.48694640695885 27.981469036784446 -71.13131857248655 -165.47320481693575 30.36401943353085 -37.55586272094929
3 0.12617606561304665-0.91728226378698231.8277090696240983 -1.8883963021695365-3.26085343817413436.94314682034098 -1.964291832580844 5.476441728997025 5.985807394356193 -4.152754646002149 15.497819324027216 1.9473222994398216 -11.1546653716116812.1502221820849754 7.402217623202724 -20.974198348221123-18.49611969411084 -11.197532751079477-4.167571500828548 -16.7492676033496866.873971547452746 -22.28958212850625421.69520422160094 10.732001896726413 -24.901621899667955 13.663451847361172 40.92498717076184 62.02571061444625 97.46935359691241 1.3197202988059933 -13.355307678605655-59.98623606960067 -15.3460319107594843.85479178918432068.451030763844253 -37.3610034378942059.316605927851759 -15.936791503025487-14.200047091850191-96.04376311885646 6.793212237372706 -89.28406931570937 -6.342536181747704 9.84276729692308 -44.15480258178421 -19.716315609075178-8.963766643638541 13.328160220454095 -81.91979053839731 -58.49057458242536 125.12748803342656 -25.206573635553035 61.805492059522535 15.847357808911099 -76.71575173832926 86.50353271166043 86.55719953897724 64.19018426217575 -109.12935339038033 -76.8186950536783
4 -1.5391949931048126-0.84243862338608713.808044749153777 -1.15040861016063344.975092670034785 -4.03814322037485956.475255733889277 -8.492789285986634 -0.71070840841147211.9868439665217876 -6.335098977847596 18.156422121050845 -3.9319838484429286-0.303888675665301 -18.038103704497153.6137256391127717 12.72102405166281 6.1797872895139765 -17.965746423694828-6.457595529218324 -11.1195782584740362.124546751440085 2.074247115486158 48.526431477044895 -47.7501423866134 -13.2189838629703170.7076755883915242 21.272708498626173 20.218314701800175-4.052289437744317 -28.29098298558251744.10471192261346 27.505033879695844 28.4585973718932739.564898635025768 -6.2001475733889375-33.28464087248315 13.562356933449957 72.47202649403566 -17.63088820680735222.257347577113283 19.793786901529828 -0.888840951088124115.45297619768772 80.01687713977846 -33.02953241445338 47.36388577265113 47.96488983389095 30.47783230830538 52.702201767487 9.100443729937155 -98.2487363365348 -86.04861549617408 -10.27966060169664 57.67907962932948 -74.56592607052885 -16.669282052441403 -26.583518157157688 47.49051485779235 178.45202653205695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99,995-1.160081518789358 -1.5967802399231468-2.15232040817518 -5.152880656063202 -2.81607683456671464.528707893808043 -9.219048918475725 -4.1152783877843895 15.434762333635224 -8.352240079142867 3.2341379115026694 7.679896402408659 19.99465474797146 -15.987822176846745 17.610005841221454-2.9940634500799996-36.9849615488119246.455731448290355 0.8700910607593357 -4.458798902046075 -8.573291238859795 1.7866347197434056 -5.748202862095839 -78.73536930217278 0.8664468950376607 -31.185290130437014-33.40360643789874548.79496517134476 4.273021608667145 -14.76645480929473223.034033698309216 47.916505903411704 22.82356373157275 58.17074570864146 13.075446180847607 5.357406097709567 19.301741918502767 30.91481630395726 -18.99658045583839429.068050048521297 -11.50032407194181 -94.16793562743486 10.247859328520715 -23.33364253340996864.88951899816107 -5.970342533069689 22.724974186922207 -46.358784230253264-76.06357310802707 36.34299568143191 79.74173570372625 -120.99425995411295 -158.6863110682003 51.08724948440816 45.49604758883528 -92.51884988772696 -33.86586167918684 -110.19228327900962 10.471099356215348 95.03245666604596
99,9960.133221661855605742.0608209742055763 2.1641428725239287 -2.450274442812819 0.5729664553821341 11.655164926233269 -9.864613671442203 -4.600216494861485 10.08600220223909 5.916293624542951 14.812935982731668 -6.453293834403917 -11.90549514770099 -3.26727352515574 1.8764801411441934-20.02012175801679 20.579289884690567 -7.95774658159159 -8.387038826710807 -18.0222209635527342.692329970764943 -14.30398788132729721.66822494391352 -15.938191880312708-35.29052532512791 -8.631818482611655 9.787860087044647 -53.67539155301477 -6.29070859522252334.35010506794386 6.565193250636609 -15.486170359730892-3.031599295669413 -1.80098865175289345.55563650252154 -37.38886935392985 68.02203785140463 69.71021558546443 67.33004345391464 38.09747878907309 -15.32336767996999276.84362563371494 -35.79579407415943 -32.88316495646942 -23.620694143487448-90.01728440515039 -24.77449621235016567.92281355721133 30.03415640434173 -29.32574935340052 12.801266126889404 17.612236115044166 -31.111396519869256 -160.72849754950767 6.480988179687637 4.231265515946373 -52.555790176785194 -65.21246117529064 35.89601203569984 127.45678271483702
99,9971.011157114782744 -0.80040986269630711.2571486498281934 3.8492594702419245 0.7592605926849842 -4.098302780814329 -1.9485099180060705 16.684513355922583 10.087604365608211 3.7452922672933973 -16.33173839915188 19.92199866574765 6.5771681345498845 -0.3230579773623871714.72654802079624613.583443459677845 -4.952279711617992 17.030998980346084 4.201801219449127 -3.910793205671661441.77733885408281 7.96614686571076 -39.10848664323428 -33.69630280939279 -7.463352385087283 7.458696462843669 -5.883303405785125 6.6310954865277845 -6.552748916196248-9.325031603876797 -11.7337490011325093.627520914240156 18.155090307885395 33.4073875839576 45.52621736035822 -22.938060053594263-27.364572553649534-58.35071648799318 -62.86375816449011 19.272818436422003 47.61050132614527 -11.301762317420524-82.24660966605563 16.961463120018315 13.762199024990316 9.330554417908111 -96.02479832620445 -24.711048464719337-2.078012378653908 -10.604821752483073-2.4863267734391865 -10.434958342024952 -37.55392055999496 6.171867513827003 -29.256283776632728 -72.71591584878013 40.24611847925469 -102.31580552627864 -14.905953231227388 -11.740055851590997
99,9980.9852518578365336 0.8203281912686264 -3.884122502896842 -0.95908400432742780.16746213933285223-0.8886763063332375-16.842052417441188 0.0198139466128886246.1752951086966466 -18.13326524831207 -0.33033598775980267.829297546305325 -10.4252625074002822.7819145440653568 1.158097590630274 30.6780239575918 -23.9448164051634155.6018938249159245 -35.65399756657973 2.673171211427327 -2.90883222148649 -3.59167991497657157.002401397456594 14.353272681106485 -20.458739593063836 -47.09280369705129 25.90478920629466 1.8398979773599367 20.39037292398545 6.635600259567852 21.290136759712006 -30.6802383525156 -32.70023383447721 -28.294300515770139.030591834969087 41.28614556628407 -3.340280013558715 -6.387187312457969 -6.795058954505738 -29.239868647721906-84.84487823247701 21.53413969040578 -9.656174756794805 85.86389211836673 -54.80830511204367 -30.709179188326925-20.51621281362256680.1393974655775 -15.86831043391858 69.46209659371226 -70.00012029923253 198.0368255008663 129.3714720510582 30.652606384505287 -65.3920698996377 49.51640293990293 11.882703005485045 93.26651618256129 35.206089617027985 -61.77494520916369
99,9990.280172477999310550.8792488188373339 -2.611294241397942 -1.271843401381004 -5.583106681289557 2.0063535490559556 8.803561240522425 5.065652252075632 8.014785992140089 2.726435130640515 12.46703945978122 -0.87624409106155750.313008136552742734.259569516217728 -8.76361980315363527.42697941843017 -18.4957182932119153.2235230804059354 19.09973219172654 -21.25726264511826 -10.180990877752983-1.519950417648088522.71070295724785 29.616379288189506 -0.1316424396912179417.225907298944403 5.9791658138855075 11.74845639489894 -4.90066391424355351.065677623825266 -3.7948783924044243-32.70626521313637 -49.77902739808171 -38.9673863548757 4.223577391775786 -26.91850352108989666.81964173436637 76.24293014754961 -31.65153708363635622.893190015052674 -36.482595175686725-25.30090587669703 -10.0417262668186585.274361409552595 -34.88489743571424498.35907785706063 23.57152847224355 26.457155702616525 -86.30659590503936 12.050979659904716 45.50866581430373 33.59123204918983 66.48747993035953 93.58220327847411 -113.34727146050997 34.20894130389669 94.5050429333418 98.6447663145478 -42.700555543235716 -3.632586769281134

Clustering

K-Means

vaex.ml implements a fast and scalable K-Means clustering algorithm. The usage is similar to that of scikit-learn.

[15]:
import vaex.ml.cluster

df = vaex.datasets.iris()

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
kmeans = vaex.ml.cluster.KMeans(features=features, n_clusters=3, max_iter=100, verbose=True, random_state=42)
kmeans.fit(df)

df_trans = kmeans.transform(df)
df_trans
Iteration    0, inertia  519.0500000000001
Iteration    1, inertia  156.70447116074328
Iteration    2, inertia  88.70688235734133
Iteration    3, inertia  80.23054939305554
Iteration    4, inertia  79.28654263977778
Iteration    5, inertia  78.94084142614601
Iteration    6, inertia  78.94084142614601
[15]:
# sepal_length sepal_width petal_length petal_width class_ prediction_kmeans
0 5.9 3.0 4.2 1.5 1 0
1 6.1 3.0 4.6 1.4 1 0
2 6.6 2.9 4.6 1.3 1 0
3 6.7 3.3 5.7 2.1 2 1
4 5.5 4.2 1.4 0.2 0 2
... ... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0 2
1465.1 3.8 1.6 0.2 0 2
1475.8 2.6 4.0 1.2 1 0
1485.7 3.8 1.7 0.3 0 2
1496.2 2.9 4.3 1.3 1 0

K-Means is an unsupervised algorithm, meaning that the predicted cluster labels in the transformed dataset do not necessarily correspond to the class label. We can map the predicted cluster identifiers to match the class labels, making it easier to construct diagnostic plots.

[16]:
df_trans['predicted_kmean_map'] = df_trans.prediction_kmeans.map(mapper={0: 1, 1: 2, 2: 0})
df_trans
[16]:
# sepal_length sepal_width petal_length petal_width class_ prediction_kmeans predicted_kmean_map
0 5.9 3.0 4.2 1.5 1 0 1
1 6.1 3.0 4.6 1.4 1 0 1
2 6.6 2.9 4.6 1.3 1 0 1
3 6.7 3.3 5.7 2.1 2 1 2
4 5.5 4.2 1.4 0.2 0 2 0
... ... ... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0 2 0
1465.1 3.8 1.6 0.2 0 2 0
1475.8 2.6 4.0 1.2 1 0 1
1485.7 3.8 1.7 0.3 0 2 0
1496.2 2.9 4.3 1.3 1 0 1

Now we can construct simple scatter plots, and see that in the case of the Iris dataset, K-Means does a pretty good job splitting the data into 3 classes.

[17]:
fig = plt.figure(figsize=(12, 5))

plt.subplot(121)
df_trans.scatter(df_trans.petal_length, df_trans.petal_width, c_expr=df_trans.class_)
plt.title('Original classes')

plt.subplot(122)
df_trans.scatter(df_trans.petal_length, df_trans.petal_width, c_expr=df_trans.predicted_kmean_map)
plt.title('Predicted classes')

plt.tight_layout()
plt.show()
/home/jovan/vaex/packages/vaex-core/vaex/viz/mpl.py:205: UserWarning: `scatter` is deprecated and it will be removed in version 5.x. Please use `df.viz.scatter` instead.
  warnings.warn('`scatter` is deprecated and it will be removed in version 5.x. Please use `df.viz.scatter` instead.')
_images/tutorial_ml_35_1.png

As with any algorithm implemented in vaex.ml, K-Means can be used on billions of samples. Fitting takes under 2 minutes when applied on the oversampled Iris dataset, numbering over 1 billion samples.

[18]:
df = vaex.datasets.iris_1e9()
n_samples = len(df)
print(f'Number of samples in DataFrame: {n_samples:,}')
Number of samples in DataFrame: 1,005,000,000
[19]:
%%time

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
kmeans = vaex.ml.cluster.KMeans(features=features, n_clusters=3, max_iter=100, verbose=True, random_state=31)
kmeans.fit(df)
Iteration    0, inertia  838974000.0037192
Iteration    1, inertia  535903134.000306
Iteration    2, inertia  530190921.4848897
Iteration    3, inertia  528931941.03372437
Iteration    4, inertia  528931941.0337243
CPU times: user 2min 37s, sys: 1.26 s, total: 2min 39s
Wall time: 19.9 s

Supervised learning

While vaex.ml does not yet implement any supervised machine learning models, it does provide wrappers to several popular libraries such as scikit-learn, XGBoost, LightGBM and CatBoost.

The main benefit of these wrappers is that they turn the models into vaex.ml transformers. This means the models become part of the DataFrame state and thus can be serialized, and their predictions can be returned as virtual columns. This is especially useful for creating various diagnostic plots and evaluating performance metrics at no memory cost, as well as building ensembles.

Scikit-Learn example

The vaex.ml.sklearn module provides convenient wrappers to the scikit-learn estimators. In fact, these wrappers can be used with any library that follows the API convention established by scikit-learn, i.e. implements the .fit and .transform methods.

Here is an example:

[20]:
from vaex.ml.sklearn import Predictor
from sklearn.ensemble import GradientBoostingClassifier

df = vaex.datasets.iris()

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

model = GradientBoostingClassifier(random_state=42)
vaex_model = Predictor(features=features, target=target, model=model, prediction_name='prediction')

vaex_model.fit(df=df)

df = vaex_model.transform(df)
df
[20]:
# sepal_length sepal_width petal_length petal_width class_ prediction
0 5.9 3.0 4.2 1.5 1 1
1 6.1 3.0 4.6 1.4 1 1
2 6.6 2.9 4.6 1.3 1 1
3 6.7 3.3 5.7 2.1 2 2
4 5.5 4.2 1.4 0.2 0 0
... ... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0 0
1465.1 3.8 1.6 0.2 0 0
1475.8 2.6 4.0 1.2 1 1
1485.7 3.8 1.7 0.3 0 0
1496.2 2.9 4.3 1.3 1 1

One can still train a predictive model on datasets that are too big to fit into memory by leveraging the on-line learners provided by scikit-learn. The vaex.ml.sklearn.IncrementalPredictor conveniently wraps these learners and provides control on how the data is passed to them from a vaex DataFrame.

Let us train a model on the oversampled Iris dataset which comprises over 1 billion samples.

[21]:
from vaex.ml.sklearn import IncrementalPredictor
from sklearn.linear_model import SGDClassifier

df = vaex.datasets.iris_1e9()

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

model = SGDClassifier(learning_rate='constant', eta0=0.0001, random_state=42)
vaex_model = IncrementalPredictor(features=features, target=target, model=model,
                                  batch_size=500_000, partial_fit_kwargs={'classes':[0, 1, 2]})

vaex_model.fit(df=df, progress='widget')

df = vaex_model.transform(df)
df
[21]:
# sepal_length sepal_width petal_length petal_width class_ prediction
0 5.9 3.0 4.2 1.5 1 1
1 6.1 3.0 4.6 1.4 1 1
2 6.6 2.9 4.6 1.3 1 1
3 6.7 3.3 5.7 2.1 2 2
4 5.5 4.2 1.4 0.2 0 0
... ... ... ... ... ... ...
1,004,999,9955.2 3.4 1.4 0.2 0 0
1,004,999,9965.1 3.8 1.6 0.2 0 0
1,004,999,9975.8 2.6 4.0 1.2 1 1
1,004,999,9985.7 3.8 1.7 0.3 0 0
1,004,999,9996.2 2.9 4.3 1.3 1 1

XGBoost example

Libraries such as XGBoost provide more options such as validation during training and early stopping for example. We provide wrappers that keeps close to the native API of these libraries, in addition to the scikit-learn API.

While the following example showcases the XGBoost wrapper, vaex.ml implements similar wrappers for LightGBM and CatBoost.

[22]:
from vaex.ml.xgboost import XGBoostModel

df = vaex.datasets.iris_1e5()
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

params = {'learning_rate': 0.1,
          'max_depth': 3,
          'num_class': 3,
          'objective': 'multi:softmax',
          'subsample': 1,
          'random_state': 42,
          'n_jobs': -1}


booster = XGBoostModel(features=features, target=target, num_boost_round=500, params=params)
booster.fit(df=df_train, evals=[(df_train, 'train'), (df_test, 'test')], early_stopping_rounds=5)

df_test = booster.transform(df_train)
df_test
[13:41:31] WARNING: /home/conda/feedstock_root/build_artifacts/xgboost_1607604574104/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[22]:
# sepal_length sepal_width petal_length petal_width class_ xgboost_prediction
0 5.9 3.0 4.2 1.5 1 1.0
1 6.1 3.0 4.6 1.4 1 1.0
2 6.6 2.9 4.6 1.3 1 1.0
3 6.7 3.3 5.7 2.1 2 2.0
4 5.5 4.2 1.4 0.2 0 0.0
... ... ... ... ... ... ...
80,3955.2 3.4 1.4 0.2 0 0.0
80,3965.1 3.8 1.6 0.2 0 0.0
80,3975.8 2.6 4.0 1.2 1 1.0
80,3985.7 3.8 1.7 0.3 0 0.0
80,3996.2 2.9 4.3 1.3 1 1.0

CatBoost example

The CatBoost library supports summing up models. With this feature, we can use CatBoost to train a model using data that is otherwise too large to fit in memory. The idea is to train a single CatBoost model per chunk of data, and than sum up the invidiual models to create a master model. To use this feature via vaex.ml just specify the batch_size argument in the CatBoostModel wrapper. One can also specify additional options such as the strategy on how to sum up the individual models, or how they should be weighted.

[23]:
from vaex.ml.catboost import CatBoostModel

df = vaex.datasets.iris_1e8()
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

params = {
    'leaf_estimation_method': 'Gradient',
    'learning_rate': 0.1,
    'max_depth': 3,
    'bootstrap_type': 'Bernoulli',
    'subsample': 0.8,
    'sampling_frequency': 'PerTree',
    'colsample_bylevel': 0.8,
    'reg_lambda': 1,
    'objective': 'MultiClass',
    'eval_metric': 'MultiClass',
    'random_state': 42,
    'verbose': 0,
}

booster = CatBoostModel(features=features, target=target, num_boost_round=23,
                        params=params, prediction_type='Class', batch_size=11_000_000)
booster.fit(df=df_train, progress='widget')

df_test = booster.transform(df_train)
df_test
[23]:
# sepal_length sepal_width petal_length petal_width class_ catboost_prediction
0 5.9 3.0 4.2 1.5 1 array([1])
1 6.1 3.0 4.6 1.4 1 array([1])
2 6.6 2.9 4.6 1.3 1 array([1])
3 6.7 3.3 5.7 2.1 2 array([2])
4 5.5 4.2 1.4 0.2 0 array([0])
... ... ... ... ... ... ...
80,399,9955.2 3.4 1.4 0.2 0 array([0])
80,399,9965.1 3.8 1.6 0.2 0 array([0])
80,399,9975.8 2.6 4.0 1.2 1 array([1])
80,399,9985.7 3.8 1.7 0.3 0 array([0])
80,399,9996.2 2.9 4.3 1.3 1 array([1])

Keras example

Keras is the most popular high-level API to building neural network models with tensorflow as its backend. Neural networks can have very diverse and complicated architectures, and their training loops can be both simple and sophisticated. This is why, at least for now, we leave the users to train their keras models as they normaly would, and in vaex-ml provides a simple wrapper for serialization and lazy evaluation of those models. In addition, vaex-ml also provides a convenience method to turn a DataFrame into a generator, suitable for training of Keras models. See the example below.

[24]:
import vaex.ml.tensorflow
import tensorflow.keras as K

df = vaex.example()
df_train, df_valid, df_test = df.split_random([0.8, 0.1, 0.1], random_state=42)

features = ['x', 'y', 'z', 'vx', 'vy', 'vz']
target = 'FeH'

# Scaling the features
df_train = df_train.ml.minmax_scaler(features=features)
features = df_train.get_column_names(regex='^minmax_')

# Apply preprocessing to the validation
state_prep = df_train.state_get()
df_valid.state_set(state_prep)

# Generators for the train and validation sets
gen_train = df_train.ml.tensorflow.to_keras_generator(features=features, target=target, batch_size=512)
gen_valid = df_valid.ml.tensorflow.to_keras_generator(features=features, target=target, batch_size=512)

# Create and fit a simple Sequential Keras model
nn_model = K.Sequential()
nn_model.add(K.layers.Dense(3, activation='tanh'))
nn_model.add(K.layers.Dense(1, activation='linear'))
nn_model.compile(optimizer='sgd', loss='mse')
nn_model.fit(x=gen_train, validation_data=gen_valid, epochs=11, steps_per_epoch=516, validation_steps=65)

# Serialize the model
keras_model = vaex.ml.tensorflow.KerasModel(features=features, prediction_name='keras_pred', model=nn_model)
df_train = keras_model.transform(df_train)

# Apply all the transformations to the test set
state = df_train.state_get()
df_test.state_set(state)

# Preview the results
df_test.head(5)
2021-08-14 23:47:55.800260: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-14 23:47:55.800282: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Recommended "steps_per_epoch" arg: 516.0
Recommended "steps_per_epoch" arg: 65.0
2021-08-14 23:47:57.111408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 23:47:57.111910: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.111974: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112032: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112093: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112150: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112261: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112317: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-08-14 23:47:57.112327: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-08-14 23:47:57.112682: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/11
 11/516 [..............................] - ETA: 2s - loss: 1.7922
2021-08-14 23:47:57.326751: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
516/516 [==============================] - 3s 6ms/step - loss: 0.2172 - val_loss: 0.1724
Epoch 2/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1736 - val_loss: 0.1715
Epoch 3/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1729 - val_loss: 0.1705
Epoch 4/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1725 - val_loss: 0.1707
Epoch 5/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1722 - val_loss: 0.1708
Epoch 6/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1720 - val_loss: 0.1701
Epoch 7/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1718 - val_loss: 0.1697
Epoch 8/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1717 - val_loss: 0.1706
Epoch 9/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1715 - val_loss: 0.1698
Epoch 10/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1714 - val_loss: 0.1702
Epoch 11/11
516/516 [==============================] - 3s 6ms/step - loss: 0.1713 - val_loss: 0.1701
INFO:tensorflow:Assets written to: /tmp/tmp14gsptzz/assets
2021-08-14 23:48:31.519641: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[24]:
# id x y z vx vy vz E L Lz FeH minmax_scaled_x minmax_scaled_y minmax_scaled_z minmax_scaled_vx minmax_scaled_vy minmax_scaled_vzkeras_pred
0 23 0.137403-5.07974 1.40165 111.828 62.8776 -88.121 -134786 700.236 576.698-1.7935 0.375163 0.72055 0.397008 0.570648 0.56065 0.414253array([-1.6143968], dtype=float32)
1 31-1.95543 -0.840676 1.26239 -259.282 20.8279-148.457 -134990 676.813-258.7 -0.623007 0.365132 0.738746 0.395427 0.266912 0.5249 0.357964array([-1.509573], dtype=float32)
2 22 2.33077 -0.570014 0.761285 -53.4566-43.377 -71.3196-177062 196.209-131.573-0.889463 0.385676 0.739908 0.389737 0.43537 0.470313 0.429927array([-1.5752358], dtype=float32)
3 26 0.777881-2.83258 0.0797214 256.427 202.451 -12.76 -125176 884.581 883.833-1.65996 0.378233 0.730196 0.381998 0.688994 0.679314 0.484558array([-1.6558373], dtype=float32)
4 1 3.37429 2.62885 -0.797169 300.697 153.772 83.9173 -97150.4681.868-271.616-1.6496 0.390678 0.753639 0.372041 0.725228 0.637928 0.574749array([-1.6719546], dtype=float32)

River example

River is an up-and-coming library for online learning, and provides a variety of models that can learn incrementally. While most of the river models currently support per-sample training, few do support mini-batch training which is extremely fast - a great synergy to do machine learning with vaex.

[25]:
from vaex.ml.incubator.river import RiverModel
from river.linear_model import LinearRegression
from river import optim


df = vaex.datasets.iris_1e9()
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

river_model = RiverModel(features=features,
                         target=target,
                         model=LinearRegression(optimizer=optim.SGD(0.001), intercept_lr=0.001),
                         prediction_name='prediction_raw',
                         batch_size=500_000)
river_model.fit(df_train, progress='widget')
river_model.transform(df_test)
[25]:
# sepal_length sepal_width petal_length petal_width class_ prediction_raw
0 5.9 3.0 4.2 1.5 1 1.2262451850482554
1 6.1 3.0 4.6 1.4 1 1.3372106202149072
2 6.6 2.9 4.6 1.3 1 1.3080263625894342
3 6.7 3.3 5.7 2.1 2 1.8246442870772779
4 5.5 4.2 1.4 0.2 0 -0.1719159051653813
... ... ... ... ... ... ...
200,999,9955.2 3.4 1.4 0.2 0 -0.06961837848289065
200,999,9965.1 3.8 1.6 0.2 0 -0.04133966888449841
200,999,9975.8 2.6 4.0 1.2 1 1.1380612859534056
200,999,9985.7 3.8 1.7 0.3 0 -0.005633275295105093
200,999,9996.2 2.9 4.3 1.3 1 1.2171097577656713

Metrics

vaex-ml also provides several of the most common evaluation metrics for classification and regression tasks. These metrics are implemented in vaex-ml and thus are evaluated out-of-core, so you do not need to materialize the target and predicted columns.

Here is a list of the currently supported metrics:

  • Classification (binary, and macro-average for multiclass problems):

    • Accuracy

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Classification report (a convenience method, which prints out the accuracy, precision, recall, and F1-score at the same time)

    • Matthews Correlation Coeficient

  • Regression

    • Mean Absolute Error

    • Mean Squared Error

    • R2 Correlation Score

Here is a simple example:

[26]:
import vaex.ml.metrics
from sklearn.linear_model import LogisticRegression

df = vaex.datasets.iris()
df_train, df_test = df.split_random([0.8, 0.2], random_state=55)

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

model = LogisticRegression(random_state=42)
vaex_model = Predictor(features=features, target=target, model=model, prediction_name='pred')

vaex_model.fit(df=df_train)

df_test = vaex_model.transform(df_test)

print(df_test.ml.metrics.classification_report(df_test.class_, df_test.pred, average='macro'))

        Classification report:

        Accuracy:  0.933
        Precision: 0.928
        Recall:    0.928
        F1:        0.928

/home/jovan/vaex/packages/vaex-core/vaex/dataframe.py:5516: UserWarning: It seems your column class_ is already ordinal encoded (values between 0 and 2), automatically switching to use df.categorize
  warnings.warn(f'It seems your column {column} is already ordinal encoded (values between {min_value} and {max_value}), automatically switching to use df.categorize')
/home/jovan/vaex/packages/vaex-core/vaex/dataframe.py:5516: UserWarning: It seems your column pred is already ordinal encoded (values between 0 and 2), automatically switching to use df.categorize
  warnings.warn(f'It seems your column {column} is already ordinal encoded (values between {min_value} and {max_value}), automatically switching to use df.categorize')

State transfer - pipelines made easy

Each vaex DataFrame consists of two parts: data and state. The data is immutable, and any operation such as filtering, adding new columns, or applying transformers or predictive models just modifies the state. This is extremely powerful concept and can completely redefine how we imagine machine learning pipelines.

As an example, let us once again create a model based on the Iris dataset. Here, we will create a couple of new features, do a PCA transformation, and finally train a predictive model.

[27]:
# Load data and split it in train and test sets
df = vaex.datasets.iris()
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

# Create new features
df_train['petal_ratio'] = df_train.petal_length / df_train.petal_width
df_train['sepal_ratio'] = df_train.sepal_length / df_train.sepal_width

# Do a PCA transformation
features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width', 'petal_ratio', 'sepal_ratio']
pca = vaex.ml.PCA(features=features, n_components=6)
df_train = pca.fit_transform(df_train)

# Display the training DataFrame at this stage
df_train
[27]:
# sepal_length sepal_width petal_length petal_width class_ petal_ratio sepal_ratio PCA_0 PCA_1 PCA_2 PCA_3 PCA_4 PCA_5
0 5.4 3.0 4.5 1.5 1 3.0 1.8 -1.510547480171215 0.3611524321126822 -0.4005106138591812 0.5491844107628985 0.21135370342329635 -0.009542243224854377
1 4.8 3.4 1.6 0.2 0 8.0 1.411764705882353 4.447550641536847 0.2799644730487585 -0.04904458661276928 0.18719360579644695 0.10928493945448532 0.005228919010020094
2 6.9 3.1 4.9 1.5 1 3.266666666666667 2.2258064516129035-1.777649528149752 -0.60828897708458910.48007833550651513 -0.377620118668313350.05174472701894024 -0.04673816474220924
3 4.4 3.2 1.3 0.2 0 6.5 1.375 3.400548263702555 1.437036928591846 -0.3662652846960042 0.23420836198441913 0.05750021481634099 -0.023055011653267066
4 5.6 2.8 4.9 2.0 2 2.45 2.0 -2.32450987662220940.14710673877401348-0.5150809942258257 0.5471824391426298 -0.12154714382375817 0.0044686197532133876
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1155.2 3.4 1.4 0.2 0 6.999999999999999 1.52941176470588253.623794583238953 0.8255759252729563 0.23453320686724874 -0.17599408825208826-0.04687036865354327 -0.02424621891240747
1165.1 3.8 1.6 0.2 0 8.0 1.34210526315789474.42115266246093 0.222875055336637040.4450642830179705 0.2184424557783562 0.14504752606375293 0.07229123907677276
1175.8 2.6 4.0 1.2 1 3.33333333333333352.230769230769231 -1.069062832993727 0.3874258314654399 -0.4471767749236783 -0.2956609879568117 -0.0010695982441835394-0.0065225306610744715
1185.7 3.8 1.7 0.3 0 5.666666666666667 1.50000000000000022.2846521048417037 1.1920826609681359 0.8273738848637026 -0.210489464627257370.03381892388998425 0.018792165273013528
1196.2 2.9 4.3 1.3 1 3.30769230769230752.137931034482759 -1.29882299587484520.06960434514054464-0.0012167985718341268-0.240722552191808830.05282732890885841 -0.032459999314411514

At this point, we are ready to train a predictive model. In this example, let’s use LightGBM with its scikit-learn API.

[28]:
import lightgbm

features = df_train.get_column_names(regex='^PCA')

booster = lightgbm.LGBMClassifier()

vaex_model = Predictor(model=booster, features=features, target='class_')

vaex_model.fit(df=df_train)
df_train = vaex_model.transform(df_train)

df_train
[28]:
# sepal_length sepal_width petal_length petal_width class_ petal_ratio sepal_ratio PCA_0 PCA_1 PCA_2 PCA_3 PCA_4 PCA_5 prediction
0 5.4 3.0 4.5 1.5 1 3.0 1.8 -1.510547480171215 0.3611524321126822 -0.4005106138591812 0.5491844107628985 0.21135370342329635 -0.009542243224854377 1
1 4.8 3.4 1.6 0.2 0 8.0 1.411764705882353 4.447550641536847 0.2799644730487585 -0.04904458661276928 0.18719360579644695 0.10928493945448532 0.005228919010020094 0
2 6.9 3.1 4.9 1.5 1 3.266666666666667 2.2258064516129035-1.777649528149752 -0.60828897708458910.48007833550651513 -0.377620118668313350.05174472701894024 -0.04673816474220924 1
3 4.4 3.2 1.3 0.2 0 6.5 1.375 3.400548263702555 1.437036928591846 -0.3662652846960042 0.23420836198441913 0.05750021481634099 -0.023055011653267066 0
4 5.6 2.8 4.9 2.0 2 2.45 2.0 -2.32450987662220940.14710673877401348-0.5150809942258257 0.5471824391426298 -0.12154714382375817 0.0044686197532133876 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1155.2 3.4 1.4 0.2 0 6.999999999999999 1.52941176470588253.623794583238953 0.8255759252729563 0.23453320686724874 -0.17599408825208826-0.04687036865354327 -0.02424621891240747 0
1165.1 3.8 1.6 0.2 0 8.0 1.34210526315789474.42115266246093 0.222875055336637040.4450642830179705 0.2184424557783562 0.14504752606375293 0.07229123907677276 0
1175.8 2.6 4.0 1.2 1 3.33333333333333352.230769230769231 -1.069062832993727 0.3874258314654399 -0.4471767749236783 -0.2956609879568117 -0.0010695982441835394-0.00652253066107447151
1185.7 3.8 1.7 0.3 0 5.666666666666667 1.50000000000000022.2846521048417037 1.1920826609681359 0.8273738848637026 -0.210489464627257370.03381892388998425 0.018792165273013528 0
1196.2 2.9 4.3 1.3 1 3.30769230769230752.137931034482759 -1.29882299587484520.06960434514054464-0.0012167985718341268-0.240722552191808830.05282732890885841 -0.032459999314411514 1

The final df_train DataFrame contains all the features we created, including the predictions right at the end. Now, we would like to apply the same transformations to the test set. All we need to do, is to simply extract the state from df_train and apply it to df_test. This will propagate all the changes that were made to the training set on the test set.

[29]:
state = df_train.state_get()

df_test.state_set(state)
df_test
[29]:
# sepal_length sepal_width petal_length petal_width class_ petal_ratio sepal_ratio PCA_0 PCA_1 PCA_2 PCA_3 PCA_4 PCA_5 prediction
0 5.9 3.0 4.2 1.5 1 2.80000000000000031.9666666666666668-1.642627940409072 0.49931302910747727 -0.063088008066644660.10842057110641677 -0.03924298664189224-0.0273944397002728221
1 6.1 3.0 4.6 1.4 1 3.28571428571428562.033333333333333 -1.445047446393471 -0.1019091578746504 -0.018990122394938010.0209807676460904080.1614215276667148 -0.02716639637934938 1
2 6.6 2.9 4.6 1.3 1 3.538461538461538 2.2758620689655173-1.330564613235537 -0.419784747491312670.1759590589290671 -0.4631301992308477 0.08304243689815374 -0.0333517336774292741
3 6.7 3.3 5.7 2.1 2 2.71428571428571442.0303030303030303-2.6719170661531013-0.9149428897499291 0.4156162725009377 0.34633692661436644 0.03742964707590906 -0.0132542861962457742
4 5.5 4.2 1.4 0.2 0 6.999999999999999 1.30952380952380953.6322930267831404 0.8198526437905096 1.046277579362938 0.09738737839850209 0.09412658096734221 0.1329137026697501 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255.5 2.5 4.0 1.3 1 3.07692307692307662.2 -1.25231200886008960.5975071562677784 -0.7019801415469216 -0.11489031841855571-0.036159457820878690.005496321827264977 1
265.8 2.7 3.9 1.2 1 3.25 2.148148148148148 -1.07923521659046570.5236883751378523 -0.34037717939532286-0.23743695029955128-0.00936891422024664-0.02184110533380834 1
274.4 2.9 1.4 0.2 0 6.999999999999999 1.517241379310345 3.7422969192506095 1.048460304741977 -0.636475521315278 0.07623157913054074 0.004215355833312173-0.06354157393133958 0
284.5 2.3 1.3 0.3 0 4.333333333333334 1.956521739130435 1.4537380535696471 2.4197864889383505 -1.0301500321688102 -0.5150263062576134 -0.2631218962099228 -0.06608059456656257 0
296.9 3.2 5.7 2.3 2 2.47826086956521772.15625 -2.963110301521378 -0.924626055589704 0.44833006106219797 0.20994670504662372 -0.2012725506779131 -0.0189004142877193532

And just like that df_test contains all the columns, transformations and the prediction we modelled on the training set. The state can be easily serialized to disk in a form of a JSON file. This makes deployment of a machine learning model as trivial as simply copying a JSON file from one environment to another.

[30]:
df_train.state_write('./iris_model.json')

df_test.state_load('./iris_model.json')
df_test
[30]:
# sepal_length sepal_width petal_length petal_width class_ petal_ratio sepal_ratio PCA_0 PCA_1 PCA_2 PCA_3 PCA_4 PCA_5 prediction
0 5.9 3.0 4.2 1.5 1 2.80000000000000031.9666666666666668-1.642627940409072 0.49931302910747727 -0.063088008066644660.10842057110641677 -0.03924298664189224-0.0273944397002728221
1 6.1 3.0 4.6 1.4 1 3.28571428571428562.033333333333333 -1.445047446393471 -0.1019091578746504 -0.018990122394938010.0209807676460904080.1614215276667148 -0.02716639637934938 1
2 6.6 2.9 4.6 1.3 1 3.538461538461538 2.2758620689655173-1.330564613235537 -0.419784747491312670.1759590589290671 -0.4631301992308477 0.08304243689815374 -0.0333517336774292741
3 6.7 3.3 5.7 2.1 2 2.71428571428571442.0303030303030303-2.6719170661531013-0.9149428897499291 0.4156162725009377 0.34633692661436644 0.03742964707590906 -0.0132542861962457742
4 5.5 4.2 1.4 0.2 0 6.999999999999999 1.30952380952380953.6322930267831404 0.8198526437905096 1.046277579362938 0.09738737839850209 0.09412658096734221 0.1329137026697501 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255.5 2.5 4.0 1.3 1 3.07692307692307662.2 -1.25231200886008960.5975071562677784 -0.7019801415469216 -0.11489031841855571-0.036159457820878690.005496321827264977 1
265.8 2.7 3.9 1.2 1 3.25 2.148148148148148 -1.07923521659046570.5236883751378523 -0.34037717939532286-0.23743695029955128-0.00936891422024664-0.02184110533380834 1
274.4 2.9 1.4 0.2 0 6.999999999999999 1.517241379310345 3.7422969192506095 1.048460304741977 -0.636475521315278 0.07623157913054074 0.004215355833312173-0.06354157393133958 0
284.5 2.3 1.3 0.3 0 4.333333333333334 1.956521739130435 1.4537380535696471 2.4197864889383505 -1.0301500321688102 -0.5150263062576134 -0.2631218962099228 -0.06608059456656257 0
296.9 3.2 5.7 2.3 2 2.47826086956521772.15625 -2.963110301521378 -0.924626055589704 0.44833006106219797 0.20994670504662372 -0.2012725506779131 -0.0189004142877193532

Warning: This notebook needs a running kernel to be fully interactive, please run it locally or on mybinder.

Binder

Jupyter integration: interactivity

Vaex can process about 1 billion rows per second, and in combination with the Jupyter notebook, this allows for interactive exporation of large datasets.

Introduction

The vaex-jupyter package contains the building blocks to interactively define an N-dimensional grid, which is then used for visualizations.

We start by defining the building blocks (vaex.jupyter.model.Axis, vaex.jupyter.model.DataArray and vaex.jupyter.view.DataArray) used to define and visualize our N-dimensional grid.

Let us first import the relevant packages, and open the example DataFrame:

[1]:
import vaex
import vaex.jupyter.model as vjm

import numpy as np
import matplotlib.pyplot as plt

df = vaex.example()
df
[1]:
# id x y z vx vy vz E L Lz FeH
0 0 1.2318683862686157 -0.39692866802215576-0.598057746887207 301.1552734375 174.05947875976562 27.42754554748535 -149431.40625 407.38897705078125333.9555358886719 -1.0053852796554565
1 23 -0.163700610399246223.654221296310425 -0.25490644574165344-195.00022888183594170.47216796875 142.5302276611328 -124247.953125890.2411499023438 684.6676025390625 -1.7086670398712158
2 32 -2.120255947113037 3.326052665710449 1.7078403234481812 -48.63423156738281 171.6472930908203 -2.079437255859375 -138500.546875372.2410888671875 -202.17617797851562-1.8336141109466553
3 8 4.7155890464782715 4.5852508544921875 2.2515437602996826 -232.42083740234375-294.850830078125 62.85865020751953 -60037.03906251297.63037109375 -324.6875 -1.4786882400512695
4 16 7.21718692779541 11.99471664428711 -1.064562201499939 -1.6891745328903198181.329345703125 -11.333610534667969-83206.84375 1332.79895019531251328.948974609375 -1.8570483922958374
... ... ... ... ... ... ... ... ... ... ... ...
329,99521 1.9938701391220093 0.789276123046875 0.22205990552902222 -216.9299011230468816.124420166015625 -211.244384765625 -146457.4375 457.72247314453125203.36758422851562 -1.7451677322387695
329,99625 3.7180912494659424 0.721337616443634 1.6415337324142456 -185.92160034179688-117.25082397460938-105.4986572265625 -126627.109375335.0025634765625 -301.8370056152344 -0.9822322130203247
329,99714 0.3688507676124573 13.029608726501465 -3.633934736251831 -53.677146911621094-145.15771484375 76.70909881591797 -84912.2578125817.1375732421875 645.8507080078125 -1.7645612955093384
329,99818 -0.112592644989490511.4529125690460205 2.168952703475952 179.30865478515625 205.79710388183594 -68.75872802734375 -133498.46875 724.000244140625 -283.6910400390625 -1.8808952569961548
329,9994 20.796220779418945 -3.331387758255005 12.18841552734375 42.69000244140625 69.20479583740234 29.54275131225586 -65519.328125 1843.07470703125 1581.4151611328125 -1.1231083869934082

We want to build a 2 dimensinoal grid with the number counts in each bin. To do this, we first define two axis objects:

[2]:
E_axis = vjm.Axis(df=df, expression=df.E, shape=140)
Lz_axis = vjm.Axis(df=df, expression=df.Lz, shape=100)
Lz_axis
[2]:
Axis(bin_centers=None, exception=None, expression=Lz, max=None, min=None, shape=100, shape_default=64, slice=None, status=Status.NO_LIMITS)

When we inspect the Lz_axis object we see that the min, max, and bin centers are all None. This is because Vaex calculates them in the background, so the kernel stays interactive, meaning you can continue working in the notebook. We can ask Vaex to wait until all background calculations are done. Note that for billions of rows, this can take over a second.

[3]:
await vaex.jupyter.gather()  # wait until Vaex is done with all background computation
Lz_axis  # now min and max are computed, and bin_centers is set
[3]:
Axis(bin_centers=[-2877.11808899 -2830.27174744 -2783.42540588 -2736.57906433
 -2689.73272278 -2642.88638123 -2596.04003967 -2549.19369812
 -2502.34735657 -2455.50101501 -2408.65467346 -2361.80833191
 -2314.96199036 -2268.1156488  -2221.26930725 -2174.4229657
 -2127.57662415 -2080.73028259 -2033.88394104 -1987.03759949
 -1940.19125793 -1893.34491638 -1846.49857483 -1799.65223328
 -1752.80589172 -1705.95955017 -1659.11320862 -1612.26686707
 -1565.42052551 -1518.57418396 -1471.72784241 -1424.88150085
 -1378.0351593  -1331.18881775 -1284.3424762  -1237.49613464
 -1190.64979309 -1143.80345154 -1096.95710999 -1050.11076843
 -1003.26442688  -956.41808533  -909.57174377  -862.72540222
  -815.87906067  -769.03271912  -722.18637756  -675.34003601
  -628.49369446  -581.64735291  -534.80101135  -487.9546698
  -441.10832825  -394.26198669  -347.41564514  -300.56930359
  -253.72296204  -206.87662048  -160.03027893  -113.18393738
   -66.33759583   -19.49125427    27.35508728    74.20142883
   121.04777039   167.89411194   214.74045349   261.58679504
   308.4331366    355.27947815   402.1258197    448.97216125
   495.81850281   542.66484436   589.51118591   636.35752747
   683.20386902   730.05021057   776.89655212   823.74289368
   870.58923523   917.43557678   964.28191833  1011.12825989
  1057.97460144  1104.82094299  1151.66728455  1198.5136261
  1245.35996765  1292.2063092   1339.05265076  1385.89899231
  1432.74533386  1479.59167542  1526.43801697  1573.28435852
  1620.13070007  1666.97704163  1713.82338318  1760.66972473], exception=None, expression=Lz, max=1784.0928955078125, min=-2900.541259765625, shape=100, shape_default=64, slice=None, status=Status.READY)

Note that the Axis is a traitlets HasTrait object, similar to all ipywidget objects. This means that we can link all of its properties to an ipywidget and thus creating interactivity. We can also use observe to listen to any changes to our model.

An interactive xarray DataArray display

Now that we have defined our two axes, we can create a vaex.jupyter.model.DataArray (model) together with a vaex.jupyter.view.DataArray (view).

A convenient way to do this, is to use the widget accessor data_array method, which creates both, links them together and will return a view for us.

The returned view is an ipywidget object, which becomes a visual element in the Jupyter notebook when displayed.

[4]:
data_array_widget = df.widget.data_array(axes=[Lz_axis, E_axis], selection=[None, 'default'])
data_array_widget  # being the last expression in the cell, Jupyter  will 'display' the widget

Note: If you see this notebook on readthedocs, you will see the selection coordinate already has ``[None, ‘default’]``, because cells below have already been executed and have updated this widget. If you run this notebook yourself (say on mybinder), you will see after executing the above cell, the selection will have ``[None]`` as its only value.

From the specification of the axes and the selections, Vaex computes a 3d histogram, the first dimension being the selections. Interally this is simply a numpy array, but we wrap it in an xarray DataArray object. An xarray DataArray object can be seen as a labeled Nd array, i.e. a numpy array with extra metadata to make it fully self-describing.

Notice that in the above code cell, we specified the selection argument with a list containing two elements in this case, None and 'default'. The None selection simply shows all the data, while the default refers to any selection made without explicitly naming it. Even though the later has not been defined at this point, we can still pre-emptively include it, in case we want to modify it later.

The most important properties of the data_array are printed out below:

[5]:
# NOTE: since the computations are done in the background, data_array_widget.model.grid is initially None.
# We can ask vaex-jupyter to wait till all executions are done using:
await vaex.jupyter.gather()
# get a reference to the xarray DataArray object
data_array = data_array_widget.model.grid
print(f"type:", type(data_array))
print("dims:", data_array.dims)
print("data:", data_array.data)
print("coords:", data_array.coords)
print("Lz's data:", data_array.coords['Lz'].data)
print("Lz's attrs:", data_array.coords['Lz'].attrs)
print("And displaying the xarray DataArray:")
display(data_array)  # this is what the vaex.jupyter.view.DataArray uses
type: <class 'xarray.core.dataarray.DataArray'>
dims: ('selection', 'Lz', 'E')
data: [[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]
coords: Coordinates:
  * selection  (selection) object None
  * Lz         (Lz) float64 -2.877e+03 -2.83e+03 ... 1.714e+03 1.761e+03
  * E          (E) float64 -2.414e+05 -2.394e+05 ... 3.296e+04 3.495e+04
Lz's data: [-2877.11808899 -2830.27174744 -2783.42540588 -2736.57906433
 -2689.73272278 -2642.88638123 -2596.04003967 -2549.19369812
 -2502.34735657 -2455.50101501 -2408.65467346 -2361.80833191
 -2314.96199036 -2268.1156488  -2221.26930725 -2174.4229657
 -2127.57662415 -2080.73028259 -2033.88394104 -1987.03759949
 -1940.19125793 -1893.34491638 -1846.49857483 -1799.65223328
 -1752.80589172 -1705.95955017 -1659.11320862 -1612.26686707
 -1565.42052551 -1518.57418396 -1471.72784241 -1424.88150085
 -1378.0351593  -1331.18881775 -1284.3424762  -1237.49613464
 -1190.64979309 -1143.80345154 -1096.95710999 -1050.11076843
 -1003.26442688  -956.41808533  -909.57174377  -862.72540222
  -815.87906067  -769.03271912  -722.18637756  -675.34003601
  -628.49369446  -581.64735291  -534.80101135  -487.9546698
  -441.10832825  -394.26198669  -347.41564514  -300.56930359
  -253.72296204  -206.87662048  -160.03027893  -113.18393738
   -66.33759583   -19.49125427    27.35508728    74.20142883
   121.04777039   167.89411194   214.74045349   261.58679504
   308.4331366    355.27947815   402.1258197    448.97216125
   495.81850281   542.66484436   589.51118591   636.35752747
   683.20386902   730.05021057   776.89655212   823.74289368
   870.58923523   917.43557678   964.28191833  1011.12825989
  1057.97460144  1104.82094299  1151.66728455  1198.5136261
  1245.35996765  1292.2063092   1339.05265076  1385.89899231
  1432.74533386  1479.59167542  1526.43801697  1573.28435852
  1620.13070007  1666.97704163  1713.82338318  1760.66972473]
Lz's attrs: {'min': -2900.541259765625, 'max': 1784.0928955078125}
And displaying the xarray DataArray:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
  • selection: 1
  • Lz: 100
  • E: 140
  • 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    array([[[0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0],
            ...,
            [0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0]]])
    • selection
      (selection)
      object
      None
      array([None], dtype=object)
    • Lz
      (Lz)
      float64
      -2.877e+03 -2.83e+03 ... 1.761e+03
      min :
      -2900.541259765625
      max :
      1784.0928955078125
      array([-2877.118089, -2830.271747, -2783.425406, -2736.579064, -2689.732723,
             -2642.886381, -2596.04004 , -2549.193698, -2502.347357, -2455.501015,
             -2408.654673, -2361.808332, -2314.96199 , -2268.115649, -2221.269307,
             -2174.422966, -2127.576624, -2080.730283, -2033.883941, -1987.037599,
             -1940.191258, -1893.344916, -1846.498575, -1799.652233, -1752.805892,
             -1705.95955 , -1659.113209, -1612.266867, -1565.420526, -1518.574184,
             -1471.727842, -1424.881501, -1378.035159, -1331.188818, -1284.342476,
             -1237.496135, -1190.649793, -1143.803452, -1096.95711 , -1050.110768,
             -1003.264427,  -956.418085,  -909.571744,  -862.725402,  -815.879061,
              -769.032719,  -722.186378,  -675.340036,  -628.493694,  -581.647353,
              -534.801011,  -487.95467 ,  -441.108328,  -394.261987,  -347.415645,
              -300.569304,  -253.722962,  -206.87662 ,  -160.030279,  -113.183937,
               -66.337596,   -19.491254,    27.355087,    74.201429,   121.04777 ,
               167.894112,   214.740453,   261.586795,   308.433137,   355.279478,
               402.12582 ,   448.972161,   495.818503,   542.664844,   589.511186,
               636.357527,   683.203869,   730.050211,   776.896552,   823.742894,
               870.589235,   917.435577,   964.281918,  1011.12826 ,  1057.974601,
              1104.820943,  1151.667285,  1198.513626,  1245.359968,  1292.206309,
              1339.052651,  1385.898992,  1432.745334,  1479.591675,  1526.438017,
              1573.284359,  1620.1307  ,  1666.977042,  1713.823383,  1760.669725])
    • E
      (E)
      float64
      -2.414e+05 -2.394e+05 ... 3.495e+04
      min :
      -242407.5
      max :
      35941.86328125
      array([-241413.395131, -239425.185393, -237436.975656, -235448.765918,
             -233460.55618 , -231472.346443, -229484.136705, -227495.926967,
             -225507.717229, -223519.507492, -221531.297754, -219543.088016,
             -217554.878278, -215566.668541, -213578.458803, -211590.249065,
             -209602.039328, -207613.82959 , -205625.619852, -203637.410114,
             -201649.200377, -199660.990639, -197672.780901, -195684.571164,
             -193696.361426, -191708.151688, -189719.94195 , -187731.732213,
             -185743.522475, -183755.312737, -181767.102999, -179778.893262,
             -177790.683524, -175802.473786, -173814.264049, -171826.054311,
             -169837.844573, -167849.634835, -165861.425098, -163873.21536 ,
             -161885.005622, -159896.795884, -157908.586147, -155920.376409,
             -153932.166671, -151943.956934, -149955.747196, -147967.537458,
             -145979.32772 , -143991.117983, -142002.908245, -140014.698507,
             -138026.48877 , -136038.279032, -134050.069294, -132061.859556,
             -130073.649819, -128085.440081, -126097.230343, -124109.020605,
             -122120.810868, -120132.60113 , -118144.391392, -116156.181655,
             -114167.971917, -112179.762179, -110191.552441, -108203.342704,
             -106215.132966, -104226.923228, -102238.713491, -100250.503753,
              -98262.294015,  -96274.084277,  -94285.87454 ,  -92297.664802,
              -90309.455064,  -88321.245326,  -86333.035589,  -84344.825851,
              -82356.616113,  -80368.406376,  -78380.196638,  -76391.9869  ,
              -74403.777162,  -72415.567425,  -70427.357687,  -68439.147949,
              -66450.938211,  -64462.728474,  -62474.518736,  -60486.308998,
              -58498.099261,  -56509.889523,  -54521.679785,  -52533.470047,
              -50545.26031 ,  -48557.050572,  -46568.840834,  -44580.631097,
              -42592.421359,  -40604.211621,  -38616.001883,  -36627.792146,
              -34639.582408,  -32651.37267 ,  -30663.162932,  -28674.953195,
              -26686.743457,  -24698.533719,  -22710.323982,  -20722.114244,
              -18733.904506,  -16745.694768,  -14757.485031,  -12769.275293,
              -10781.065555,   -8792.855818,   -6804.64608 ,   -4816.436342,
               -2828.226604,    -840.016867,    1148.192871,    3136.402609,
                5124.612347,    7112.822084,    9101.031822,   11089.24156 ,
               13077.451297,   15065.661035,   17053.870773,   19042.080511,
               21030.290248,   23018.499986,   25006.709724,   26994.919461,
               28983.129199,   30971.338937,   32959.548675,   34947.758412])

Note that data_array.coords['Lz'].data is the same as Lz_axis.bin_centers and data_array.coords['Lz'].attrs contains the same min/max as the Lz_axis.

Also, we see that displaying the xarray.DataArray object (data_array_view.model.grid) gives us the same output as the data_array_view above. There is a big difference however. If we change a selection:

[6]:
df.select(df.x > 0)

and scroll back we see that the data_array_view widget has updated itself, and now contains two selections! This is a very powerful feature, that allows us to make interactive visualizations.

Interactive plots

To make interactive plots we can pass a custom display_function to the data_array_widget. This will override the default notebook behaviour which is a call to display(data_array_widget). In the following example we create a function that displays a matplotlib figure:

[7]:
# NOTE: da is short for 'data array'
def plot2d(da):
    plt.figure(figsize=(8, 8))
    ar = da.data[1]  # take the numpy data, and select take the selection
    print(f'imshow of a numpy array of shape: {ar.shape}')
    plt.imshow(np.log1p(ar.T), origin='lower')

df.widget.data_array(axes=[Lz_axis, E_axis], display_function=plot2d, selection=[None, True])

In the above figure, we choose index 1 along the selection axis, which referes to the 'default' selection. Choosing an index of 0 would correspond to the None selection, and all the data would be displayed. If we now change the selection, the figure will update itself:

[8]:
df.select(df.id < 10)

As xarray’s DataArray is fully self describing, we can improve the plot by using the dimension names for labeling, and setting the extent of the figure’s axes.

Note that we don’t need any information from the Axis objects created above, and in fact, we should not use them, since they may not be in sync with the xarray DataArray object. Later on, we will create a widget that will edit the Axis’ expression.

Our improved visualization with proper axes and labeling:

[9]:
def plot2d_with_labels(da):
    plt.figure(figsize=(8, 8))
    grid = da.data  # take the numpy data
    dim_x = da.dims[0]
    dim_y = da.dims[1]
    plt.title(f'{dim_y} vs {dim_x} - shape: {grid.shape}')
    extent = [
        da.coords[dim_x].attrs['min'], da.coords[dim_x].attrs['max'],
        da.coords[dim_y].attrs['min'], da.coords[dim_y].attrs['max']
    ]
    plt.imshow(np.log1p(grid.T), origin='lower', extent=extent, aspect='auto')
    plt.xlabel(da.dims[0])
    plt.ylabel(da.dims[1])

da_plot_view_nicer = df.widget.data_array(axes=[Lz_axis, E_axis], display_function=plot2d_with_labels)
da_plot_view_nicer

We can also create more sophisticated plots, for example one where we show all of the selections. Note that we can pre-emptively expect a selection and define it later:

[10]:
def plot2d_with_selections(da):
    grid = da.data
    # Create 1 row and #selections of columns of matplotlib axes
    fig, axgrid = plt.subplots(1, grid.shape[0], sharey=True, squeeze=False)
    for selection_index, ax in enumerate(axgrid[0]):
        ax.imshow(np.log1p(grid[selection_index].T), origin='lower')

df.widget.data_array(axes=[Lz_axis, E_axis], display_function=plot2d_with_selections,
                     selection=[None, 'default', 'rest'])

Modifying a selection will update the figure.

[11]:
df.select(df.id < 10)  # select 10 objects
df.select(df.id >= 10, name='rest')  # and the rest

Another advantage of using xarray is its excellent plotting capabilities. It handles a lot of the boring stuff like axis labeling, and also provides a nice interface for slicing the data even more.

Let us introduce another axis, FeH (fun fact: FeH is a property of stars that tells us how much iron relative to hydrogen is contained in them, an idicator of their origin):

[12]:
FeH_axis = vjm.Axis(df=df, expression='FeH', min=-3, max=1, shape=5)
da_view = df.widget.data_array(axes=[E_axis, Lz_axis, FeH_axis], selection=[None, 'default'])
da_view

We can see that we now have a 4 dimensional grid, which we would like to visualize.

And xarray’s plot make our life much easier:

[13]:
def plot_with_xarray(da):
    da_log = np.log1p(da)  # Note that an xarray DataArray is like a numpy array
    da_log.plot(x='Lz', y='E', col='FeH', row='selection', cmap='viridis')

plot_view = df.widget.data_array([E_axis, Lz_axis, FeH_axis], display_function=plot_with_xarray,
                                 selection=[None, 'default', 'rest'])
plot_view

We only have to tell xarray which axis it should map to which ‘aesthetic’, speaking in Grammar of Graphics terms.

Selection widgets

Although we can change the selection in the notebook (e.g. df.select(df.id > 20)), if we create a dashboard (using Voila) we cannot execute arbitrary code. Vaex-jupyter also comes with many widgets, and one of them is a selection_expression widget:

[14]:
selection_widget = df.widget.selection_expression()
selection_widget

The counter_selection creates a widget which keeps track of the number of rows in a selection. In this case we ask it to be ‘lazy’, which means that it will not cause extra passes over the data, but will ride along if some user action triggers a calculation.

[15]:
await vaex.jupyter.gather()
w = df.widget.counter_selection('default', lazy=True)
w

Axis control widgets

Let us create new axis objects using the same expressions as before, but give them more general names (x_axis and y_axis), because we want to change the expressions interactively.

[16]:
x_axis = vjm.Axis(df=df, expression=df.Lz)
y_axis = vjm.Axis(df=df, expression=df.E)

da_xy_view = df.widget.data_array(axes=[x_axis, y_axis], display_function=plot2d_with_labels, shape=180)
da_xy_view

Again, we can change the expressions of the axes programmatically:

[17]:
# wait for the previous plot to finish
await vaex.jupyter.gather()
# Change both the x and y axis
x_axis.expression = np.log(df.x**2)
y_axis.expression = df.y
# Note that both assignment will create 1 computation in the background (minimal amount of passes over the data)
await vaex.jupyter.gather()
# vaex computed the new min/max, and the xarray DataArray
# x_axis.min, x_axis.max, da_xy_view.model.grid

But, if we want to create a dashboard with Voila, we need to have a widget that controls them:

[18]:
x_widget = df.widget.expression(x_axis.expression, label='X axis')
x_widget

This widget will allow us to edit an expression, which will be validated by Vaex. How do we ‘link’ the value of the widget to the axis expression? Because both the Axis as well as the x_widget are HasTrait objects, we can link their traits together:

[19]:
from ipywidgets import link
link((x_widget, 'value'), (x_axis, 'expression'))
[19]:
<traitlets.traitlets.link at 0x122bed450>

Since this operation is so common, we can also directly pass the Axis object, and Vaex will set up the linking for us:

[20]:
y_widget = df.widget.expression(y_axis, label='X axis')
# vaex now does this for us, much shorter
# link((y_widget, 'value'), (y_axis, 'expression'))
y_widget
[21]:
await vaex.jupyter.gather()  # lets wait again till all calculations are finished

A nice container

If you are familiar with the ipyvuetify components, you can combine them to create very pretty widgets. Vaex-jupyter comes with a nice container:

[22]:
from vaex.jupyter.widgets import ContainerCard

ContainerCard(title='My plot',
              subtitle="using vaex-jupyter",
              main=da_xy_view,
              controls=[x_widget, y_widget], show_controls=True)

We can directly assign a Vaex expression to the x_axis.expression, or to x_widget.value since they are linked.

[23]:
y_axis.expression = df.vx

Interactive plots

So far we have been using interactive widgets to control the axes in the view. The figure itself however was not interactive, and we could not have panned or zoomed for example.

Vaex has a few builtin visualizations, most notably a heatmap and histogram using bqplot:

[24]:
df = vaex.example()  # we create the dataframe again, to leave all the plots above 'alone'
heatmap_xy = df.widget.heatmap(df.x, df.y, selection=[None, True])
heatmap_xy

Note that we passed expressions, and not axis objects. Vaex recognizes this and will create the axis objects for you. You can access them from the model:

[25]:
heatmap_xy.model.x
[25]:
Axis(bin_centers=[-77.7255446  -76.91058156 -76.09561852 -75.28065547 -74.46569243
 -73.65072939 -72.83576635 -72.0208033  -71.20584026 -70.39087722
 -69.57591417 -68.76095113 -67.94598809 -67.13102505 -66.316062
 -65.50109896 -64.68613592 -63.87117288 -63.05620983 -62.24124679
 -61.42628375 -60.6113207  -59.79635766 -58.98139462 -58.16643158
 -57.35146853 -56.53650549 -55.72154245 -54.90657941 -54.09161636
 -53.27665332 -52.46169028 -51.64672723 -50.83176419 -50.01680115
 -49.20183811 -48.38687506 -47.57191202 -46.75694898 -45.94198593
 -45.12702289 -44.31205985 -43.49709681 -42.68213376 -41.86717072
 -41.05220768 -40.23724464 -39.42228159 -38.60731855 -37.79235551
 -36.97739246 -36.16242942 -35.34746638 -34.53250334 -33.71754029
 -32.90257725 -32.08761421 -31.27265117 -30.45768812 -29.64272508
 -28.82776204 -28.01279899 -27.19783595 -26.38287291 -25.56790987
 -24.75294682 -23.93798378 -23.12302074 -22.3080577  -21.49309465
 -20.67813161 -19.86316857 -19.04820552 -18.23324248 -17.41827944
 -16.6033164  -15.78835335 -14.97339031 -14.15842727 -13.34346423
 -12.52850118 -11.71353814 -10.8985751  -10.08361205  -9.26864901
  -8.45368597  -7.63872293  -6.82375988  -6.00879684  -5.1938338
  -4.37887076  -3.56390771  -2.74894467  -1.93398163  -1.11901858
  -0.30405554   0.5109075    1.32587054   2.14083359   2.95579663
   3.77075967   4.58572271   5.40068576   6.2156488    7.03061184
   7.84557489   8.66053793   9.47550097  10.29046401  11.10542706
  11.9203901   12.73535314  13.55031618  14.36527923  15.18024227
  15.99520531  16.81016836  17.6251314   18.44009444  19.25505748
  20.07002053  20.88498357  21.69994661  22.51490965  23.3298727
  24.14483574  24.95979878  25.77476183  26.58972487  27.40468791
  28.21965095  29.034614    29.84957704  30.66454008  31.47950312
  32.29446617  33.10942921  33.92439225  34.7393553   35.55431834
  36.36928138  37.18424442  37.99920747  38.81417051  39.62913355
  40.4440966   41.25905964  42.07402268  42.88898572  43.70394877
  44.51891181  45.33387485  46.14883789  46.96380094  47.77876398
  48.59372702  49.40869007  50.22365311  51.03861615  51.85357919
  52.66854224  53.48350528  54.29846832  55.11343136  55.92839441
  56.74335745  57.55832049  58.37328354  59.18824658  60.00320962
  60.81817266  61.63313571  62.44809875  63.26306179  64.07802483
  64.89298788  65.70795092  66.52291396  67.33787701  68.15284005
  68.96780309  69.78276613  70.59772918  71.41269222  72.22765526
  73.0426183   73.85758135  74.67254439  75.48750743  76.30247048
  77.11743352  77.93239656  78.7473596   79.56232265  80.37728569
  81.19224873  82.00721177  82.82217482  83.63713786  84.4521009
  85.26706395  86.08202699  86.89699003  87.71195307  88.52691612
  89.34187916  90.1568422   90.97180524  91.78676829  92.60173133
  93.41669437  94.23165742  95.04662046  95.8615835   96.67654654
  97.49150959  98.30647263  99.12143567  99.93639871 100.75136176
 101.5663248  102.38128784 103.19625089 104.01121393 104.82617697
 105.64114001 106.45610306 107.2710661  108.08602914 108.90099218
 109.71595523 110.53091827 111.34588131 112.16084436 112.9758074
 113.79077044 114.60573348 115.42069653 116.23565957 117.05062261
 117.86558565 118.6805487  119.49551174 120.31047478 121.12543783
 121.94040087 122.75536391 123.57032695 124.38529    125.20025304
 126.01521608 126.83017913 127.64514217 128.46010521 129.27506825
 130.0900313 ], exception=None, expression=x, max=130.4975128173828, min=-78.13302612304688, shape=None, shape_default=256, slice=None, status=Status.READY)

The heatmap itself is again a widget. Thus we can combine it with other widgets to create a more sophisticated interface.

[26]:
x_widget = df.widget.expression(heatmap_xy.model.x, label='X axis')
y_widget = df.widget.expression(heatmap_xy.model.y, label='X axis')

ContainerCard(title='My plot',
              subtitle="using vaex-jupyter and bqplot",
              main=heatmap_xy,
              controls=[x_widget, y_widget, selection_widget],
              show_controls=True,
              card_props={'style': 'min-width: 800px;'})

By switching the tool in the toolbar (click pan_tool, or changing it programmmatically in the next cell), we can zoom in. The plot’s axis bounds are directly synched to the axis object (the x_min is linked to the x_axis min, etc). Thus a zoom action causes the axis objects to be changed, which will trigger a recomputation.

[27]:
heatmap_xy.tool = 'pan-zoom'  # we can also do this programmatically.

Since we can access the Axis objects, we can also programmatically change the heatmap. Note that both the expression widget, the plot axis label and the heatmap it self is updated. Everything is linked together!

[28]:
heatmap_xy.model.x.expression = np.log10(df.x**2)
await vaex.jupyter.gather()  # and we wait before we continue

Another visualization based on bqplot is the interactive histogram. In the example below, we show all the data, but the selection interaction will affect/set the ‘default’ selection.

[29]:
histogram_Lz = df.widget.histogram(df.Lz, selection_interact='default')
histogram_Lz.tool = 'select-x'
histogram_Lz
[30]:
# You can graphically select a particular region, in this case we do it programmatically
# for reproducability of this notebook
histogram_Lz.plot.figure.interaction.selected = [1200, 1300]

This shows an interesting structure in the heatmap above

Creating your own visualizations

The primary goal of Vaex-Jupyter is to provide users with a framework to create dashboard and new visualizations. Over time more visualizations will go into the vaex-jupyter package, but giving you the option to create new ones is more important. To help you create new visualization, we have examples on how to create your own:

If you want to create your own visualization on this framework, check out these examples:

ipyvolume example

ipyvolume example

plotly example

plotly example

The examples can also be found at the Examples page.

Guides

Advanced plotting examples

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

Vaex uses matplotlib for creating plots, which allows for great flexibility. To avoid repetative “boilerplate” code, Vaex tries to cover many use-cases where you want to plot one or more panels using a simple declarative style.

The following examples will make use of the example dataset, which is a the results of a numerical simulation of how a galaxy like our own Milky Way was formed (source). The data contains the 3D position, velocity, angular momentum, energy and iron content for each start particle in the simulation.

Let us start by loading the data:

[1]:
import vaex
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
[2]:
df = vaex.example()
df.head()
[2]:
# id x y z vx vy vz E L Lz FeH
0 0 1.23187 -0.396929-0.598058 301.155 174.059 27.4275 -149431 407.389 333.956-1.00539
1 23 -0.163701 3.65422 -0.254906 -195 170.472 142.53 -124248 890.241 684.668-1.70867
2 32 -2.12026 3.32605 1.70784 -48.6342 171.647 -2.07944-138501 372.241-202.176-1.83361
3 8 4.71559 4.58525 2.25154 -232.421 -294.851 62.8586 -60037 1297.63 -324.688-1.47869
4 16 7.21719 11.9947 -1.06456 -1.68917 181.329 -11.3336 -83206.81332.8 1328.95 -1.85705
5 16 -7.78437 5.98977 -0.682695 86.7009 -238.778 -2.31309 -86497.61353.25 1339.42 -1.91944
6 12 8.08373 -3.27348 5.54687 -57.4544 120.117 5.37438-101867 1100.8 782.915-1.93517
7 26 -3.55719 5.41363 0.0917156 -67.0511 -145.933 39.6374 -127682 921.008 882.101-1.79423
8 25 3.9848 5.40691 2.57724 -38.7449 -152.407 -92.9073 -113632 493.316-397.824-1.18076
9 8-20.8139 -3.29468 13.4866 99.4067 28.6749-115.079 -55825.31088.46 -269.324-1.28892

A single plot

The simplest case is a single heatmap created by two axes, specified by the first two arguments:

[3]:
df.viz.heatmap('x', 'y', title='Face on galaxy', limits='99%')
_images/guides_advanced_plotting_5_0.png

Multiple plots of the same type

The first argument can be a list of axes pairs. This produces multiple plots:

[4]:
df.viz.heatmap([["x", "y"], ["x", "z"]], title="Face on and edge on", figsize=(10, 4), limits='99%');
_images/guides_advanced_plotting_7_0.png

Multiple plots, same axes, different statistics

If the what argument is a list, it will by default create multiple subplots:

[5]:
df.viz.heatmap("x", "y", what=["count(*)", "mean(vx)", "correlation(vy,vz)"],
               title="Different statistics",
               figsize=(10, 5), limits='99%');
_images/guides_advanced_plotting_9_0.png

Multiple plots, different axes, different statistics

One can specify multiple axes pairs as tje first argument, as well as a list of what arguments. The resulting figure with have a number of subplots where the different axes combinations will form the rows, and the different what statistics will form the columns:

[6]:
df.viz.heatmap([["x", "y"], ["x", "z"], ["y", "z"]],
               what=["count(*)", "mean(vx)", "correlation(vx,vy)", "correlation(vx,vz)"],
               title="Different statistics and plots",
               figsize=(14,12),
               limits='99%');
_images/guides_advanced_plotting_11_0.png

One can also specify the layout of the figure via the visual argument, which can be used to swap the row and column ordering of the subplots:

[7]:
df.viz.heatmap([["x", "y"], ["x", "z"], ["y", "z"]],
               what=["count(*)", "mean(vx)", "correlation(vx,vy)", "correlation(vx,vz)"],
               visual=dict(row="what", column="subspace"),
               title="Different statistics and plots",
               figsize=(14,12),
               limits='99%');
_images/guides_advanced_plotting_13_0.png

Slices in a 3rd dimension

If a 3rd axis (z) is given, you can “slice” through the data, displaying the z slices as rows. Note that here the rows are wrapped, which can be changed with the wrap_columns argument:

[8]:
df.viz.heatmap("Lz", "E", z="FeH:-3,-1,8",
               visual=dict(row="z"),
               figsize=(12, 8),
               f="log",
               wrap_columns=3,
               limits='99%');
_images/guides_advanced_plotting_15_0.png

Many plots with wrapping

If one attempt to create a figure with many subplots, they will be nice wrapped. Where we create heatmaps of all combinations of columns in the example dataset, sorted by their mutual information:

[9]:
# Get all column pars
pairs = df.combinations(exclude=['id'])
# Calculate the mutual information for each pair, sorted by the largest value
mi, pairs_sorted = df.mutual_information(pairs, sort=True)

# Create the figure
df.viz.heatmap(pairs_sorted, f='log', colorbar=False, figsize=(14, 20), limits='99%', wrap_columns=5);
_images/guides_advanced_plotting_17_0.png

Plotting selections

If the selection argument is used, than only the selection is plotted:

[10]:
df.viz.heatmap("x", "y", selection="sqrt(x**2+y**2) < 5", limits=[-10, 10]);
_images/guides_advanced_plotting_19_0.png

If a list of selections is specified (False or None indicates no selection), than every selection by default forms a different “layer” of the figure produced:

[11]:
df.viz.heatmap("x", "y",
               selection=[None, "sqrt(x**2+y**2) < 5", "(sqrt(x**2+y**2) < 7) & (x < 0)"],
               limits=[-10, 10]);
_images/guides_advanced_plotting_21_0.png

Overplotting a vector field on a heatmap

Astronomers argue that galaxies such as our own Milky Way were formed from many pre-galactic clumps that have merged and mixed together. One way to try and find the original pre-galactic fragments is to inspect the 2-dimensinoal distribution of their energy (𝐸) and angular momentum (𝐿𝑧). So let us make such a plot:

[12]:
df.viz.heatmap('Lz', 'E', f='log', figsize=(9, 6));
_images/guides_advanced_plotting_23_0.png

Now, to show that the stars in each clump on the figure above are indeed moving coherently in space, we can overplot their velocity vectors on a positional heatmap.

First, let’s select the stars that belong to one the clusters:

[13]:
# specify ranges of angular momentum (Lz) and energy (E)
limits_Lz_E_clump = (1181.770, 1291.92), (-70850.91, -68491.16)

# Use the rectangle selection method
df.select_rectangle("Lz", "E", limits_Lz_E_clump, name="stream")

# Check how many stars we have selected
print(f'Selection contains {df.count(selection="stream")} "stars".')
Selection contains 9556 "stars".

We can also overplot the selected region, to convince ourselves that we have chosen a good region:

[14]:
df.viz.heatmap("Lz", "E", selection=[None, "stream"], f="log", figsize=(9, 6));
_images/guides_advanced_plotting_27_0.png

Now let us plot the 𝑣𝑦 and 𝑣𝑧 velocity vectors on top of 𝑦−𝑧 plot. To start, lets compute a grid of mean 𝑣𝑦 and 𝑣𝑧 velocities. Notice that we are limiting the range of the 𝑣𝑦 and 𝑣𝑧 values to go between -20 and 20, and the grid resolution is 32x32 bins:

[15]:
limits = [-20, 20]
shape_vector = 32
mean_vy = df.mean("vy", binby=["y", "z"], limits=limits, shape=shape_vector, selection='stream')
mean_vz = df.mean("vz", binby=["y", "z"], limits=limits, shape=shape_vector, selection='stream')

Next, let us create a meshgrid to hold the centres of the bins:

[16]:
# create a 2d array with holds the center of the bins
centers = np.linspace(*limits, shape_vector, endpoint=False) + (limits[1] - limits[0])/2./shape_vector
z, y = np.meshgrid(centers, centers)

To keep the plot “clean”, we also do not want visualize the velocity of the bins with low number counts:

[17]:
# we don't want to show bins with low number of counts
counts = df.count(binby=["y", "z"], limits=limits, shape=shape_vector, selection='stream')
mask = counts.flatten( ) > 10

Finally we can plot a background density map of \(v_y\) vs \(v_z\), and then use plt.quiver to overplot the velocity vectors:

[18]:
df.viz.heatmap("y", "z", limits=limits, f="log1p", figsize=(10, 9), selection=[None, "stream"], shape=128)

# overplot the mean velocity vectors
plt.quiver(y.flatten()[mask],
           z.flatten()[mask],
           mean_vy.flatten()[mask],
           mean_vz.flatten()[mask],
           color="white",
           alpha=0.75);
_images/guides_advanced_plotting_35_0.png

We indeed see that the stars we selected move together, and form a stream!

Plotting a healpix map

Healpix is made available via the healpy package. Vaex does not need special support for healpix, but some helper functions are introduced to make working with healpix easier.

Make sure you have healpy installed. If you do not, you can install it with one of these commands:

!pip install healpy  # if you prefer pip
!conda install -c conda-forge healpy if you are using a conda package manager

To understand this better, we will start from the beginning. If we want to make a density sky plot, we would like to pass to healpy a 1d numpy array where each value represents the density at a location of the sphere, where the location is determined by the array size (the healpix level) and the offset (the location).

This example uses a simulated Gaia dataset. The Gaia data includes the healpix index encoded in the source_id column. By diving source_id by 34359738368 you get a healpix index level 12, and diving it further will take you to lower levels.

Let us start by fetching the dataset (Note: the dataset is ~700MB on disk).

[19]:
import healpy as hp
[20]:
df = vaex.datasets.tgas(full=True)
df.head()
[20]:
# astrometric_delta_q astrometric_excess_noise astrometric_excess_noise_sig astrometric_n_bad_obs_ac astrometric_n_bad_obs_al astrometric_n_good_obs_ac astrometric_n_good_obs_al astrometric_n_obs_ac astrometric_n_obs_al astrometric_primary_flag astrometric_priors_used astrometric_relegation_factor astrometric_weight_ac astrometric_weight_al b dec dec_error dec_parallax_corr dec_pmdec_corr dec_pmra_corr duplicated_source ecl_lat ecl_lon hip l matched_observations parallax parallax_error parallax_pmdec_corr parallax_pmra_corr phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_mag phot_g_n_obsphot_variable_flag pmdec pmdec_error pmra pmra_error pmra_pmdec_corr ra ra_dec_corr ra_error ra_parallax_corr ra_pmdec_corr ra_pmra_corr random_index ref_epoch scan_direction_mean_k1 scan_direction_mean_k2 scan_direction_mean_k3 scan_direction_mean_k4 scan_direction_strength_k1 scan_direction_strength_k2 scan_direction_strength_k3 scan_direction_strength_k4 solution_id source_idtycho2_id
0 1.91906 0.717101 412.606 1 0 78 79 79 79 84 3 2.9361 1.26696e-05 1.81816-48.71440.235392 0.218802 -0.407338 0.0606588 -0.0994513 70 -16.1211 42.6418 13989176.74 9 6.35295 0.30791 -0.101957 -0.00157679 1.03123e+07 10577.4 7.99138 77b'NOT_AVAILABLE' -7.64199 0.0874018 43.7523 0.0705422 0.21467745.0343 -0.414972 0.305989 0.179966 -0.0857597 0.159207 243619 2015 -113.76 21.3929 -41.6784 26.2018 0.382348 0.538266 0.392379 0.9163061635378410781933568 7627862074752b''
1 nan 0.253463 47.3163 2 0 55 57 57 57 84 5 2.65231 3.16002e-05 12.8616 -48.645 0.200068 1.19779 0.837626 -0.975644 0.972577 70 -16.193 42.7612-2147483648176.916 8 3.90033 0.323488 -0.853779 0.839739 949565 1140.17 10.581 62b'NOT_AVAILABLE' -55.1092 2.52293 10.0363 4.61141 -0.99639945.165 -0.995923 2.58388 -0.860911 0.97348 -0.972417 487238 2015 -156.433 22.7661 -36.2397 22.8906 0.711003 0.96597 0.646115 0.86716 1635378410781933568 9277129363072b'55-28-1'
2 nan 0.398901 221.185 4 1 57 60 61 61 84 5 3.9934 2.56339e-05 5.76753-48.66780.248825 0.180326 -0.391891 -0.193256 0.0894205 70 -16.1234 42.6975-2147483648176.78 7 3.15531 0.273484 -0.118552 -0.0418587 817838 1827.38 10.7431 60b'NOT_AVAILABLE' -1.60287 1.03526 2.93228 1.90864 -0.91427145.0862 -0.177443 0.213836 0.307722 -0.184817 0.0468668 1948952 2015 -117.008 19.7722 -43.1082 26.7157 0.482528 0.428758 0.524153 0.903062163537841078193356813297218905216b'55-1191-1'
3 nan 0.422492 179.982 1 0 51 52 52 52 84 5 4.21516 2.86726e-05 5.36086-48.68240.248211 0.200958 -0.337217 -0.223501 0.131811 70 -16.1182 42.6778-2147483648176.76 7 2.29237 0.280972 -0.109202 -0.0494409 602053 905.877 11.0757 61b'NOT_AVAILABLE' -18.4149 1.12985 3.66198 2.06505 -0.92617745.0665 -0.365707 0.276039 0.202878 -0.0589288 -0.0509089 102321 2015 -132.421 22.5693 -38.9545 25.8786 0.494655 0.638456 0.509074 0.898918163537841078193356813469017597184b'55-624-1'
4 nan 0.3175 119.748 2 3 85 84 87 87 84 5 3.23564 2.22788e-05 8.08078-48.572 0.335044 0.17013 -0.438708 -0.279349 0.121792 70 -16.0555 42.7734-2147483648176.739 11 1.58208 0.261539 -0.329196 0.100312 1.38812e+06 2826.43 10.1687 96b'NOT_AVAILABLE' -2.37939 0.710632 0.340802 1.22048 -0.83360445.136 -0.0490526 0.170697 0.471425 -0.156392 -0.152076 409284 2015 -106.86 4.4521 -47.8954 26.7555 0.520654 0.23931 0.653377 0.863385163537841078193356815736760328576b'55-849-1'
5 nan 0.303723 64.6868 2 1 68 69 70 70 84 5 3.10892 2.22511e-05 9.65279-48.55110.359618 0.179848 -0.437142 -0.376402 0.257906 70 -16.0335 42.7861-2147483648176.718 9 8.66308 0.255867 -0.297309 0.0791063 1.66384e+06 1381.58 9.97199 76b'NOT_AVAILABLE' -72.7114 0.720852 -52.8493 1.26429 -0.85278445.1414 -0.264588 0.205008 0.39493 0.102073 -0.36853 204642 2015 -127.824 16.3828 -44.2417 25.1631 0.522809 0.479366 0.621515 0.847412163537841078193356816527034310784b'55-182-1'
6 nan 0.340405 118.911 2 1 76 77 78 78 84 5 3.44745 2.19728e-05 7.91894-48.52420.386343 0.17188 -0.341053 -0.34408 0.1516 70 -16.0114 42.8058-2147483648176.701 9 5.6982 0.263677 -0.367848 0.0846782 1.821e+06 2755.91 9.874 77b'NOT_AVAILABLE' -3.35036 0.707184 24.5272 1.17738 -0.80009845.153 -0.0412512 0.189524 0.488929 -0.163855 -0.195289 540954 2015 -114.478 11.0431 -46.4507 26.2651 0.512088 0.322961 0.637399 0.856398163537841078193356816733192740608b'55-867-1'
7 nan 0.253709 88.6261 3 0 76 79 79 79 84 5 2.65453 2.57372e-05 13.709 -48.55690.380844 0.150943 -0.139315 -0.358996 0.238914 70 -16.0049 42.7641-2147483648176.665 10 2.09081 0.222206 -0.277202 0.093748 967144 601.802 10.561 87b'NOT_AVAILABLE' -11.6616 0.982994 -1.57293 1.73319 -0.90422345.1128 -0.187136 0.206981 0.412381 0.0994892 -0.284353 1081909 2015 -88.3027 14.7861 -47.9744 27.0228 0.39079 0.333692 0.400387 0.90071 163537841078193356816870631694208b'55-72-1'
8 nan 0.401473 226.044 3 1 69 71 72 72 84 5 4.01755 2.45771e-05 5.41389-48.65110.351099 0.169345 -0.276625 -0.175754 0.101633 70 -16.0034 42.6531-2147483648176.589 9 6.20249 0.247253 -0.139338 0.0669677 1.66582e+06 1233.43 9.9707 79b'NOT_AVAILABLE' 9.19541 1.02832 26.308 2.03485 -0.90549645.0103 -0.321544 0.243576 0.263603 -0.143727 0.107397 589318 2015 -106.23 19.3449 -44.7095 25.5226 0.335982 0.520842 0.35827 0.90504 163537841078193356826834955821312b'55-912-1'
9 nan 0.235866 49.3216 2 0 51 53 53 53 84 5 2.49518 2.42543e-05 15.7304 -48.59120.473472 0.163531 -0.0605532 -0.242013 0.14566 70 -15.8759 42.6549-2147483648176.419 8 1.67767 0.222067 -0.18584 0.0668122 1.96682e+06 1184.17 9.79036 62b'NOT_AVAILABLE' -24.5264 1.1319 9.10421 2.20939 -0.92529 44.9747 -0.407078 0.267911 0.236157 -0.0912424 0.0305957 1178636 2015 -99.9696 19.5819 -46.0718 24.0416 0.217998 0.655547 0.219464 0.892649163537841078193356833260226885120b'48-1139-1'

Let’s plot a healpix figure of level 2. We can start by counting the number of stars in each healpix region:

[21]:
level = 2
factor = 34359738368 * (4**(12-level))
nmax = hp.nside2npix(2**level)
counts = df.count(binby="source_id/" + str(factor), limits=[0, nmax], shape=nmax)
counts
[21]:
array([ 4021,  6171,  5318,  7114,  5755, 13420, 12711, 10193,  7782,
       14187, 12578, 22038, 17313, 13064, 17298, 11887,  3859,  3488,
        9036,  5533,  4007,  3899,  4884,  5664, 10741,  7678, 12092,
       10182,  6652,  6793, 10117,  9614,  3727,  5849,  4028,  5505,
        8462, 10059,  6581,  8282,  4757,  5116,  4578,  5452,  6023,
        8340,  6440,  8623,  7308,  6197, 21271, 23176, 12975, 17138,
       26783, 30575, 31931, 29697, 17986, 16987, 19802, 15632, 14273,
       10594,  4807,  4551,  4028,  4357,  4067,  4206,  3505,  4137,
        3311,  3582,  3586,  4218,  4529,  4360,  6767,  7579, 14462,
       24291, 10638, 11250, 29619,  9678, 23322, 18205,  7625,  9891,
        5423,  5808, 14438, 17251,  7833, 15226,  7123,  3708,  6135,
        4110,  3587,  3222,  3074,  3941,  3846,  3402,  3564,  3425,
        4125,  4026,  3689,  4084, 16617, 13577,  6911,  4837, 13553,
       10074,  9534, 20824,  4976,  6707,  5396,  8366, 13494, 19766,
       11012, 16130,  8521,  8245,  6871,  5977,  8789, 10016,  6517,
        8019,  6122,  5465,  5414,  4934,  5788,  6139,  4310,  4144,
       11437, 30731, 13741, 27285, 40227, 16320, 23039, 10812, 14686,
       27690, 15155, 32701, 18780,  5895, 23348,  6081, 17050, 28498,
       35232, 26223, 22341, 15867, 17688,  8580, 24895, 13027, 11223,
        7880,  8386,  6988,  5815,  4717,  9088,  8283, 12059,  9161,
        6952,  4914,  6652,  4666, 12014, 10703, 16518, 10270,  6724,
        4553,  9282,  4981])

Using the healpy package, we can plot this in a molleweide projection

[22]:
hp.mollview(counts, nest=True);
_images/guides_advanced_plotting_44_0.png

To avoid tying the above code all over again, we can use the df.healpix_count method instead:

[23]:
counts = df.healpix_count(healpix_level=6)
hp.mollview(counts, nest=True)
_images/guides_advanced_plotting_46_0.png

Instead of using healpy, we can use vaex’ df.viz.healpix_plot method:

[24]:
df.viz.healpix_heatmap(f="log1p", healpix_level=6, figsize=(10,8), healpix_output="ecliptic")
_images/guides_advanced_plotting_48_0.png

Arrow

Vaex supports Arrow. We will demonstrate vaex+arrow by giving a quick look at a large dataset that does not fit into memory. The NYC taxi dataset for the year 2015 contains about 150 million rows containing information about taxi trips in New York, and is about 23GB in size. You can download it here:

In case you want to convert it to the arrow format, use the code below:

ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')
[1]:
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
-rw-r--r--  1 maartenbreddels  staff    23G Oct 31 18:56 /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
[3]:
import vaex

Opens instantly

Opening the file goes instantly, since nothing is being copied to memory. The data is only memory mapped, a technique that will only read the data when needed.

[4]:
%time
df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs
[5]:
df
[5]:
#VendorIDdropoff_dayofweekdropoff_hourdropoff_latitudedropoff_longitudeextrafare_amountimprovement_surchargemta_taxpassenger_countpayment_typepickup_dayofweekpickup_hourpickup_latitudepickup_longitudetip_amounttolls_amounttotal_amounttpep_dropoff_datetimetpep_pickup_datetimetrip_distance
023.019.040.75061798095703-73.974784851074221.012.00.30.5113.019.040.7501106262207-73.9938964843753.250.017.05numpy.datetime64('2015-01-15T19:23:42.000000000')numpy.datetime64('2015-01-15T19:05:39.000000000')1.59
115.020.040.75910949707031-73.994415283203120.514.50.30.5115.020.040.7242431640625-74.001647949218752.00.017.8numpy.datetime64('2015-01-10T20:53:28.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')3.3
215.020.040.82441329956055-73.951820373535160.59.50.30.5125.020.040.80278778076172-73.963340759277340.00.010.8numpy.datetime64('2015-01-10T20:43:41.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')1.8
315.020.040.71998596191406-74.004325866699230.53.50.30.5125.020.040.71381759643555-74.009086608886720.00.04.8numpy.datetime64('2015-01-10T20:35:31.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')0.5
415.020.040.742652893066406-74.004180908203120.515.00.30.5125.020.040.762428283691406-73.971176147460940.00.016.3numpy.datetime64('2015-01-10T20:52:58.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')3.0
..................................................................
146,112,98424.00.040.722469329833984-73.986213684082030.57.50.30.5513.023.040.72087097167969-73.993812561035161.760.010.56numpy.datetime64('2016-01-01T00:08:18.000000000')numpy.datetime64('2015-12-31T23:59:56.000000000')1.2
146,112,98514.00.040.75238800048828-73.939514160156250.57.50.30.5223.023.040.76028060913085-73.965270996093750.00.08.8numpy.datetime64('2016-01-01T00:05:19.000000000')numpy.datetime64('2015-12-31T23:59:58.000000000')2.0
146,112,98614.00.040.69329833984375-73.98867034912110.513.50.30.5223.023.040.73907852172852-73.987297058105470.00.014.8numpy.datetime64('2016-01-01T00:12:55.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')3.8
146,112,98724.00.040.705322265625-74.017120361328120.58.50.30.5123.023.040.72569274902344-73.997558593750.00.09.8numpy.datetime64('2016-01-01T00:10:26.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.96
146,112,98824.00.040.76057052612305-73.990982055664060.513.50.30.5113.023.040.76725769042969-73.984397888183582.960.017.76numpy.datetime64('2016-01-01T00:21:30.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.06

Quick viz of 146 million rows

As can be seen, this dataset contains 146 million rows. Using plot, we can generate a quick overview what the data contains. The pickup locations nicely outline Manhattan.

[6]:
df.viz.heatmap(df.pickup_longitude, df.pickup_latitude, f='log')
_images/guides_arrow_8_0.png
[7]:
df.total_amount.minmax()
[7]:
array([-4.9630000e+02,  3.9506116e+06])

Data cleansing: outliers

As can be seen from the total_amount columns (how much people payed), this dataset contains outliers. From a quick 1d plot, we can see reasonable ways to filter the data

[8]:
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
[8]:
[<matplotlib.lines.Line2D at 0x121d26320>]
_images/guides_arrow_11_1.png
[9]:
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]

Shallow copies

This filtered dataset did not copy any data (otherwise it would have costed us about ~23GB of RAM). Shallow copies of the data are made instead and a booleans mask tracks which rows should be used.

[10]:
dff['ratio'] = dff.tip_amount/dff.total_amount

Virtual column

The new column ratio does not do any computation yet, it only stored the expression and does not waste any memory. However, the new (virtual) column can be used in calculations as if it were a normal column.

[11]:
dff.ratio.mean()
<string>:1: RuntimeWarning: invalid value encountered in true_divide
[11]:
0.09601926650107262

Result

Our final result, the percentage of the tip, can be easily calcualted for this large dataset, it did not require any excessive amount of memory.

Interoperability

Since the data lives as Arrow arrays, we can pass them around to other libraries such as pandas, or even pass it to other processes.

[12]:
arrow_table = df.to_arrow_table()
arrow_table
[12]:
pyarrow.Table
VendorID: int64
dropoff_dayofweek: double
dropoff_hour: double
dropoff_latitude: double
dropoff_longitude: double
extra: double
fare_amount: double
improvement_surcharge: double
mta_tax: double
passenger_count: int64
payment_type: int64
pickup_dayofweek: double
pickup_hour: double
pickup_latitude: double
pickup_longitude: double
tip_amount: double
tolls_amount: double
total_amount: double
tpep_dropoff_datetime: timestamp[ns]
tpep_pickup_datetime: timestamp[ns]
trip_distance: double
[13]:
# Although you can 'convert' (pass the data) in to pandas,
# some memory will be wasted (at least an index will be created by pandas)
# here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas
[13]:
VendorID dropoff_dayofweek dropoff_hour dropoff_latitude dropoff_longitude extra fare_amount improvement_surcharge mta_tax passenger_count ... pickup_dayofweek pickup_hour pickup_latitude pickup_longitude tip_amount tolls_amount total_amount tpep_dropoff_datetime tpep_pickup_datetime trip_distance
0 2 3.0 19.0 40.750618 -73.974785 1.0 12.0 0.3 0.5 1 ... 3.0 19.0 40.750111 -73.993896 3.25 0.00 17.05 2015-01-15 19:23:42 2015-01-15 19:05:39 1.59
1 1 5.0 20.0 40.759109 -73.994415 0.5 14.5 0.3 0.5 1 ... 5.0 20.0 40.724243 -74.001648 2.00 0.00 17.80 2015-01-10 20:53:28 2015-01-10 20:33:38 3.30
2 1 5.0 20.0 40.824413 -73.951820 0.5 9.5 0.3 0.5 1 ... 5.0 20.0 40.802788 -73.963341 0.00 0.00 10.80 2015-01-10 20:43:41 2015-01-10 20:33:38 1.80
3 1 5.0 20.0 40.719986 -74.004326 0.5 3.5 0.3 0.5 1 ... 5.0 20.0 40.713818 -74.009087 0.00 0.00 4.80 2015-01-10 20:35:31 2015-01-10 20:33:39 0.50
4 1 5.0 20.0 40.742653 -74.004181 0.5 15.0 0.3 0.5 1 ... 5.0 20.0 40.762428 -73.971176 0.00 0.00 16.30 2015-01-10 20:52:58 2015-01-10 20:33:39 3.00
5 1 5.0 20.0 40.758194 -73.986977 0.5 27.0 0.3 0.5 1 ... 5.0 20.0 40.774048 -73.874374 6.70 5.33 40.33 2015-01-10 20:53:52 2015-01-10 20:33:39 9.00
6 1 5.0 20.0 40.749634 -73.992470 0.5 14.0 0.3 0.5 1 ... 5.0 20.0 40.726009 -73.983276 0.00 0.00 15.30 2015-01-10 20:58:31 2015-01-10 20:33:39 2.20
7 1 5.0 20.0 40.726326 -73.995010 0.5 7.0 0.3 0.5 3 ... 5.0 20.0 40.734142 -74.002663 1.66 0.00 9.96 2015-01-10 20:42:20 2015-01-10 20:33:39 0.80
8 1 5.0 21.0 40.759357 -73.987595 0.0 52.0 0.3 0.5 3 ... 5.0 20.0 40.644356 -73.783043 0.00 5.33 58.13 2015-01-10 21:11:35 2015-01-10 20:33:39 18.20
9 1 5.0 20.0 40.759365 -73.985916 0.5 6.5 0.3 0.5 2 ... 5.0 20.0 40.767948 -73.985588 1.55 0.00 9.35 2015-01-10 20:40:44 2015-01-10 20:33:40 0.90
10 1 5.0 20.0 40.728584 -74.004395 0.5 7.0 0.3 0.5 1 ... 5.0 20.0 40.723103 -73.988617 1.66 0.00 9.96 2015-01-10 20:41:39 2015-01-10 20:33:40 0.90
11 1 5.0 20.0 40.757217 -73.967407 0.5 7.5 0.3 0.5 1 ... 5.0 20.0 40.751419 -73.993782 1.00 0.00 9.80 2015-01-10 20:43:26 2015-01-10 20:33:41 1.10
12 1 5.0 20.0 40.707726 -74.009773 0.5 3.0 0.3 0.5 1 ... 5.0 20.0 40.704376 -74.008362 0.00 0.00 4.30 2015-01-10 20:35:23 2015-01-10 20:33:41 0.30
13 1 5.0 21.0 40.735210 -73.997345 0.5 19.0 0.3 0.5 1 ... 5.0 20.0 40.760448 -73.973946 3.00 0.00 23.30 2015-01-10 21:03:04 2015-01-10 20:33:41 3.10
14 1 5.0 20.0 40.739895 -73.995216 0.5 6.0 0.3 0.5 1 ... 5.0 20.0 40.731777 -74.006721 0.00 0.00 7.30 2015-01-10 20:39:23 2015-01-10 20:33:41 1.10
15 2 3.0 19.0 40.757889 -73.983978 1.0 16.5 0.3 0.5 1 ... 3.0 19.0 40.739811 -73.976425 4.38 0.00 22.68 2015-01-15 19:32:00 2015-01-15 19:05:39 2.38
16 2 3.0 19.0 40.786858 -73.955124 1.0 12.5 0.3 0.5 5 ... 3.0 19.0 40.754246 -73.968704 0.00 0.00 14.30 2015-01-15 19:21:00 2015-01-15 19:05:40 2.83
17 2 3.0 19.0 40.785782 -73.952713 1.0 26.0 0.3 0.5 5 ... 3.0 19.0 40.769581 -73.863060 8.08 5.33 41.21 2015-01-15 19:28:18 2015-01-15 19:05:40 8.33
18 2 3.0 19.0 40.786083 -73.980850 1.0 11.5 0.3 0.5 1 ... 3.0 19.0 40.779423 -73.945541 0.00 0.00 13.30 2015-01-15 19:20:36 2015-01-15 19:05:41 2.37
19 2 3.0 19.0 40.718590 -73.952377 1.0 21.5 0.3 0.5 2 ... 3.0 19.0 40.774010 -73.874458 4.50 0.00 27.80 2015-01-15 19:20:22 2015-01-15 19:05:41 7.13
20 2 3.0 19.0 40.714596 -73.998924 1.0 17.5 0.3 0.5 1 ... 3.0 19.0 40.751896 -73.976601 0.00 0.00 19.30 2015-01-15 19:31:00 2015-01-15 19:05:41 3.60
21 2 3.0 19.0 40.734650 -73.999939 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.745079 -73.994957 1.62 0.00 8.92 2015-01-15 19:10:22 2015-01-15 19:05:41 0.89
22 2 3.0 19.0 40.735512 -74.003563 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.747063 -74.000938 1.30 0.00 8.60 2015-01-15 19:10:55 2015-01-15 19:05:41 0.96
23 2 3.0 19.0 40.704220 -74.007919 1.0 6.5 0.3 0.5 2 ... 3.0 19.0 40.717892 -74.002777 1.50 0.00 9.80 2015-01-15 19:12:36 2015-01-15 19:05:41 1.25
24 2 3.0 19.0 40.761856 -73.978172 1.0 11.5 0.3 0.5 5 ... 3.0 19.0 40.736362 -73.997459 2.50 0.00 15.80 2015-01-15 19:22:11 2015-01-15 19:05:41 2.11
25 2 3.0 19.0 40.811089 -73.953339 1.0 7.5 0.3 0.5 5 ... 3.0 19.0 40.823994 -73.952278 1.70 0.00 11.00 2015-01-15 19:14:05 2015-01-15 19:05:41 1.15
26 2 3.0 19.0 40.734890 -73.988609 1.0 9.0 0.3 0.5 1 ... 3.0 19.0 40.750080 -73.991127 0.00 0.00 10.80 2015-01-15 19:16:18 2015-01-15 19:05:42 1.53
27 2 3.0 19.0 40.743530 -73.985603 0.0 52.0 0.3 0.5 1 ... 3.0 19.0 40.644127 -73.786575 6.00 5.33 64.13 2015-01-15 19:49:07 2015-01-15 19:05:42 18.06
28 2 3.0 19.0 40.757721 -73.994514 1.0 10.0 0.3 0.5 1 ... 3.0 19.0 40.741447 -73.993668 2.36 0.00 14.16 2015-01-15 19:18:33 2015-01-15 19:05:42 1.76
29 2 3.0 19.0 40.704689 -74.009079 1.0 17.5 0.3 0.5 6 ... 3.0 19.0 40.744083 -73.985291 3.70 0.00 23.00 2015-01-15 19:21:40 2015-01-15 19:05:42 5.19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9970 1 4.0 11.0 40.719917 -73.955521 0.0 20.0 0.3 0.5 1 ... 4.0 10.0 40.725979 -74.009071 4.00 0.00 24.80 2015-01-30 11:20:08 2015-01-30 10:51:40 3.70
9971 1 4.0 10.0 40.720398 -73.984940 1.0 6.5 0.3 0.5 1 ... 4.0 10.0 40.732452 -73.985001 1.65 0.00 9.95 2015-01-30 10:58:58 2015-01-30 10:51:40 1.10
9972 1 4.0 11.0 40.755405 -74.002457 0.0 8.5 0.3 0.5 2 ... 4.0 10.0 40.751358 -73.990479 1.00 0.00 10.30 2015-01-30 11:03:41 2015-01-30 10:51:41 0.70
9973 2 1.0 19.0 40.763626 -73.969666 1.0 24.5 0.3 0.5 1 ... 1.0 18.0 40.708790 -74.017281 5.10 0.00 31.40 2015-01-13 19:22:18 2015-01-13 18:55:41 7.08
9974 2 1.0 19.0 40.772366 -73.960800 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.780003 -73.954681 1.00 0.00 8.30 2015-01-13 19:02:03 2015-01-13 18:55:41 0.64
9975 2 1.0 19.0 40.733429 -73.984154 1.0 9.0 0.3 0.5 1 ... 1.0 18.0 40.749680 -73.991531 0.00 0.00 10.80 2015-01-13 19:06:56 2015-01-13 18:55:41 1.67
9976 2 1.0 19.0 40.774780 -73.957779 1.0 20.0 0.3 0.5 3 ... 1.0 18.0 40.751801 -74.002327 2.00 0.00 23.80 2015-01-13 19:18:39 2015-01-13 18:55:42 5.28
9977 2 1.0 19.0 40.751698 -73.989746 1.0 8.5 0.3 0.5 2 ... 1.0 18.0 40.768433 -73.986137 0.00 0.00 10.30 2015-01-13 19:06:38 2015-01-13 18:55:42 1.38
9978 2 1.0 19.0 40.752941 -73.977470 1.0 7.5 0.3 0.5 1 ... 1.0 18.0 40.745071 -73.987068 1.00 0.00 10.30 2015-01-13 19:05:34 2015-01-13 18:55:42 0.88
9979 2 1.0 19.0 40.735130 -73.976120 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.751259 -73.977814 0.00 0.00 10.30 2015-01-13 19:05:41 2015-01-13 18:55:42 1.58
9980 2 1.0 19.0 40.745541 -73.984383 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.731110 -74.001350 0.00 0.00 10.30 2015-01-13 19:05:32 2015-01-13 18:55:42 1.58
9981 2 1.0 19.0 40.793671 -73.974327 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.791222 -73.965118 0.00 0.00 6.80 2015-01-13 19:00:05 2015-01-13 18:55:42 0.63
9982 2 1.0 19.0 40.754639 -73.986343 1.0 11.0 0.3 0.5 1 ... 1.0 18.0 40.764175 -73.968994 1.00 0.00 13.80 2015-01-13 19:11:57 2015-01-13 18:55:43 1.63
9983 2 1.0 18.0 40.723721 -73.989494 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.714985 -73.992409 2.00 0.00 8.30 2015-01-13 18:59:19 2015-01-13 18:55:43 0.70
9984 2 1.0 19.0 40.774590 -73.963249 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.764881 -73.968529 1.30 0.00 8.60 2015-01-13 19:01:19 2015-01-13 18:55:44 0.94
9985 2 1.0 19.0 40.774872 -73.982613 1.0 7.0 0.3 0.5 1 ... 1.0 18.0 40.762344 -73.985695 1.60 0.00 10.40 2015-01-13 19:03:54 2015-01-13 18:55:44 1.04
9986 2 1.0 19.0 40.787998 -73.953888 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.779526 -73.957619 1.20 0.00 8.00 2015-01-13 19:00:06 2015-01-13 18:55:44 0.74
9987 2 1.0 19.0 40.790218 -73.975128 1.0 11.5 0.3 0.5 1 ... 1.0 18.0 40.762226 -73.985916 2.50 0.00 15.80 2015-01-13 19:10:46 2015-01-13 18:55:44 2.19
9988 2 1.0 19.0 40.739487 -73.989059 1.0 9.5 0.3 0.5 1 ... 1.0 18.0 40.725056 -73.984329 2.10 0.00 13.40 2015-01-13 19:08:40 2015-01-13 18:55:44 1.48
9989 2 1.0 19.0 40.780548 -73.959030 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.778542 -73.981949 1.00 0.00 11.30 2015-01-13 19:04:44 2015-01-13 18:55:45 1.83
9990 2 1.0 19.0 40.761524 -73.960602 1.0 15.0 0.3 0.5 1 ... 1.0 18.0 40.746319 -74.001114 0.00 0.00 16.80 2015-01-13 19:14:59 2015-01-13 18:55:45 3.27
9991 2 1.0 19.0 40.720646 -73.989716 1.0 8.0 0.3 0.5 1 ... 1.0 18.0 40.738167 -73.987434 1.00 0.00 10.80 2015-01-13 19:04:58 2015-01-13 18:55:45 1.56
9992 2 1.0 19.0 40.795898 -73.972610 1.0 20.5 0.3 0.5 1 ... 1.0 18.0 40.740582 -73.989738 4.30 0.00 26.60 2015-01-13 19:18:18 2015-01-13 18:55:45 5.40
9993 2 1.0 18.0 40.769939 -73.981316 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.772015 -73.979416 1.10 0.00 7.40 2015-01-13 18:59:40 2015-01-13 18:55:45 0.34
9994 2 4.0 18.0 40.773521 -73.955353 1.0 31.0 0.3 0.5 1 ... 4.0 18.0 40.713215 -74.013542 5.00 0.00 37.80 2015-01-23 18:59:52 2015-01-23 18:22:55 9.05
9995 2 4.0 18.0 40.774670 -73.947845 1.0 11.5 0.3 0.5 1 ... 4.0 18.0 40.773186 -73.978043 0.00 0.00 13.30 2015-01-23 18:37:44 2015-01-23 18:22:55 2.32
9996 2 4.0 18.0 40.758148 -73.985626 1.0 8.5 0.3 0.5 2 ... 4.0 18.0 40.752003 -73.973198 0.00 0.00 10.30 2015-01-23 18:34:48 2015-01-23 18:22:56 0.92
9997 2 4.0 18.0 40.768131 -73.964516 1.0 10.5 0.3 0.5 1 ... 4.0 18.0 40.740456 -73.986252 2.46 0.00 14.76 2015-01-23 18:33:58 2015-01-23 18:22:56 2.36
9998 2 4.0 18.0 40.759171 -73.975189 1.0 6.5 0.3 0.5 3 ... 4.0 18.0 40.770500 -73.981323 2.08 0.00 10.38 2015-01-23 18:29:22 2015-01-23 18:22:56 1.05
9999 2 4.0 18.0 40.752113 -73.975189 1.0 5.0 0.3 0.5 1 ... 4.0 18.0 40.761505 -73.968452 0.00 0.00 6.80 2015-01-23 18:27:58 2015-01-23 18:22:57 0.75

10000 rows × 21 columns

Tutorial

If you want to learn more on vaex, take a look at the tutorials to see what is possible.

Async programming with Vaex

Using the Rich based progress bar we can see that if we call two methods on a dataframe, we get two passes over the data (as indicated by the [1] and [2]).

[1]:
import vaex

df = vaex.datasets.taxi()

with vaex.progress.tree('rich', title="Two passes"):
    print(df.tip_amount.sum())
    print(df.passenger_count.sum())
  Two passes                                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.15s   
├──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
│   └──   vaex.agg.sum('tip_amount')            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
└──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.06s   
    └──   vaex.agg.sum('passenger_count')       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.06s[2]

Using delay=True

If we pass delay=True, Vaex will not start to execute the tasks it created internally, but will return a promise instead. After calling df.execute() all tasks will execute, and the promises will be resolved, meaning that you can use the .get() method to get the final value, or use the .then() method to represent the result.

[2]:
with vaex.progress.tree('rich', title="Single pass using delay"):
    tip_sum_promise = df.tip_amount.sum(delay=True)
    passengers_promise = df.passenger_count.sum(delay=True)
    df.execute()
    tip_per_passenger = tip_sum_promise.get() / passengers_promise.get()
    print(f"tip_per_passenger = {tip_per_passenger}")
  Single pass using delay                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
├──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
│   └──   vaex.agg.sum('tip_amount')            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
└──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
    └──   vaex.agg.sum('passenger_count')       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
tip_per_passenger = 0.5774000691888607

Using the @delayed decorator

To make life easier, Vaex implements the vaex.delayed decorator. Once all arguments are resolved, the decorated function will be executed automatically.

[3]:
with vaex.progress.tree('rich', title="Single pass using delay + using delayed"):
    @vaex.delayed
    def compute(tip_sum, passengers):
        return tip_sum/passengers

    tip_per_passenger_promise = compute(df.tip_amount.sum(delay=True),
                                        df.passenger_count.sum(delay=True))
    df.execute()
    print(f"tip_per_passenger = {tip_per_passenger_promise.get()}")
  Single pass using delay + using delayed       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
├──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
│   └──   vaex.agg.sum('tip_amount')            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
└──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
    └──   vaex.agg.sum('passenger_count')       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]

Async await

In all of the above cases, we called df.execute() which will synchronously execute all tasks using threads. However, if you are using Async IO in Python, this means you are blocking all other async coroutines from running.

To allow other coroutines to continue running (e.g. in a FastAPI context), we can instead await df.execute_async(). On top of that, we can also await the promise to get the result, instead of calling .get() to make your code look more AsyncIO like.

[4]:
with vaex.progress.tree('rich', title="Single pass using delay + using delayed and await"):
    @vaex.delayed
    def compute(tip_sum, passengers):
        return tip_sum/passengers

    tip_per_passenger_promise = compute(df.tip_amount.sum(delay=True),
                                        df.passenger_count.sum(delay=True))
    await df.execute_async()
    tip_per_passenger = await tip_per_passenger_promise
    print(f"tip_per_passenger = {tip_per_passenger}")
  Single pass using delay + using delayed and await ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.14s   
├──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.09s   
│   └──   vaex.agg.sum('tip_amount')            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
└──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
    └──   vaex.agg.sum('passenger_count')       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
tip_per_passenger = 0.5774000691888603

Note: In the Jupyter notebook, an asyncio event loop is already running. In a script you may need to use asyncio.run(my_top_level_coroutine()) in order to use await.

Async auto execute

In the previous example we manually had to call df.execute_async(). This enables Vaex to execute all tasks in as little passes over the data as possible.

To make life easier, and your code even more AsyncIO like, we can use the df.executor.auto_execute() async context manager that will automatically call df.execute_async() for you when a promise is awaited.

[5]:
with vaex.progress.tree('rich', title="Single pass using auto_execute"):
    async with df.executor.auto_execute():
        @vaex.delayed
        def compute(tip_sum, passengers):
            return tip_sum/passengers

        tip_per_passenger = await compute(df.tip_amount.sum(delay=True),
                                          df.passenger_count.sum(delay=True))
        print(f"tip_per_passenger = {tip_per_passenger}")
  Single pass using auto_execute                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
├──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
│   └──   vaex.agg.sum('tip_amount')            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
└──   sum                                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s   
    └──   vaex.agg.sum('passenger_count')       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00.08s[1]
tip_per_passenger = 0.5774000691888609

Caching

Vaex can cache task results, such as aggregations, or the internal hashmaps used for groupby operations to make recurring calculations much faster, at the cost of calculating cache keys and storing/retrieving the cached values.

Internally, Vaex calculates fingerprints (e.g. hashes of data, or file paths and mtimes) to create cache keys that are similar across processes, so that a restart of a process will most likely result in similar hash keys.

See configuration of the cache.

Caches can be turned on globally like this:

[1]:
import vaex
df = vaex.datasets.titanic()
vaex.cache.memory();  # cache on globally

One can verify that the cache is turned on via:

[2]:
vaex.cache.is_on()
[2]:
True

The cache can be globally turned off again:

[3]:
vaex.cache.off()
vaex.cache.is_on()
[3]:
False

The cache can also be turned on with a context manager, after which it will be turned off again. Here we use a disk cache. Disk cache is shared among processes, and is ideal for processes that restart, or when using Vaex in a web service with multiple workers. Consider the following example:

[4]:
with vaex.cache.disk(clear=True):
    print(df.age.mean())  # The very first time the mean is computed
29.8811345124283
[5]:
# outside of the context manager, the cache is still off
vaex.cache.is_on()
[5]:
False
[6]:
with vaex.cache.disk():
    print(df.age.mean())  # The second time the result is read from the cache
29.8811345124283
[7]:
vaex.cache.is_on()
[7]:
False

Dask

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

Dask.array

A vaex dataframe can be lazily converted to a dask.array using DataFrame.to_dask_array.

[2]:
import vaex
df = vaex.example()
df
[2]:
# x y z vx vy vz E L Lz FeH
0 -0.7774707672.10626292 1.93743467 53.276722 288.386047 -95.2649078-121238.171875 831.0799560546875 -336.426513671875 -2.309227609164518
1 3.77427316 2.23387194 3.76209331 252.810791 -69.9498444-56.3121033-100819.91406251435.1839599609375-828.7567749023438 -1.788735491591229
2 1.3757627 -6.3283844 2.63250017 96.276474 226.440201 -34.7527161-100559.96093751039.2989501953125920.802490234375 -0.7618109022478798
3 -7.06737804 1.31737781 -6.10543537 204.968842 -205.679016-58.9777031-70174.8515625 2441.724853515625 1183.5899658203125 -1.5208778422936413
4 0.243441463 -0.822781682-0.206593871-311.742371-238.41217 186.824127 -144138.75 374.8164367675781 -314.5353088378906 -2.655341358427361
... ... ... ... ... ... ... ... ... ... ...
329,9953.76883793 4.66251659 -4.42904139 107.432999 -2.1377129617.5130272 -119687.3203125746.8833618164062 -508.96484375 -1.6499842518381402
329,9969.17409325 -8.87091351 -8.61707687 32.0 108.089264 179.060638 -68933.8046875 2395.633056640625 1275.490234375 -1.4336036247720836
329,997-1.14041007 -8.4957695 2.25749826 8.46711349 -38.2765236-127.541473-112580.359375 1182.436279296875 115.58557891845703 -1.9306227597361942
329,998-14.2985935 -5.51750422 -8.65472317 110.221558 -31.392559186.2726822 -74862.90625 1324.59265136718751057.017333984375 -1.225019818838568
329,99910.5450506 -8.86106777 -4.65835428 -2.10541415-27.61088563.80799961 -95361.765625 351.0955505371094 -309.81439208984375-2.5689636894079477
[10]:
# convert a set of columns in the dataframe to a 2d dask array
A = df[['x', 'y', 'z']].to_dask_array()
A
[10]:
Array Chunk
Bytes 7.92 MB 7.92 MB
Shape (330000, 3) (330000, 3)
Count 2 Tasks 1 Chunks
Type float64 numpy.ndarray
3 330000
[11]:
import dask.array as da
# lazily compute with dask
r = da.sqrt(A[:,0]**2 + A[:,1]**2 + A[:,2]**2)
r
[11]:
Array Chunk
Bytes 2.64 MB 2.64 MB
Shape (330000,) (330000,)
Count 11 Tasks 1 Chunks
Type float64 numpy.ndarray
330000 1
[12]:
# materialize the data
r_computed = r.compute()
r_computed
[15]:
# put it back in the dataframe
df['r'] = r_computed
df
[15]:
# x y z vx vy vz E L Lz FeH r
0 -0.7774707672.10626292 1.93743467 53.276722 288.386047 -95.2649078-121238.171875 831.0799560546875 -336.426513671875 -2.309227609164518 2.9655450396553587
1 3.77427316 2.23387194 3.76209331 252.810791 -69.9498444-56.3121033-100819.91406251435.1839599609375-828.7567749023438 -1.788735491591229 5.77829281049018
2 1.3757627 -6.3283844 2.63250017 96.276474 226.440201 -34.7527161-100559.96093751039.2989501953125920.802490234375 -0.76181090224787986.99079603950256
3 -7.06737804 1.31737781 -6.10543537 204.968842 -205.679016-58.9777031-70174.8515625 2441.724853515625 1183.5899658203125 -1.52087784229364139.431842752707537
4 0.243441463 -0.822781682-0.206593871-311.742371-238.41217 186.824127 -144138.75 374.8164367675781 -314.5353088378906 -2.655341358427361 0.8825613121347967
... ... ... ... ... ... ... ... ... ... ... ...
329,9953.76883793 4.66251659 -4.42904139 107.432999 -2.1377129617.5130272 -119687.3203125746.8833618164062 -508.96484375 -1.64998425183814027.453831761514681
329,9969.17409325 -8.87091351 -8.61707687 32.0 108.089264 179.060638 -68933.8046875 2395.633056640625 1275.490234375 -1.433603624772083615.398412491068198
329,997-1.14041007 -8.4957695 2.25749826 8.46711349 -38.2765236-127.541473-112580.359375 1182.436279296875 115.58557891845703 -1.93062275973619428.864250273925633
329,998-14.2985935 -5.51750422 -8.65472317 110.221558 -31.392559186.2726822 -74862.90625 1324.59265136718751057.017333984375 -1.225019818838568 17.601047186042507
329,99910.5450506 -8.86106777 -4.65835428 -2.10541415-27.61088563.80799961 -95361.765625 351.0955505371094 -309.81439208984375-2.568963689407947714.540181524970293
[ ]:

Data Types

Vaex is a hybrid DataFrame - it supports both numpy and arrow data types. This page outlines exactly which data types are supported in Vaex, and which we hope to support in the future. We also provide some tips on how to approach common data structures.

For some additional insight, you are welcome to look at this post as well.

Supported Data Types in Vaex

In the table below we define:

  • Supported: a column or expression of that type can exist and can be stored in at least one file format;

  • Unsupported: a column or expression of that type can currently not live within a Vaex DataFrame, but can supported be added in the future;

  • Will not support: This datatype will not be supported in Vaex going forward.

Framework

Dtype

Supported

Remarks

Python

int

yes

Will be converted to a numpy array

Python

float

yes

Will be converted to a numpy array

Python

datetime

not yet

Python

timedelta

not yet

Python

str

yes

Will be converted to Arrow array

numpy

int8

yes

numpy

int16

yes

numpy

int32

yes

numpy

int64

yes

numpy

float16

yes

Operations should be upcast to float64

numpy

float32

yes

numpy

float64

yes

numpy

datetime64

yes

numpy

timedelta64

yes

numpy

object ('O')

no

arrow

int8

yes

arrow

int16

yes

arrow

int32

yes

arrow

int64

yes

arrow

float16

yes

Operations should be upcast to float64

arrow

float32

yes

arrow

float64

yes

arrow

date32

yes

arrow

time64

yes

arrow

time32

yes

arrow

duration

yes

arrow

struct

yes

Can’t be exported to HDF5 yet, but possible

arrow

dictionary

yes

arrow

union

not yet

General advice on data types in Vaex

Vaex requires that each column or expression be of a single data type, as in the case of databases. Having a column of different data type can result in a data type object, which is not supported, and can also give raise to various problems.

The general advice is to prepare your data to have a uniform data type per column prior to using Vaex with it.

[1]:
import vaex
import numpy as np
import pyarrow as pa

Higher dimensional arrays

Vaex support high dimensional numpy arrays. The one requirement the arrays must have the same shape. Currently DataFrames that contain higher dimensional numpy arrays can only be exported to HDF5. We hope that arrow will add support for this soon, so we can export such columns to the arrow and parquet formats also.

For example:

[2]:
x = np.random.randn(100, 10, 10)
df = vaex.from_arrays(x=x)
df
[2]:
# x
0 'array([[ 1.83097431e+00, -9.90736404e-01, -8.85...
1 'array([[ 1.99466370e+00, 8.92569841e-01, 2.11...
2 'array([[-0.69977757, 0.61319317, 0.01313954, ...
3 'array([[ 0.25304255, -0.84425097, -1.18806199, ...
4 'array([[ 0.31611316, -1.36148251, 1.67342284, ...
... ...
95'array([[-0.60892972, -0.52389881, -0.92493729, ...
96'array([[ 1.10435809, 1.06875633, 1.45812865, ...
97'array([[-0.59311765, 0.10650056, -0.14413671, ...
98'array([[-0.24467361, -0.40743024, 0.6914302 , ...
99'array([[-1.0646038 , 0.53975242, -1.70715565, ...

We can also pass a nested list of lists structure to Vaex. This will be converted on the fly to a numpy ndarray. As before, the condition is that the resulting ndarray must be regular.

For example:

[3]:
x = [[1, 2], [3, 4]]
df = vaex.from_arrays(x=x)
df
[3]:
# x
0array([1, 2])
1array([3, 4])

If we happen to have a non-regular list of lists, i.e. a list of lists where the inner lists are of different lengths, we first need to convert it to an arrow array before passing it to vaex:

[4]:
x = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
x = pa.array(x)
df = vaex.from_arrays(x=x)
df
[4]:
# x
0[1, 2, 3, 4, 5]
1[6, 7]
2[8, 9, 10]

Note the arrow structs and lists can not be exported to HDF5 for the time being.

String support in Vaex

Vaex uses arrow under the hood to work with strings. Any strings passed to a Vaex DataFrame will be converted to an arrow type.

For example:

[5]:
x = ['This', 'is', 'a', 'string', 'column']
y = np.array(['This', 'is', 'one', 'also', None])

df = vaex.from_arrays(x=x, y=y)
print(df)

display(df.x.values)
display(df.y.values)
  #  x       y
  0  This    This
  1  is      is
  2  a       one
  3  string  also
  4  column  --
<pyarrow.lib.StringArray object at 0x7f277b9b9040>
[
  "This",
  "is",
  "a",
  "string",
  "column"
]
<pyarrow.lib.StringArray object at 0x7f277b9b9d60>
[
  "This",
  "is",
  "one",
  "also",
  null
]

It is useful to know that string operations in Vaex also work on lists of lists of strings (and also on lists of lists of lists of strings, and so on).

[6]:
x = pa.array([['Reggie', 'Miller'], ['Michael', 'Jordan']])
df = vaex.from_arrays(x=x)
df.x.str.lower()
[6]:
Expression = str_lower(x)
Length: 2 dtype: list<item: string> (expression)
------------------------------------------------
0   ['reggie', 'miller']
1  ['michael', 'jordan']

GraphQL

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

vaex-graphql is a plugin package that exposes a DataFrame via a GraphQL interface. This allows easy sharing of data or aggregations/statistics or machine learning models to frontends or other programs with a standard query languages.

(Install with $ pip install vaex-graphql, no conda-forge support yet)

[3]:
import vaex
df = vaex.datasets.titanic()
df
[3]:
# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 True Allen, Miss. Elisabeth Walton female29.0 0 0 24160 211.3375B5 S 2 nan St Louis, MO
1 1 True Allison, Master. Hudson Trevor male 0.91671 2 113781 151.55 C22 C26S 11 nan Montreal, PQ / Chesterville, ON
2 1 False Allison, Miss. Helen Loraine female2.0 1 2 113781 151.55 C22 C26S None nan Montreal, PQ / Chesterville, ON
3 1 False Allison, Mr. Hudson Joshua Creighton male 30.0 1 2 113781 151.55 C22 C26S None 135.0 Montreal, PQ / Chesterville, ON
4 1 False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25.0 1 2 113781 151.55 C22 C26S None nan Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1,3043 False Zabour, Miss. Hileni female14.5 1 0 2665 14.4542 None C None 328.0 None
1,3053 False Zabour, Miss. Thamine femalenan 1 0 2665 14.4542 None C None nan None
1,3063 False Zakarian, Mr. Mapriededer male 26.5 0 0 2656 7.225 None C None 304.0 None
1,3073 False Zakarian, Mr. Ortin male 27.0 0 0 2670 7.225 None C None nan None
1,3083 False Zimmerman, Mr. Leo male 29.0 0 0 315082 7.875 None S None nan None
[10]:
result = df.graphql.execute("""
    {
        df {
            min {
                age
                fare
            }
            mean {
                age
                fare
            }
            max {
                age
                fare
            }
            groupby {
                sex {
                   count
                   mean {
                       age
                   }
                }
            }
        }
    }
    """)
result.data
[10]:
OrderedDict([('df',
              OrderedDict([('min',
                            OrderedDict([('age', 0.1667), ('fare', 0.0)])),
                           ('mean',
                            OrderedDict([('age', 29.8811345124283),
                                         ('fare', 33.29547928134572)])),
                           ('max',
                            OrderedDict([('age', 80.0), ('fare', 512.3292)])),
                           ('groupby',
                            OrderedDict([('sex',
                                          OrderedDict([('count', [466, 843]),
                                                       ('mean',
                                                        OrderedDict([('age',
                                                                      [28.6870706185567,
                                                                       30.585232978723408])]))]))]))]))])

Pandas support

After importing vaex.graphql, vaex also installs a pandas accessor, so it is also accessible for Pandas DataFrames.

[11]:
df_pandas = df.to_pandas_df()
[20]:
df_pandas.graphql.execute("""
    {
        df(where: {age: {_gt: 20}}) {
            row(offset: 3, limit: 2) {
                name
                survived
            }
        }
    }
    """
).data
[20]:
OrderedDict([('df',
              OrderedDict([('row',
                            [OrderedDict([('name', 'Anderson, Mr. Harry'),
                                          ('survived', True)]),
                             OrderedDict([('name',
                                           'Andrews, Miss. Kornelia Theodosia'),
                                          ('survived', True)])])]))])

Server

The easiest way to learn to use the GraphQL language/vaex interface is to launch a server, and play with the GraphiQL graphical interface, its autocomplete, and the schema explorer.

We try to stay close to the Hasura API: https://docs.hasura.io/1.0/graphql/manual/api-reference/graphql-api/query.html

A server can be started from the command line:

$ python -m vaex.graphql myfile.hdf5

Or from within Python using df.graphql.serve

GraphiQL

See https://github.com/mariobuikhuizen/ipygraphql for a graphical widget, or a mybinder to try out a live example. image

[ ]:

I/O Kung-Fu: get your data in and out of Vaex

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

Data input

Every project starts with reading in some data. Vaex supports several data sources:

  • Binary file formats:

  • Text based file formats:

  • In-memory data representations:

    • pandas DataFrames and everything that pandas can read

    • Apache Arrow Tables

    • numpy arrays

    • Python dictionaries

    • Single row DataFrames

  • Cloud support:

    • Amazon Web Services S3

    • Google Cloud Storage

    • Other cloud storage options

  • Extras:

    • Aliases

The following examples show the best practices of getting your data in Vaex.

Opening binary file formats

If your data is already in one of the supported binary file formats (HDF5, Apache Arrow, Apache Parquet, FITS), opening it with Vaex rather simple:

[1]:
import vaex

# Reading a HDF5 file
df_names = vaex.open('../data/io/sample_names_1.hdf5')
df_names
[1]:
# name agecity
0John 17Edinburgh
1Sally 33Groningen

When opening a HDF5 file, one can specify which group to read:

df_group = vaex.open('my_file_with_groups.hdf5', group='/path/to/my/table')

For a worked example please see the Exporting binary file formats section.

Opening an arrow or a parquet file is just as simple:

[2]:
# Reading an arrow file
df_fruits = vaex.open('../data/io/sample_fruits.arrow')
df_fruits
[2]:
# fruit amountorigin
0mango 5Malaya
1banana 10Ecuador
2orange 7Spain

Opening such data is instantenous regardless of the file size on disk: Vaex will just memory-map the data instead of reading it in memory. This is the optimal way of working with large datasets that are larger than available RAM.

If your data is contained within multiple files, one can open them all simultaneously like this:

[3]:
df_names_all = vaex.open('../data/io/sample_names_*.hdf5')
df_names_all
[3]:
# name agecity
0John 17Edinburgh
1Sally 33Groningen
2Maria 23Caracas
3Monica 55New York

Alternatively, one can use the open_many method to pass a list of files to open:

[4]:
df_names_all = vaex.open_many(['../data/io/sample_names_1.hdf5',
                               '../data/io/sample_names_2.hdf5'])
df_names_all
[4]:
# name agecity
0John 17Edinburgh
1Sally 33Groningen
2Maria 23Caracas
3Monica 55New York

The result will be a single DataFrame object containing all of the data coming from all files.

[5]:
# Reading a parquet file
df_cars = vaex.open('../data/io/sample_cars.parquet')
df_cars
[5]:
# car color year
0renaultred 1996
1audi black 2005
2toyota blue 2000

Text based file formats

Datasets are still commonly stored in text-based file formats such as CSV and JSON. Vaex supports various methods for reading such datasets.

New in 4.14:

The vaex.open method can also be used to read a CSV file. With this method Vaex will lazily read the CSV file, i.e. the data from the CSV file will be streamed when computations need to be executed. This is powered by Apache Arrow under the hood. In this way you can work with arbitraruly large CSV files without caring about RAM!

Note: When opening a CSV file in this way, Vaex will first quickly scan the file to determine some basic metadata such as the number of rows, column names and their data types. The duration of this can vary depending on the number of rows, columns, your disk read speed, and infer_schema_fraction telling Vaex what fraction of the file to read to determine the metadata.

One can use the convert argument, for example vaex.open('my_file.csv', convert='my_file.hdf5'), easily covert a CSV file to HDF5 for faster access, or to a Parquet file for storage considerations.

[6]:
df_nba_lazy = vaex.open('../data/io/sample_nba_1.csv')  # Read lazily, not kept in RAM
df_nba_lazy
[6]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

It can be more practicle to simply read smaller datasets in memory. This is easily done with:

[7]:
df_nba = vaex.from_csv('../data/io/sample_nba_1.csv', copy_index=False)
df_nba
[7]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

or alternatively:

[8]:
df_nba = vaex.read_csv('../data/io/sample_nba_1.csv', copy_index=False)
df_nba
[8]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

Vaex is using Pandas for reading CSV files in the background, so one can pass any arguments to the vaex.from_csv or vaex.read_csv as one would pass to pandas.read_csv and specify for example separators, column names and column types. The copy_index parameter specifies if the index column of the Pandas DataFrame should be read as a regular column, or left out to save memory. In addition to this, if you specify the convert=True argument, the data will be automatically converted to an HDF5 file behind the scenes, thus freeing RAM and allowing you to work with your data in a memory-efficient, out-of-core manner.

If the CSV file is so large that it can not fit into RAM all at one time, one can convert the data to HDF5 simply by:

df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)

When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files deleted. The size of the individual chunks to be read can be specified via the chunk_size argument. Note that this automatic conversion requires free disk space of twice the final HDF5 file size.

It often happens that the data we need to analyse is spread over multiple CSV files. One can convert them to the HDF5 file format like this:

[9]:
list_of_files = ['../data/io/sample_nba_1.csv',
                 '../data/io/sample_nba_2.csv',
                 '../data/io/sample_nba_3.csv',]

# Convert each CSV file to HDF5
for file in list_of_files:
    df_tmp = vaex.from_csv(file, convert=True, copy_index=False)

The above code block converts in turn each CSV file to the HDF5 format. Note that the conversion will work regardless of the file size of each individual CSV file, provided there is sufficient storage space.

Working with all of the data is now easy: just open all of the relevant HDF5 files as described above:

[10]:
df = vaex.open('../data/io/sample_nba_*.csv.hdf5')
df
[10]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
3Los Angeles Lakers Kobe Bryant
4Toronto RaptorsVince Carter
5Philadelphia76ers Allen Iverson
6San Antonio Spurs Tim Duncan

One can than additionally export this combined DataFrame to a single HDF5 file. This should lead to minor performance improvements.

[11]:
df.export('../data/io/sample_nba_combined.hdf5')

Reading larger CSV files via Pandas can be slow. Apache Arrow provides a considerably faster of reading such files. Vaex conveniently exposes this functionality:

[12]:
df_nba_arrow = vaex.from_csv_arrow('../data/io/sample_nba_1.csv')
df_nba_arrow
[12]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

In fact, Apache Arrow parses CSV files so fast, it can be used to stream the data and effectively enable lazy reading. By passing lazy=True to the method above, one can work with CSV files that are much larger than available RAM. This is what is used under the hood of vaex.open to provide lazy reading of CSV files.

[13]:
df_nba_arrow_lazy = vaex.from_csv_arrow('../data/io/sample_nba_1.csv', lazy=True)
df_nba_arrow_lazy
[13]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

It is also common the data to be stored in JSON files. To read such data in Vaex one can do:

[14]:
df_isles = vaex.from_json('../data/io/sample_isles.json', orient='table', copy_index=False)
df_isles
[14]:
# isle size_sqkm
0Easter Island 163.6
1Fiji 18.333
2Tortuga 178.7

This is a convenience method which simply wraps pandas.read_json, so the same arguments and file reading strategy applies. If the data is distributed amongs multiple JSON files, one can apply a similar strategy as in the case of multiple CSV files: read each JSON file with the vaex.from_json method, convert it to a HDF5 or Arrow file format. Than use vaex.open or vaex.open_many methods to open all the converted files as a single DataFrame.

To learn more about different options of exporting data with Vaex, please read the next section below.

Cloud Support

Vaex supports streaming of HDF5, Apache Arrow, Apache Parquet, and CSV files from Amazon’s S3 and Google Cloud Storage. Here is an example of streaming an HDF5 file directly from S3:

[15]:
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
df.head_and_tail_print(3)
# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0 VTS 2015-02-27 22:11:38.0000000002015-02-27 22:22:51.0000000005 1 2.26 -74.006645 40.707497 1.0 0.0 -74.0096 40.73462 10.0 0.5 0.5 2.0 0.0 13.3
1 VTS 2015-08-04 00:36:01.0000000002015-08-04 00:47:11.0000000001 1 5.13 -74.00747 40.705235 1.0 0.0 -73.96727 40.755196 16.0 0.5 0.5 3.46 0.0 20.76
2 VTS 2015-01-28 19:56:52.0000000002015-01-28 20:03:27.0000000001 2 1.89 -73.97189 40.76286 1.0 0.0 -73.95513 40.78596 7.5 1.0 0.5 0.0 0.0 9.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
299,997CMT 2015-06-18 09:05:52.0000000002015-06-18 09:28:19.0000000001 1 2.7 -73.95231 40.78091 1.0 0.0 -73.97917 40.75542 15.0 0.0 0.5 1.25 0.0 17.05
299,998VTS 2015-04-17 11:13:46.0000000002015-04-17 11:33:19.0000000001 2 1.75 -73.951935 40.77804 1.0 0.0 -73.9692 40.763924 13.0 0.0 0.5 0.0 0.0 13.8
299,999VTS 2015-05-29 07:00:45.0000000002015-05-29 07:17:47.0000000005 2 8.94 -73.95345 40.77932 1.0 0.0 -73.86702 40.77094 26.0 0.0 0.5 0.0 5.54 32.34

One can also use the fs_options to specify any arguments that need to be passed to an external file system if needed:

For example:

[16]:
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5', fs_options={'anon': True})
df.head(3)
[16]:
# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0VTS 2015-02-27 22:11:38.0000000002015-02-27 22:22:51.000000000 5 1 2.26 -74.0066 40.7075 1 0 -74.0096 40.7346 10 0.5 0.5 2 0 13.3
1VTS 2015-08-04 00:36:01.0000000002015-08-04 00:47:11.000000000 1 1 5.13 -74.0075 40.7052 1 0 -73.9673 40.7552 16 0.5 0.5 3.46 0 20.76
2VTS 2015-01-28 19:56:52.0000000002015-01-28 20:03:27.000000000 1 2 1.89 -73.9719 40.7629 1 0 -73.9551 40.786 7.5 1 0.5 0 0 9.3

When streaming HDF5 files, fs_options also accepts the “cache” options. When True, as is the default, Vaex will lazily download and cache the data to the local machine. “Lazily download” means that Vaex will only download the portions of the data you really need.

For example: imagine that we have a file hosted on S3 that has 100 columns and 1 billion rows. Getting a preview of the DataFrame via print(df) for instance will download only the first and last 5 rows. If we then proceed to make calculations or plots with only 5 columns, only the data from those columns will be downloaded and cached to the local machine.

By default, the data streamed from S3 and GCS is cached at $HOME/.vaex/file-cache/s3 and $HOME/.vaex/file-cache/gs respectively, and thus successive access is as fast as native disk access.

Streaming Apache Arrow and Apache Parquet is just as simple. Caching is available for these file formats, but using the Apache Arrow format will currently read all the data when opening the file, so less useful. For maximum performance, we always advise to use a compute instance at the same region as the bucket.

Here is an example of reading an Apache Arrow file straight from Google Cloud Storage:

df = vaex.open('gs://vaex-data/airlines/us_airline_2019_mini.arrow', fs_options={'anon': True})
df

Apache Parquet files typically compressed, and therefore are often a better choice for cloud environments, since the tend to keep the storage and transfer costs lower. Here is an example of opening a Parquet file from Google Cloud Storage.

df = vaex.open('gs://vaex-data/airlines/us_airline_2019_mini.parquet', fs_options={'anon': True})
df

The following table summarizes the current capabilities of Vaex to read, cache and write different file formats to Amazon S3 and Google Cloud Storage.

Format

Read

Cache

Write

HDF5

Yes

Yes

No*

Arrow

Yes

No*

Yes

Parquet

Yes

No*

Yes

FITS

Yes

No*

Yes

CSV

Yes

???

Yes

No* - this is not available now, but should be possible in the future. Please contact vaex.io for more information.

Other cloud storage options - Minio example

Minio is an S3 compatible object storage server, which can be used instead of AWS’ S3 service. Assuming a Minio setup like this:

$ export DATA_ROOT=/data/tmp
$ mkdir $DATA_ROOT/taxi
$ wget https://github.com/vaexio/vaex-datasets/releases/download/1.1/yellow_taxi_2012_zones.parquet --directory-prefix $DATA_ROOT/taxi/
$ docker run -it --rm -p 9000:9000 --name minio1 -v $DATA_ROOT:/data -e "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" -e "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" minio/minio server /data

This creates a running Minio server available at localhost:9000, hosting a bucket called ‘taxi’, with 1 parquet file. We can now connect to it using Vaex. From the web interface we can get a URL, in this case in the form of:

http://localhost:9000/taxi/yellow_taxi_2012_zones.parquet?Content-Disposition=attachment%3B%20filename%3D%22yellow_taxi_2012_zones.parquet%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20210707%2F%2Fs3%2Faws4_request&X-Amz-Date=20210707T085053Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=03e0b6718a95be0fd0d679c4fc52bc26f9ce9f7845877866d5caa709e9b0e12c

This is not the S3 URL you should provide to Vaex (or Apache Arrow for that matter, which is used by Vaex). Instead the correct URL is of the form s3://bucket/path/to/file.ext. We also need to tell Vaex to connect to the server by passing the appropriate fs_options:

df = vaex.open('s3://taxi/yellow_taxi_2012_zones.parquet', fs_options=dict(
    endpoint_override='localhost:9000',
    access_key='AKIAIOSFODNN7EXAMPLE',
    secret_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    scheme='http')
)

In-memory data representations

One can construct a Vaex DataFrame from a variety of in-memory data representations. Such a common operation is converting a Pandas into a Vaex DataFrame. Let us read in a CSV file with Pandas and than convert it to a Vaex DataFrame:

[17]:
import pandas as pd

pandas_df = pd.read_csv('../data/io/sample_nba_1.csv')
pandas_df
[17]:
city team player
0 Indianopolis Pacers Reggie Miller
1 Chicago Bulls Michael Jordan
2 Boston Celtics Larry Bird
[18]:
df = vaex.from_pandas(df=pandas_df, copy_index=True)
df
[18]:
# city team player index
0IndianopolisPacers Reggie Miller 0
1Chicago Bulls Michael Jordan 1
2Boston CelticsLarry Bird 2

The copy_index argument specifies whether the index column of a Pandas DataFrame should be imported into the Vaex DataFrame. Converting a Pandas into a Vaex DataFrame is particularly useful since Pandas can read data from a large variety of file formats. For instance, we can use Pandas to read data from a database, and then pass it to Vaex like so:

import vaex
import pandas as pd
import sqlalchemy

connection_string = 'postgresql://readonly:' + 'my_password' + '@server.company.com:1234/database_name'
engine = sqlalchemy.create_engine(connection_string)

pandas_df = pd.read_sql_query('SELECT * FROM MYTABLE', con=engine)
df = vaex.from_pandas(pandas_df, copy_index=False)

Another example is using pandas to read in SAS files:

[19]:
pandas_df = pd.read_sas('../data/io/sample_airline.sas7bdat')
df = vaex.from_pandas(pandas_df, copy_index=False)
df
[19]:
# YEAR Y W R L K
0 1948.01.21399998664855960.243000000715255740.1454000025987625 1.41499996185302730.6119999885559082
1 1949.01.35399997234344480.259999990463256840.218099996447563171.38399994373321530.5590000152587891
2 1950.01.569000005722046 0.277999997138977050.3156999945640564 1.38800001144409180.5730000138282776
3 1951.01.94799995422363280.296999990940094 0.393999993801116941.54999995231628420.5640000104904175
4 1952.02.265000104904175 0.3100000023841858 0.355899989604949951.80200004577636720.5740000009536743
... ... ... ... ... ... ...
271975.018.72100067138672 1.246999979019165 0.230100005865097055.72200012207031259.062000274658203
281976.019.25 1.375 0.3452000021934509 5.76200008392334 8.26200008392334
291977.020.64699935913086 1.5440000295639038 0.450800001621246345.876999855041504 7.473999977111816
301978.022.72599983215332 1.7029999494552612 0.5877000093460083 6.107999801635742 7.104000091552734
311979.023.6189994812011721.7790000438690186 0.534600019454956 6.85200023651123056.874000072479248

One can read in an arrow table as a Vaex DataFrame in a similar manner. Let us first use pyarrow to read in a CSV file as an arrow table.

[20]:
import pyarrow.csv

arrow_table = pyarrow.csv.read_csv('../data/io/sample_nba_1.csv')
arrow_table
[20]:
pyarrow.Table
city: string
team: string
player: string
----
city: [["Indianopolis","Chicago","Boston"]]
team: [["Pacers","Bulls","Celtics"]]
player: [["Reggie Miller","Michael Jordan","Larry Bird"]]

Once we have the arrow table, converting it to a DataFrame is simple:

[21]:
df = vaex.from_arrow_table(arrow_table)
df
[21]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird

It also common to construct a Vaex DataFrame from numpy arrays. That can be done like this:

[22]:
import numpy as np

x = np.arange(2)
y = np.array([10, 20])
z = np.array(['dog', 'cat'])


df_numpy = vaex.from_arrays(x=x, y=y, z=z)
df_numpy
[22]:
# x yz
0 0 10dog
1 1 20cat

Constructing a DataFrame from a Python dict is also straight-forward:

[23]:
# Construct a DataFrame from Python dictionary
data_dict = dict(x=[2, 3], y=[30, 40], z=['cow', 'horse'])

df_dict = vaex.from_dict(data_dict)
df_dict
[23]:
# x yz
0 2 30cow
1 3 40horse

At times, one may need to create a single row DataFrame. Vaex has a convenience method which takes individual elements (scalars) and creates the DataFrame:

[24]:
df_single_row = vaex.from_scalars(x=4, y=50, z='mouse')
df_single_row
[24]:
# x yz
0 4 50mouse

Finally, we can choose to concatenate different DataFrames, without any memory penalties like so:

[25]:
df = vaex.concat([df_numpy, df_dict, df_single_row])
df
[25]:
# x yz
0 0 10dog
1 1 20cat
2 2 30cow
3 3 40horse
4 4 50mouse

Extras

Vaex allows you to make alias to the locations of your most used datasets. They can be local or in the cloud:

[26]:
vaex.aliases['nba'] = '../data/io/sample_nba_1.csv'
vaex.aliases['nyc_taxi_aws'] = 's3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true'
[27]:
df = vaex.open('nba')
df
[27]:
# city team player
0IndianopolisPacers Reggie Miller
1Chicago Bulls Michael Jordan
2Boston CelticsLarry Bird
[28]:
df = vaex.open('nyc_taxi_aws')
df.head_and_tail_print(3)
# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0 VTS 2015-02-27 22:11:38.0000000002015-02-27 22:22:51.0000000005 1 2.26 -74.006645 40.707497 1.0 0.0 -74.0096 40.73462 10.0 0.5 0.5 2.0 0.0 13.3
1 VTS 2015-08-04 00:36:01.0000000002015-08-04 00:47:11.0000000001 1 5.13 -74.00747 40.705235 1.0 0.0 -73.96727 40.755196 16.0 0.5 0.5 3.46 0.0 20.76
2 VTS 2015-01-28 19:56:52.0000000002015-01-28 20:03:27.0000000001 2 1.89 -73.97189 40.76286 1.0 0.0 -73.95513 40.78596 7.5 1.0 0.5 0.0 0.0 9.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
299,997CMT 2015-06-18 09:05:52.0000000002015-06-18 09:28:19.0000000001 1 2.7 -73.95231 40.78091 1.0 0.0 -73.97917 40.75542 15.0 0.0 0.5 1.25 0.0 17.05
299,998VTS 2015-04-17 11:13:46.0000000002015-04-17 11:33:19.0000000001 2 1.75 -73.951935 40.77804 1.0 0.0 -73.9692 40.763924 13.0 0.0 0.5 0.0 0.0 13.8
299,999VTS 2015-05-29 07:00:45.0000000002015-05-29 07:17:47.0000000005 2 8.94 -73.95345 40.77932 1.0 0.0 -73.86702 40.77094 26.0 0.0 0.5 0.0 5.54 32.34

Data export

One can export Vaex DataFrames to multiple file or in-memory data representations:

Exporting binary file formats

The most efficient way to store data on disk when you work with Vaex is to use binary file formats. Vaex can export a DataFrame to HDF5, Apache Arrow, Apache Parquet and FITS:

[29]:
df.export_hdf5('../data/io/output_data.hdf5')
df.export_arrow('../data/io/output_data.arrow')
df.export_parquet('../data/io/output_data.parquet')

Alternatively, one can simply use:

[30]:
df.export('../data/io/output_data.hdf5')
df.export('../data/io/output_data.arrow')
df.export('../data/io/output_data.parquet')

where Vaex will determine the file format of the based on the specified extension of the file name. If the extension is not recognized, an exception will be raised.

When exporing to HDF5, you can specify a particular group. An existing HDF5 file can also be appended, by another dataset, but it needs to be in another group. For example:

[31]:
df1 = vaex.from_arrays(x=[1, 2, 3], y=[-0.5, 0, 0.5])
df2 = vaex.from_arrays(s1=['Apple', 'Orange', 'Peach'], s2=['potato', 'carrot', 'cucumber'])

df1.export_hdf5('../data/io/output_hdf5_file_with_multiple_groups.hdf5', mode='w', group='/numbers')
df2.export_hdf5('../data/io/output_hdf5_file_with_multiple_groups.hdf5', mode='a', group='/food')

As explained in the Opening binary formats section, this newly created file can be opened by passing the group argument to the vaex.open method:

[32]:
df_food = vaex.open('../data/io/output_hdf5_file_with_multiple_groups.hdf5', group='food')
df_food
[32]:
# s1 s2
0Apple potato
1Orangecarrot
2Peach cucumber
[33]:
df_nums = vaex.open('../data/io/output_hdf5_file_with_multiple_groups.hdf5', group='numbers')
df_nums
[33]:
# x y
0 1-0.5
1 2 0
2 3 0.5

When exporting to Apache Arrow and Apache Parquet file format, the data is written in chunks thus enabling to export of data that does not fit in RAM all at once. A custom chunk size can be specified via the chunk_size argument, the default value of which is 1048576. For example:

[34]:
df.export('../data/io/output_data.parquet', chunk_size=10_000)

Vaex supports direct writing to Amazon’s S3 and Google Cloud Storage buckets when exporting the data to Apache Arrow and Apache Parquet file formats. Much like when opening a file, the fs_options dictionary can be specified to pass arguments to the underlying file system, for example authentication credentials. Here are two examples:

# Export to Google Cloud Storage
df.export_arrow(to='gs://my-gs-bucket/my_data.arrow', fs_options={'token': my_token})

# Export to Amazon's S3
df.export_parquet(to='s3://my-s3-bucket/my_data.parquet', fs_options={'access_key': my_key, 'secret_key': my_secret_key})

Text based file format

At times, it may be useful to export the data to disk in a text based file format such as CSV. In that case one can simply do:

[35]:
df.export_csv('../data/io/output_data.csv')  # `chunk_size` has a default value of 1_000_000

The df.export_csv method is using pandas_df.to_csv behind the scenes, and thus one can pass any argument to df.export_csv as would to pandas_df.to_csv. The data is exported in chunks and the size of those chunks can be specified by the chunk_size argument in df.export_csv. In this way, data that is too large to fit in RAM can be saved to disk.

If one needs to export a larger DataFrame to CSV, the Apache Arrow backend provides better performance:

[36]:
df.export_csv_arrow('../data/io/output_data.csv')

Export to multiple files in parallel

With the export_many method one can export a DataFrame to muliple files of the same type in parallel. This is likely to be more performant when exporting very large DataFrames to the cloud compared to writing a single large Arrow of Parquet file, where each chunk is written in succession. The method also accepts the fs_options dictonary, and can be particularly convenient when exporting to cloud storage.

[37]:
df.export_many('../data/io/output_chunk-{i:02}.parquet', chunk_size=100_000)
[38]:
!ls ./data/io/output_chunk*.parquet
./data/io/output_chunk-00.parquet  ./data/io/output_chunk-02.parquet
./data/io/output_chunk-01.parquet

In memory data representation

Python has a rich ecosystem comprised of various libraries for data manipulation, that offer different functionality. Thus, it is often useful to be able to pass data from one library to another. Vaex is able to pass on its data to other libraries via a number of in-memory representations.

DataFrame representations

A Vaex DataFrame can be converted to a pandas DataFrame like so:

[39]:
df = vaex.open('../data/io/sample_simple.hdf5')
pandas_df = df.to_pandas_df()
pandas_df  # looks the same doesn't it?
[39]:
x y z
0 0 10 dog
1 1 20 cat
2 2 30 cow
3 3 40 horse
4 4 50 mouse

For DataFrames that are too large to fit in memory, one can specify the chunk_size argument, in which case the to_pandas_dfmethod returns a generator yileding a pandas DataFrame with as many rows as indicated by the chunk_size argument:

[40]:
gen = df.to_pandas_df(chunk_size=3)

for i1, i2, chunk in gen:
    print(i1, i2)
    print(chunk)
    print()
0 3
   x   y    z
0  0  10  dog
1  1  20  cat
2  2  30  cow

3 5
   x   y      z
0  3  40  horse
1  4  50  mouse

The generator also yields the row number of the first and the last element of that chunk, so we know exactly where in the parent DataFrame we are. The following DataFrame methods also support the chunk_size argument with the same behaviour.

Converting a Vaex DataFrame into an arrow table is similar:

[41]:
arrow_table = df.to_arrow_table()
arrow_table
[41]:
pyarrow.Table
x: int64
y: int64
z: string
----
x: [[0,1,2,3,4]]
y: [[10,20,30,40,50]]
z: [["dog","cat","cow","horse","mouse"]]

One can simply convert the DataFrame to a list of arrays. By default, the data is exposed as a list of numpy or arrow arrays:

[42]:
arrays = df.to_arrays()
arrays
[42]:
[array([0, 1, 2, 3, 4]),
 array([10, 20, 30, 40, 50]),
 <pyarrow.lib.StringArray object at 0x7ff265f91c40>
 [
   "dog",
   "cat",
   "cow",
   "horse",
   "mouse"
 ]]

By specifying the array_type argument, one can choose whether the data will be represented by numpy arrays, xarrays, or Python lists.

[43]:
arrays = df.to_arrays(array_type='xarray')
arrays  # list of xarrays
[43]:
[<xarray.DataArray (dim_0: 5)>
 array([0, 1, 2, 3, 4])
 Dimensions without coordinates: dim_0,
 <xarray.DataArray (dim_0: 5)>
 array([10, 20, 30, 40, 50])
 Dimensions without coordinates: dim_0,
 <xarray.DataArray (dim_0: 5)>
 array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)
 Dimensions without coordinates: dim_0]
[44]:
arrays = df.to_arrays(array_type='list')
arrays  # list of lists
[44]:
[[0, 1, 2, 3, 4],
 [10, 20, 30, 40, 50],
 ['dog', 'cat', 'cow', 'horse', 'mouse']]

Keeping it close to pure Python, one can export a Vaex DataFrame as a dictionary. The same array_type keyword argument applies here as well:

[45]:
d_dict = df.to_dict(array_type='numpy')
d_dict
[45]:
{'x': array([0, 1, 2, 3, 4]),
 'y': array([10, 20, 30, 40, 50]),
 'z': array(['dog', 'cat', 'cow', 'horse', 'mouse'], dtype=object)}

Alternatively, one can also convert a DataFrame to a list of tuples, were the first element of the tuple is the column name, while the second element is the array representation of the data.

[46]:
# Get a single item list
items = df.to_items(array_type='list')
items
[46]:
[('x', [0, 1, 2, 3, 4]),
 ('y', [10, 20, 30, 40, 50]),
 ('z', ['dog', 'cat', 'cow', 'horse', 'mouse'])]

When interacting with various types of APIs, it is common to pass a list of “records”, where a record is a dictionary describing a single row of the DataFrame:

[47]:
records = df.to_records()
records
[47]:
[{'x': 0, 'y': 10, 'z': 'dog'},
 {'x': 1, 'y': 20, 'z': 'cat'},
 {'x': 2, 'y': 30, 'z': 'cow'},
 {'x': 3, 'y': 40, 'z': 'horse'},
 {'x': 4, 'y': 50, 'z': 'mouse'}]

As mentioned earlier, with all of the above example, one can use the chunk_size argument which creates a generator, yielding a portion of the DataFrame in the specified format. In the case of .to_dict method:

[48]:
gen = df.to_dict(array_type='list', chunk_size=2)

for i1, i2, chunk in gen:
    print(i1, i2, chunk)
0 2 {'x': [0, 1], 'y': [10, 20], 'z': ['dog', 'cat']}
2 4 {'x': [2, 3], 'y': [30, 40], 'z': ['cow', 'horse']}
4 5 {'x': [4], 'y': [50], 'z': ['mouse']}

Last but not least, a Vaex DataFrame can be lazily exposed as a Dask array:

[49]:
dask_arrays = df[['x', 'y']].to_dask_array()   # String support coming soon
dask_arrays
[49]:
Array Chunk
Bytes 80 B 80 B
Shape (5, 2) (5, 2)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
2 5
Expression representations

A single Vaex Expression can be also converted to a variety of in-memory representations:

[50]:
# pandas Series
x_series = df.x.to_pandas_series()
x_series
[50]:
0    0
1    1
2    2
3    3
4    4
dtype: int64
[51]:
# numpy array
x_numpy = df.x.to_numpy()
x_numpy
[51]:
array([0, 1, 2, 3, 4])
[52]:
# Python list
x_list = df.x.tolist()
x_list
[52]:
[0, 1, 2, 3, 4]
[53]:
# Dask array
x_dask_array = df.x.to_dask_array()
x_dask_array
[53]:
Array Chunk
Bytes 40 B 40 B
Shape (5,) (5,)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
5 1

Handling missing or invalid data

Data in the real world is seldom clean and never perfect. It often happens that we end up with “missing” or “invalid” data. There are countless reasons for why data can be missing: an instrument failed to make a recording in the real world, there was a temporary or no connection between the instrument and the computer storing the readings, maybe our scraper failed to gather all of the data, or our tracking tool did not manage to record all events.. this list can go on and on.

In addition to this, during our analysis we can sometimes make a wrong turn and “corrupt” our data by dividing by zero, or taking the logarithm of a negative number. In addition, a sensor or a human may record invalid values that we want to highlight in a special way.

In Vaex we have 3 ways of representing these special values:

  • “missing” or “masked” values;

  • “not a number” or nan values;

  • “not available” or na values.

If you have used Vaex, you may have noticed some DataFrame methods, Expression methods, or method arguments referencing “missing”, “nan”, “na”. Here are some examples:

“missing”

“nan”

“na”

df.dropmissing

df.dropnan

df.dropna

df.x.countmissing

df.x.countnan

df.x.countna

df.x.ismissing

df.x.isnan

df.x.isna

df.x.fillmissing

df.x.fillnan

df.x.fillna

In what follows we will explain the difference between these 3 types of values, when they should be used, and why does Vaex makes the distinction between them.

“nan” vs “missing” vs “na”

Summary (TLDR;)

The following table summarizes the differences between missing values, nan values and na:

missing or masked values

Not a number (nan)

Not available (na)

dtype

Any dtype

Float

Any dtype but only truly relevant for float

Meaning

Total absence of data

Data is present, but is corrupted or can not be represented in numeric form (e.g. log(-5))

Union of missing and nan values

Use case

Sensor did not make a measurement

Sensor made a measurement but the data is corrupted, or mathematical transformation leads to an invalid / non-numerical values

It is up to the user to decide

Not a number or nan

Many data practitioners, perhaps erroneously, interchangeably use the term nan and the term missing values. In fact nan values are commonly used as sentinel values to generally indicate invalid data. This is inaccurate because nan values are in fact special float values. nan is a shorthand for “not a number”, which is meant to indicate a value that is not a number in a sequence of floats, and thus in itself is not missing. It is used to represent values that are undefined mathematically, such as 0/0 or log(-5), or for data that does exist but is corrupted or can not be represented in numerical form. Note that there is no such corresponding value for integers for example, or for non-numeric types such as string.

In Python one can use nan values via the math standard library (e.g.: math.nan) or via the numpy library (e.g.: numpy.nan).

So why are nan values synonymous with missing values? It is hard to tell. One guess is that data practitioners found using numpy.nan a convenient shortcut to representing an “missing” or invalid value in arrays. Numpy does have a proper way of indicating a missing values via masked arrays (more on that in the next section), but for many that API can be less convenient and requires an addition knowledge of how to handle those array types. This effect might have been more enhanced by Pandas, in which for a long time nan values were the only way to indicate both invalid/corrupted and truly missing data.

Missing or masked values

Perhaps a better way to mark the absence of data is via missing or masked values. Python itself has a special object to indicate missing or no data, and that is the None object, which has its own NoneType type. The None object in Python is equivalent to the NULL value in SQL.

Modern data analysis libraries also implement their own ways of indicating missing values. For arrays that have missing data, Numpy implements so-called “masked arrays”. When constructing the arrays, in addition to data one is also required to provide a boolean mask. A True value in the mask array, indicates that the corresponding element in the data array is missing. In the example below, the last element of the masked array is missing:

import numpy as np

data = [23, 31, 0]
mask = [False, False, True]

my_masked_array = np.ma.masked_array(data, mask)

Pyarrow also implements a null type to indicate missing values in their data structures. Unlike Numpy that uses bytemasks, Pyarrow uses bitmasks to indicate missing data which make it more memory efficient. Note that in Pyarrow, if the mask has a value of 1 it means that the data is present, while 0 indicates missing data. Similarly to Vaex, Pyarrow also makes the distinction between nan and null values.

In more recent versions, Pandas also implements a pd.NA value to indicate missing values, which can be used in arrays or Series various types and not just float.

In Vaex, missing data are null values if the underlying array is backed by Pyarrow, and masked values if the underlying array is a Numpy masked array.

When are missing values used in practice? They are used to indicate data that was not collected, i.e. a sensor was scheduled to make a reading but it did not, or a doctor was supposed to make scan of a patient but they did not.

To contrast with nan values: missing or masked values indicate a complete absence of data, while nan values indicate the presence of data that can not be interpreted numerically. This can be a subtle but sometimes an important distinction to make

Not available or na

Vaex also implements methods referring to na which stands for Not available”, and is a union of both nan and missing values. This only really matters when dealing with Expressions of float type, since that is the only type that can have both missing and nan values. Of course if you do not make the distinction between nan and missing values in your code, use can use methods that refer to na to encompass both cases and simplify development.

Examples

Let us consider the following DataFrame:

[1]:
import vaex
import numpy as np
import pyarrow as pa

x = np.ma.array(data=[1, 0, 3, 4, np.nan], mask=[False, True, False, False, False])
y = pa.array([10, 20, None, 40, 50])
z = pa.array(['Reggie Miller', 'Michael Jordan', None, None, 'Kobe Bryant'])
w = pa.array([
        {'city': 'Indianapolis', 'team': 'Pacers'},
        None,
        {'city': 'Dallas', 'team': 'Mavericks'},
        None,
        {'city': 'Los Angeles', 'team': 'Lakers'}
    ])

df = vaex.from_arrays(x=x, y=y, z=z, w=w)
df
[1]:
# x y z w
01.010 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}
1-- 20 Michael Jordan--
23.0-- -- {'city': 'Dallas', 'team': 'Mavericks'}
34.040 -- --
4nan50 Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}

The df contains a float column x which in turn contains both a missing (masked) value and a nan value. The columns y, z, and w which are of dtype int, string, and struct respectively can only contain masked values in addition to their nominal type.

For example, if we want to drop all rows with missing values from the entire DataFrame, we can use the dropmissing method:

[2]:
df.dropmissing()
[2]:
# x yz w
0 1 10Reggie Miller{'city': 'Indianapolis', 'team': 'Pacers'}
1nan 50Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}

We see that all missing (masked) values are dropped, but the nan value in column x is still present since it is not technically “missing”.

If we want drop all nan values from the DataFrame we can do so via the corresponding dropnan method:

[3]:
df.dropnan()
[3]:
# x y z w
01.010 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}
1-- 20 Michael Jordan--
23.0-- -- {'city': 'Dallas', 'team': 'Mavericks'}
34.040 -- --

Now we see that the nan value from the column x is no longer in the DataFrame, but all the other missing values are still there.

If we simply want to get rid of all values that are not available for us to use directly, we can use the dropna method:

[4]:
df.dropna()
[4]:
# x yz w
0 1 10Reggie Miller{'city': 'Indianapolis', 'team': 'Pacers'}

Now we see that only rows containing valid data entries remain.

Machine Learning: the Iris dataset

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

While vaex.ml does not yet implement predictive models, we provide wrappers to powerful libraries (e.g. Scikit-learn, xgboost) and make them work efficiently with vaex. vaex.ml does implement a variety of standard data transformers (e.g. PCA, numerical scalers, categorical encoders) and a very efficient KMeans algorithm that take full advantage of vaex.

The following is a simple example on use of vaex.ml. We will be using the well known Iris dataset, and we will use it to build a model which distinguishes between the three Irish species (Iris setosa, Iris virginica and Iris versicolor).

Lets start by importing the common libraries, load and inspect the data.

[1]:
import vaex
import vaex.ml

import matplotlib.pyplot as plt


df = vaex.datasets.iris()
df
[1]:
# sepal_length sepal_width petal_length petal_width class_
0 5.9 3.0 4.2 1.5 1
1 6.1 3.0 4.6 1.4 1
2 6.6 2.9 4.6 1.3 1
3 6.7 3.3 5.7 2.1 2
4 5.5 4.2 1.4 0.2 0
... ... ... ... ... ...
1455.2 3.4 1.4 0.2 0
1465.1 3.8 1.6 0.2 0
1475.8 2.6 4.0 1.2 1
1485.7 3.8 1.7 0.3 0
1496.2 2.9 4.3 1.3 1

Splitting the data into train and test steps should be done immediately, before any manipulation is done on the data. vaex.ml contains a train_test_split method which creates shallow copies of the main DataFrame, meaning that no extra memory is used when defining train and test sets. Note that the train_test_split method does an ordered split of the main DataFrame to create the two sets. In some cases, one may need to shuffle the data.

If shuffling is required, we recommend the following:

df.shuffle().export("shuffled.hdf5")
df = vaex.open("shuffled.hdf5")
df_train, df_test = df.ml.train_test_split(test_size=0.2)

In the present scenario, the dataset is already shuffled, so we can simply do the split right away.

[2]:
# Orderd split in train and test
df_train, df_test = df.ml.train_test_split(test_size=0.2)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/ml/__init__.py:209: UserWarning: Make sure the DataFrame is shuffled
  warnings.warn('Make sure the DataFrame is shuffled')

As this is a very simple tutorial, we will just use the columns already provided as features for training the model.

[3]:
features = df_train.column_names[:4]
features
[3]:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

PCA

The vaex.ml module contains several classes for dataset transformations that are commonly used to pre-process data prior to building a model. These include numerical feature scalers, category encoders, and PCA transformations. We have adopted the scikit-learn API, meaning that all transformers have the .fit and .transform methods.

Let’s use apply a PCA transformation on the training set. There is no need to scale the data beforehand, since the PCA also normalizes the data.

[4]:
pca = vaex.ml.PCA(features=features, n_components=4)
df_train = pca.fit_transform(df_train)
df_train
[4]:
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3
0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022
1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599-0.09961452867004605-0.14960589756342935
2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.0066478967991949744
3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703-0.17571839830952185-0.05420541515837107
4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505
... ... ... ... ... ... ... ... ... ...
1155.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989
1165.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266-0.20573646329612738
1175.8 2.6 4.0 1.2 1 -0.22075578997244774-0.401523366515551370.25417836518749715 0.04952191889168374
1185.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161
1196.2 2.9 4.3 1.3 1 -0.6256358184862005 0.0239304743336751680.21203674475657858 -0.0077954052328795265

The result of pca .fit_transform method is a shallow copy of the DataFrame which contains the resulting columns of the transformation, in this case the PCA components, as virtual columns. This means that the transformed DataFrame takes no memory at all! So while this example is made with only 120 sample, this would work in the same way even for millions or billions of samples.

Gradient boosting trees

Now let’s train a gradient boosting model. While vaex.ml does not currently include this type of models, we support the popular boosted trees libraries xgboost, lightgbm, and catboost. In this tutorial we will use the lightgbm classifier.

[9]:
import lightgbm
import vaex.ml.sklearn

# Features on which to train the model
train_features = df_train.get_column_names(regex='PCA_.*')
# The target column
target = 'class_'

# Instantiate the LightGBM Classifier
booster = lightgbm.sklearn.LGBMClassifier(num_leaves=5,
                                          max_depth=5,
                                          n_estimators=100,
                                          random_state=42)

# Make it a vaex transformer (for the automagic pipeline and lazy predictions)
model = vaex.ml.sklearn.Predictor(features=train_features,
                                  target=target,
                                  model=booster,
                                  prediction_name='prediction')

# Train and predict
model.fit(df=df_train)
df_train = model.transform(df=df_train)

df_train
[9]:
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction
0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022 1
1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599-0.09961452867004605-0.14960589756342935 0
2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.00664789679919497441
3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703-0.17571839830952185-0.05420541515837107 0
4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505 2
... ... ... ... ... ... ... ... ... ... ...
1155.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989 0
1165.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266-0.20573646329612738 0
1175.8 2.6 4.0 1.2 1 -0.22075578997244774-0.401523366515551370.25417836518749715 0.04952191889168374 1
1185.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161 0
1196.2 2.9 4.3 1.3 1 -0.6256358184862005 0.0239304743336751680.21203674475657858 -0.00779540523287952651

Notice that after training the model, we use the .transform method to obtain a shallow copy of the DataFrame which contains the prediction of the model, in a form of a virtual column. This makes it easy to evaluate the model, and easily create various diagnostic plots. If required, one can call the .predict method, which will result in an in-memory numpy.array housing the predictions.

Automatic pipelines

Assuming we are happy with the performance of the model, we can continue and apply our transformations and model to the test set. Unlike other libraries, we do not need to explicitly create a pipeline here in order to propagate the transformations. In fact, with vaex and vaex.ml, a pipeline is automatically being created as one is doing the exploration of the data. Each vaex DataFrame contains a state, which is a (serializable) object containing information of all transformations applied to the DataFrame (filtering, creation of new virtual columns, transformations).

Recall that the outputs of both the PCA transformation and the boosted model were in fact virtual columns, and thus are stored in the state of df_train. All we need to do, is to apply this state to another similar DataFrame (e.g. the test set), and all the changes will be propagated.

[6]:
state = df_train.state_get()
df_test.state_set(state)

df_test
[6]:
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction
0 5.9 3.0 4.2 1.5 1 -0.4978687101343986 -0.11289245880584761-0.119626012060696370.0625954090178564 1
1 6.1 3.0 4.6 1.4 1 -0.8754765898560835 -0.039024021195735940.022944044447894815-0.14143773065379384 1
2 6.6 2.9 4.6 1.3 1 -1.0228803632878913 0.2503709022470443 0.4130613754204865 -0.0303919115590032821
3 6.7 3.3 5.7 2.1 2 -2.2544508624315838 0.3431374410700749 -0.28908707579214765-0.07059175451207655 2
4 5.5 4.2 1.4 0.2 0 2.632289228948536 1.020394958612415 -0.20769510079946696-0.13744144140286718 0
... ... ... ... ... ... ... ... ... ... ...
255.5 2.5 4.0 1.3 1 -0.16189655085432594-0.6871827581512436 0.09773053160021669 0.07093166682594204 1
265.8 2.7 3.9 1.2 1 -0.12526327170089271-0.3148233189949767 0.19720893202789733 0.060419826927667064 1
274.4 2.9 1.4 0.2 0 2.8918941837640526 -0.6426744898497139 0.0061717958745104440.007700652884580328 0
284.5 2.3 1.3 0.3 0 2.850207707200544 -0.9710397723109179 0.38501428492268475 0.377723418991853 0
296.9 3.2 5.7 2.3 2 -2.405639277483925 0.4027072938482219 -0.229448178035409730.17443211711742812 2

Production

Now df_test contains all the transformations we applied on the training set (df_train), including the model prediction. The transfer of state from one DataFrame to another can be extremely valuable for putting models in production.

Performance

Finally, let’s check the model performance.

[7]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_true=df_test.class_.values, y_pred=df_test.prediction.values)
acc *= 100.
print(f'Test set accuracy: {acc}%')
Test set accuracy: 100.0%

The model get perfect accuracy of 100%. This is not surprising as this problem is rather easy: doing a PCA transformation on the features nicely separates the 3 flower species. Plotting the first two PCA axes, and colouring the samples according to their class already shows an almost perfect separation.

[8]:
plt.figure(figsize=(8, 4))
df_test.scatter(df_test.PCA_0, df_test.PCA_1, c_expr=df_test.class_, s=50)
plt.show()
_images/guides_ml_iris_16_0.png

Machine Learning: the Titanic dataset

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

In the following is a more involved machine learning example, in which we will use a larger variety of methods in veax to do data cleaning, feature engineering, pre-processing and finally to train a couple of models. To do this, we will use the well known Titanic dataset. Our task is to predict which passengers are more likely to have survived the disaster.

Before we begin, there are two important notes to consider: - The following example is not to provide a competitive score for any competitions that might use the Titanic dataset. It’s primary goal is to show how various methods provided by vaex and vaex.ml can be used to clean data, create new features, and do general data manipulations in a machine learning context. - While the Titanic dataset is rather small in side, all the methods and operations presented in the solution below will work on a dataset of arbitrary size, as long as the data fits on the hard-drive of your machine.

Now, with that out of the way, let’s get started!

[1]:
import vaex
import vaex.ml

import numpy as np
import matplotlib.pyplot as plt

Adjusting matplotlib parmeters

Intermezzo: we modify some of the matplotlib default settings, just to make the plots a bit more legible.

[2]:
SMALL_SIZE = 12
MEDIUM_SIZE = 14
BIGGER_SIZE = 16

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

Get the data

First of all we need to read in the data. Since the Titanic dataset is quite well known for trying out different classification algorithms, as well as commonly used as a teaching tool for aspiring data scientists, it ships (no pun intended) together with vaex.ml. So let’s read it in, see the description of its contents, and get a preview of the data.

[3]:
# Load the titanic dataset
df = vaex.datasets.titanic()

# See the description
df.info()

titanic

rows: 1,309

Columns:

columntypeunitdescriptionexpression
pclassint64
survivedbool
namestr
sexstr
agefloat64
sibspint64
parchint64
ticketstr
farefloat64
cabinstr
embarkedstr
boatstr
bodyfloat64
home_deststr

Data:

# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 True Allen, Miss. Elisabeth Walton female29.0 0 0 24160 211.3375B5 S 2 nan St Louis, MO
1 1 True Allison, Master. Hudson Trevor male 0.91671 2 113781 151.55 C22 C26S 11 nan Montreal, PQ / Chesterville, ON
2 1 False Allison, Miss. Helen Loraine female2.0 1 2 113781 151.55 C22 C26S -- nan Montreal, PQ / Chesterville, ON
3 1 False Allison, Mr. Hudson Joshua Creighton male 30.0 1 2 113781 151.55 C22 C26S -- 135.0 Montreal, PQ / Chesterville, ON
4 1 False Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25.0 1 2 113781 151.55 C22 C26S -- nan Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1,3043 False Zabour, Miss. Hileni female14.5 1 0 2665 14.4542 -- C -- 328.0 --
1,3053 False Zabour, Miss. Thamine femalenan 1 0 2665 14.4542 -- C -- nan --
1,3063 False Zakarian, Mr. Mapriededer male 26.5 0 0 2656 7.225 -- C -- 304.0 --
1,3073 False Zakarian, Mr. Ortin male 27.0 0 0 2670 7.225 -- C -- nan --
1,3083 False Zimmerman, Mr. Leo male 29.0 0 0 315082 7.875 -- S -- nan --

Shuffling

From the preview of the DataFrame we notice that the data is sorted alphabetically by name and by passenger class. Thus we need to shuffle it before we split it into train and test sets.

[4]:
# The dataset is ordered, so let's shuffle it
df = df.shuffle(random_state=31)

Shuffling for large datasets

As mentioned in The Iris tutorial, you are likely to get a better performance if you export to disk your shuffled dataset, especially when the dataset is larger in size:

df.shuffle().export("shuffled.hdf5")
df = vaex.open("shuffled.hdf5")
df_train, df_test = df.ml.train_test_split(test_size=0.2)

Split into train and test

Once the data is shuffled, let’s split it into train and test sets. The test set will comprise 20% of the data. Note that we do not shuffle the data for you, since vaex cannot assume your data fits into memory, you are responsible for either writing it in shuffled order on disk, or shuffle it in memory (the previous step).

[5]:
# Train and test split, no shuffling occurs
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

Sanity checks

Before we move on to process the data, let’s verify that our train and test sets are “similar” enough. We will not be very rigorous here, but just look at basic statistics of some of the key features.

For starters, let’s check that the fraction of survivals is similar between the train and test sets.

[6]:
# Inspect the target variable
train_survived_value_counts = df_train.survived.value_counts()
test_survived_value_counts = df_test.survived.value_counts()


plt.figure(figsize=(12, 4))

plt.subplot(121)
train_survived_value_counts.plot.bar()
train_sex_ratio = train_survived_value_counts[True]/train_survived_value_counts[False]
plt.title(f'Train set: survivied ratio: {train_sex_ratio:.2f}')
plt.ylabel('Number of passengers')

plt.subplot(122)
test_survived_value_counts.plot.bar()
test_sex_ratio = test_survived_value_counts[True]/test_survived_value_counts[False]
plt.title(f'Test set: surived ratio: {test_sex_ratio:.2f}')


plt.tight_layout()
plt.show()
_images/guides_ml_titanic_14_0.png

Next up, let’s check whether the ratio of male to female passengers is not too dissimilar between the two sets.

[7]:
# Check the sex balance
train_sex_value_counts = df_train.sex.value_counts()
test_sex_value_counts = df_test.sex.value_counts()

plt.figure(figsize=(12, 4))

plt.subplot(121)
train_sex_value_counts.plot.bar()
train_sex_ratio = train_sex_value_counts['male']/train_sex_value_counts['female']
plt.title(f'Train set: male vs female ratio: {train_sex_ratio:.2f}')
plt.ylabel('Number of passengers')

plt.subplot(122)
test_sex_value_counts.plot.bar()
test_sex_ratio = test_sex_value_counts['male']/test_sex_value_counts['female']
plt.title(f'Test set: male vs female ratio: {test_sex_ratio:.2f}')


plt.tight_layout()
plt.show()
_images/guides_ml_titanic_16_0.png

Finally, lets check that the relative number of passenger per class is similar between the train and test sets.

[8]:
# Check the class balance
train_pclass_value_counts = df_train.pclass.value_counts() / len(df_train)
test_pclass_value_counts = df_test.pclass.value_counts() / len(df_test)

plt.figure(figsize=(12, 4))

plt.subplot(121)
plt.title('Train set: passenger class')
plt.ylabel('Fraction of passengers')
train_pclass_value_counts.plot.bar()

plt.subplot(122)
plt.title('Test set: passenger class')
test_pclass_value_counts.plot.bar()

plt.tight_layout()
plt.show()
_images/guides_ml_titanic_18_0.png

From the above diagnostics, we are satisfied that, at least in these few categories, the train and test are similar enough, and we can move forward.

Feature engineering

In this section we will use vaex to create meaningful features that will be used to train a classification model. To start with, let’s get a high level overview of the training data.

[9]:
df_train.describe()
[9]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest
data_type int64 bool string string float64 int64 int64 string float64 string string string float64 string
count 1047 1047 1047 1047 841 1047 1047 1047 1046 233 1046 380 102 592
NA 0 0 0 0 206 0 0 0 1 814 1 667 945 455
mean 2.3075453677172875 0.3744030563514804 -- -- 29.565299286563608 0.5100286532951289 0.3982808022922636 -- 32.92609101338429 -- -- -- 159.6764705882353 --
std 0.833269 0.483968 -- -- 14.161953 1.071309 0.890852 -- 50.678261 -- -- -- 96.220759 --
min 1 False -- -- 0.1667 0 0 -- 0.0 -- -- -- 1.0 --
max 3 True -- -- 80.0 8 9 -- 512.3292 -- -- -- 327.0 --

Imputing

We notice that there are 3 columns that have missing data, so our first task will be to impute the missing values with suitable substitutes. This is our strategy:

  • age: impute with the median age value

  • fare: impute with the mean fare of the 5 most common values.

  • cabin: impute with “M” for “Missing”

  • Embarked: Impute with with the most common value.

[10]:
# Handle missing values

# Age - just do the median of the training set for now
fill_age = df_train.percentile_approx(expression='age', percentage=50.0)
# For some numpy versions the `np.percentile` method is broken and returns nan.
# As a failsafe, in those cases fill with the mean.
if np.isnan(fill_age):
    fill_age = df_train.mean(expression='age')
df_train['age'] = df_train.age.fillna(value=fill_age)

# Fare: the mean of the 5 most common ticket prices.
fill_fares = df_train.fare.value_counts(dropna=True)
fill_fare = fill_fares.iloc[:5].index.values.mean()
df_train['fare'] = df_train.fare.fillna(value=fill_fare)

# Cabing: this is a string column so let's mark it as "M" for "Missing"
df_train['cabin'] = df_train.cabin.fillna(value='M')

# Embarked: Similar as for Cabin, let's mark the missing values with "U" for unknown
fill_embarked = df_train.embarked.value_counts(dropna=True).index[0]
df_train['embarked'] = df_train.embarked.fillna(value=fill_embarked)

String processing

Next up, let’s engineer some new, more meaningful features out of the “raw” data that is present in the dataset. Starting with the name of the passengers, we are going to extract the titles, as well as we are going to count the number of words a name contains. These features can be a loose proxy to the age and status of the passengers.

[11]:
# Engineer features from the names

# Titles
df_train['name_title'] = df_train['name'].str.replace('.* ([A-Z][a-z]+)\..*', "\\1", regex=True)
display(df_train['name_title'])

# Number of words in the name
df_train['name_num_words'] = df_train['name'].str.count("[ ]+", regex=True) + 1
display(df_train['name_num_words'])
Expression = name_title
Length: 1,047 dtype: large_string (column)
------------------------------------------
   0      Mr
   1      Mr
   2     Mrs
   3    Miss
   4      Mr
    ...
1042  Master
1043     Mrs
1044  Master
1045      Mr
1046      Mr
Expression = name_num_words
Length: 1,047 dtype: int64 (column)
-----------------------------------
   0  3
   1  4
   2  5
   3  4
   4  4
  ...
1042  4
1043  6
1044  4
1045  4
1046  3

From the cabin colum, we will engineer 3 features: - “deck”: extacting the deck on which the cabin is located, which is encoded in each cabin value; - “multi_cabin: a boolean feature indicating whether a passenger is allocated more than one cabin - “has_cabin”: since there were plenty of values in the original cabin column that had missing values, we are just going to build a feature which tells us whether a passenger had an assigned cabin or not.

[12]:
#  Extract the deck
df_train['deck'] = df_train.cabin.str.slice(start=0, stop=1)
display(df_train['deck'])

# Passengers under which name have several rooms booked, these are all for 1st class passengers
df_train['multi_cabin'] = ((df_train.cabin.str.count(pat='[A-Z]', regex=True) > 1) &\
                           ~(df_train.deck == 'F')).astype('int')
display(df_train['multi_cabin'])

# Out of these, cabin has the most missing values, so let's create a feature tracking if a passenger had a cabin
df_train['has_cabin'] = df_train.cabin.notna().astype('int')
display(df_train['has_cabin'])
Expression = deck
Length: 1,047 dtype: string (column)
------------------------------------
   0  M
   1  B
   2  M
   3  M
   4  M
  ...
1042  M
1043  M
1044  M
1045  B
1046  M
Expression = multi_cabin
Length: 1,047 dtype: int64 (column)
-----------------------------------
   0  0
   1  0
   2  0
   3  0
   4  0
  ...
1042  0
1043  0
1044  0
1045  1
1046  0
Expression = has_cabin
Length: 1,047 dtype: int64 (column)
-----------------------------------
   0  1
   1  1
   2  1
   3  1
   4  1
  ...
1042  1
1043  1
1044  1
1045  1
1046  1

More features

There are two features that give an indication whether a passenger is travelling alone, or with a famly. These are the “sibsp” and “parch” columns that tell us the number of siblinds or spouses and the number of parents or children each passenger has on-board respectively. We are going to use this information to build two columns: - “family_size” the size of the family of each passenger; - “is_alone” an additional boolean feature which indicates whether a passenger is traveling without their family.

[13]:
# Size of family that are on board: passenger + number of siblings, spouses, parents, children.
df_train['family_size'] = (df_train.sibsp + df_train.parch + 1)
display(df_train['family_size'])

# Whether or not a passenger is alone
df_train['is_alone'] = (df_train.family_size == 0).astype('int')
display(df_train['is_alone'])
Expression = family_size
Length: 1,047 dtype: int64 (column)
-----------------------------------
   0  1
   1  1
   2  3
   3  4
   4  1
  ...
1042  8
1043  2
1044  3
1045  2
1046  1
Expression = is_alone
Length: 1,047 dtype: int64 (column)
-----------------------------------
   0  0
   1  0
   2  0
   3  0
   4  0
  ...
1042  0
1043  0
1044  0
1045  0
1046  0

Finally, let’s create two new features: - age \(\times\) class - fare per family member, i.e. fare \(/\) family_size

[14]:
# Create new features
df_train['age_times_class'] = df_train.age * df_train.pclass

# fare per person in the family
df_train['fare_per_family_member'] = df_train.fare / df_train.family_size

Modeling (part 1): gradient boosted trees

Since this dataset contains a lot of categorical features, we will start with a tree based model. This we will gear the following feature pre-processing towards the use of tree-based models.

Feature pre-processing for boosted tree models

The features “sex”, “embarked”, and “deck” can be simply label encoded. The feature “name_tite” contains certain a larger degree of cardinality, relative to the size of the training set, and in this case we will use the Frequency Encoder.

[15]:
label_encoder = vaex.ml.LabelEncoder(features=['sex', 'embarked', 'deck'], allow_unseen=True)
df_train = label_encoder.fit_transform(df_train)

# While doing a transform, previously unseen values will be encoded as "zero".
frequency_encoder = vaex.ml.FrequencyEncoder(features=['name_title'], unseen='zero')
df_train = frequency_encoder.fit_transform(df_train)
df_train
[15]:
# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title
0 3 False Stoytcheff, Mr. Ilia male 19.0 0 0 349205 7.8958 M S -- nan -- Mr 3 M 0 1 1 0 57.0 7.8958 1 1 0 0.5787965616045845
1 1 False Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5 B24 S -- nan Montreal, PQ Mr 4 B 0 1 1 0 23.0 93.5 1 1 1 0.5787965616045845
2 3 True Abbott, Mrs. Stanton (Rosa Hunt) female35.0 1 1 C.A. 267320.25 M S A nan East Providence, RI Mrs 5 M 0 1 3 0 105.0 6.75 0 1 0 0.1451766953199618
3 2 True Hocking, Miss. Ellen "Nellie" female20.0 2 1 29105 23.0 M S 4 nan Cornwall / Akron, OH Miss 4 M 0 1 4 0 40.0 5.75 0 1 0 0.20152817574021012
4 3 False Nilsson, Mr. August Ferdinand male 21.0 0 0 350410 7.8542 M S -- nan -- Mr 4 M 0 1 1 0 63.0 7.8542 1 1 0 0.5787965616045845
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1,0423 False Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9 M S -- nan Wiltshire, England Niagara Falls, NYMaster 4 M 0 1 8 0 3.0 5.8625 1 1 0 0.045845272206303724
1,0433 False Ahlin, Mrs. Johan (Johanna Persdotter Larsson)female40.0 1 0 7546 9.475 M S -- nan Sweden Akeley, MN Mrs 6 M 0 1 2 0 120.0 4.7375 0 1 0 0.1451766953199618
1,0443 True Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 M S 15 nan -- Master 4 M 0 1 3 0 12.0 3.7111 1 1 0 0.045845272206303724
1,0451 False Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208B58 B60C -- nan Montreal, PQ Mr 4 B 1 1 2 0 24.0 123.7604 1 0 1 0.5787965616045845
1,0463 False Coleff, Mr. Satio male 24.0 0 0 349209 7.4958 M S -- nan -- Mr 3 M 0 1 1 0 72.0 7.4958 1 1 0 0.5787965616045845

Once all the categorical data is encoded, we can select the features we are going to use for training the model.

[16]:
# features to use for the trainin of the boosting model
encoded_features = df_train.get_column_names(regex='^freque|^label')
features = encoded_features + ['multi_cabin', 'name_num_words',
                               'has_cabin', 'is_alone',
                               'family_size', 'age_times_class',
                               'fare_per_family_member',
                               'age', 'fare']

# Preview the feature matrix
df_train[features].head(5)
[16]:
# label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title multi_cabin name_num_words has_cabin is_alone family_size age_times_class fare_per_family_member age fare
0 1 1 0 0.578797 0 3 1 0 1 57 7.8958 19 7.8958
1 1 1 1 0.578797 0 4 1 0 1 23 93.5 2393.5
2 0 1 0 0.145177 0 5 1 0 3 105 6.75 3520.25
3 0 1 0 0.201528 0 4 1 0 4 40 5.75 2023
4 1 1 0 0.578797 0 4 1 0 1 63 7.8542 21 7.8542

Estimator: xgboost

Now let’s feed this data into an a tree based estimator. In this example we will use xgboost. In principle, any algorithm that follows the scikit-learn API convention, i.e. it contains the .fit, .predict methods is compatable with vaex. However, the data will be materialized, i.e. will be read into memory before it is passed on to the estimators. We are hard at work trying to make at least some of the estimators from scikit-learn run out-of-core!

[17]:
import xgboost
import vaex.ml.sklearn

# Instantiate the xgboost model normally, using the scikit-learn API
xgb_model = xgboost.sklearn.XGBClassifier(max_depth=11,
                                          learning_rate=0.1,
                                          n_estimators=500,
                                          subsample=0.75,
                                          colsample_bylevel=1,
                                          colsample_bytree=1,
                                          scale_pos_weight=1.5,
                                          reg_lambda=1.5,
                                          reg_alpha=5,
                                          n_jobs=8,
                                          random_state=42,
                                          use_label_encoder=False,
                                          verbosity=0)

# Make it work with vaex (for the automagic pipeline and lazy predictions)
vaex_xgb_model = vaex.ml.sklearn.Predictor(features=features,
                                           target='survived',
                                           model=xgb_model,
                                           prediction_name='prediction_xgb')
# Train the model
vaex_xgb_model.fit(df_train)
# Get the prediction of the model on the training data
df_train = vaex_xgb_model.transform(df_train)

# Preview the resulting train dataframe that contans the predictions
df_train
[17]:
# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb
0 3 False Stoytcheff, Mr. Ilia male 19.0 0 0 349205 7.8958 M S -- nan -- Mr 3 M 0 1 1 0 57.0 7.8958 1 1 0 0.5787965616045845 0
1 1 False Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5 B24 S -- nan Montreal, PQ Mr 4 B 0 1 1 0 23.0 93.5 1 1 1 0.5787965616045845 0
2 3 True Abbott, Mrs. Stanton (Rosa Hunt) female35.0 1 1 C.A. 267320.25 M S A nan East Providence, RI Mrs 5 M 0 1 3 0 105.0 6.75 0 1 0 0.1451766953199618 1
3 2 True Hocking, Miss. Ellen "Nellie" female20.0 2 1 29105 23.0 M S 4 nan Cornwall / Akron, OH Miss 4 M 0 1 4 0 40.0 5.75 0 1 0 0.20152817574021012 1
4 3 False Nilsson, Mr. August Ferdinand male 21.0 0 0 350410 7.8542 M S -- nan -- Mr 4 M 0 1 1 0 63.0 7.8542 1 1 0 0.5787965616045845 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1,0423 False Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9 M S -- nan Wiltshire, England Niagara Falls, NYMaster 4 M 0 1 8 0 3.0 5.8625 1 1 0 0.045845272206303724 0
1,0433 False Ahlin, Mrs. Johan (Johanna Persdotter Larsson)female40.0 1 0 7546 9.475 M S -- nan Sweden Akeley, MN Mrs 6 M 0 1 2 0 120.0 4.7375 0 1 0 0.1451766953199618 0
1,0443 True Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 M S 15 nan -- Master 4 M 0 1 3 0 12.0 3.7111 1 1 0 0.045845272206303724 1
1,0451 False Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208B58 B60C -- nan Montreal, PQ Mr 4 B 1 1 2 0 24.0 123.7604 1 0 1 0.5787965616045845 0
1,0463 False Coleff, Mr. Satio male 24.0 0 0 349209 7.4958 M S -- nan -- Mr 3 M 0 1 1 0 72.0 7.4958 1 1 0 0.5787965616045845 0

Notice that in the above cell block, we call .transform on the vaex_xgb_model object. This adds the “prediction_xgb” column as virtual column in the output dataframe. This can be quite convenient when calculating various metrics and making diagnosic plots. Of course, one can call a .predict on the vaex_xgb_model object, which returns an in-memory numpy array object housing the predictions.

Performance on training set

Anyway, let’s see what the performance is of the model on the training set. First let’s create a convenience function that will help us get multiple metrics at once.

[18]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
def binary_metrics(y_true, y_pred):
    acc = accuracy_score(y_true=y_true, y_pred=y_pred)
    f1 = f1_score(y_true=y_true, y_pred=y_pred)
    roc = roc_auc_score(y_true=y_true, y_score=y_pred)
    print(f'Accuracy: {acc:.3f}')
    print(f'f1 score: {f1:.3f}')
    print(f'roc-auc: {roc:.3f}')

Now let’s check the performance of the model on the training set.

[19]:
print('Metrics for the training set:')
binary_metrics(y_true=df_train.survived.values, y_pred=df_train.prediction_xgb.values)
Metrics for the training set:
Accuracy: 0.924
f1 score: 0.896
roc-auc: 0.914

Automatic pipelines

Now, let’s inspect the performance of the model on the test set. You probably noticed that, unlike when using other libraries, we did not bother to create a pipeline while doing all the cleaning, inputing, feature engineering and categorial encoding. Well, we did not explicitly create a pipeline. In fact veax keeps track of all the changes one applies to a DataFrame in something called a state. A state is the place which contains all the informations regarding, for instance, the virtual columns we’ve created, which includes the newly engineered features, the categorically encoded columns, and even the model prediction! So all we need to do, is to extract the state from the training DataFrame, and apply it to the test DataFrame.

[20]:
# state transfer to the test set
state = df_train.state_get()
df_test.state_set(state)

# Preview of the "transformed" test set
df_test.head(5)
[20]:
# pclasssurvived name sex age sibsp parchticket farecabin embarked boat bodyhome_dest name_title name_num_wordsdeck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb
0 3False O'Connor, Mr. Patrick male 28.032 0 0366713 7.75 M Q -- nan-- Mr 3M 0 1 1 0 84.096 7.75 1 2 0 0.578797 0
1 3False Canavan, Mr. Patrick male 21 0 0364858 7.75 M Q -- nanIreland Philadelphia, PAMr 3M 0 1 1 0 63 7.75 1 2 0 0.578797 0
2 1False Ovies y Rodriguez, Mr. Servando male 28.5 0 0PC 17562 27.7208D43 C -- 189?Havana, Cuba Mr 5D 0 1 1 0 28.5 27.7208 1 0 4 0.578797 1
3 3False Windelov, Mr. Einar male 21 0 0SOTON/OQ 3101317 7.25 M S -- nan-- Mr 3M 0 1 1 0 63 7.25 1 1 0 0.578797 0
4 2True Shelley, Mrs. William (Imanita Parrish Hall)female25 0 1230433 26 M S 12 nanDeer Lodge, MT Mrs 6M 0 1 2 0 50 13 0 1 0 0.145177 1

Notice that once we apply the state from the train to the test set, the test DataFrame contains all the features we created or modified in the training data, and even the predictions of the xgboost model!

The state is a simple Python dictionary, which can be easily stored as JSON to disk, which makes it very easy to deploy.

Performance on test set

Now it is trivial to check the model performance on the test set:

[21]:
print('Metrics for the test set:')
binary_metrics(y_true=df_test.survived.values, y_pred=df_test.prediction_xgb.values)
Metrics for the test set:
Accuracy: 0.786
f1 score: 0.728
roc-auc: 0.773

Feature importance

Let’s now look at the feature importance of the xgboost model.

[22]:
plt.figure(figsize=(6, 9))

ind = np.argsort(xgb_model.feature_importances_)[::-1]
features_sorted = np.array(features)[ind]
importances_sorted = xgb_model.feature_importances_[ind]

plt.barh(y=range(len(features)), width=importances_sorted, height=0.2)
plt.title('Gain')
plt.yticks(ticks=range(len(features)), labels=features_sorted)
plt.gca().invert_yaxis()
plt.show()
_images/guides_ml_titanic_46_0.png

Modeling (part 2): Linear models & Ensembles

Given the randomness of the Titanic dataset , we can be satisfied with the performance of xgboost model above. Still, it is always usefull to try a variety of models and approaches, especially since vaex makes makes this process rather simple.

In the following part we will use a couple of linear models as our predictors, this time straight from scikit-learn. This requires us to pre-process the data in a slightly different way.

Feature pre-processing for linear models

When using linear models, the safest option is to encode categorical variables with the one-hot encoding scheme, especially if they have low cardinality. We will do this for the “family_size” and “deck” features. Note that the “sex” feature is already encoded since it has only unique values options.

The “name_title” feature is a bit more tricky. Since in its original form it has some values that only appear a couple of times, we will do a trick: we will one-hot encode the frequency encoded values. This will reduce cardinality of the feature, while also preserving the most important, i.e. most common values.

Regarding the “age” and “fare”, to add some variance in the model, we will not convert them to categorical as before, but simply remove their mean and standard-deviations (standard-scaling). We will do the same to the “fare_per_family_member” feature.

Finally, we will drop out any other features.

[23]:
# One-hot encode categorical features
one_hot = vaex.ml.OneHotEncoder(features=['deck', 'family_size', 'name_title'])
df_train = one_hot.fit_transform(df_train)
[24]:
# Standard scale numerical features
standard_scaler = vaex.ml.StandardScaler(features=['age', 'fare', 'fare_per_family_member'])
df_train = standard_scaler.fit_transform(df_train)
[25]:
# Get the features for training a linear model
features_linear = df_train.get_column_names(regex='^deck_|^family_size_|^frequency_encoded_name_title_')
features_linear += df_train.get_column_names(regex='^standard_scaled_')
features_linear += ['label_encoded_sex']
features_linear
[25]:
['deck_A',
 'deck_B',
 'deck_C',
 'deck_D',
 'deck_E',
 'deck_F',
 'deck_G',
 'deck_M',
 'family_size_1',
 'family_size_2',
 'family_size_3',
 'family_size_4',
 'family_size_5',
 'family_size_6',
 'family_size_7',
 'family_size_8',
 'family_size_11',
 'standard_scaled_age',
 'standard_scaled_fare',
 'standard_scaled_fare_per_family_member',
 'label_encoded_sex']

Estimators: SVC and LogisticRegression

[26]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
[27]:
# The Support Vector Classifier
vaex_svc = vaex.ml.sklearn.Predictor(features=features_linear,
                                     target='survived',
                                     model=SVC(max_iter=1000, random_state=42),
                                     prediction_name='prediction_svc')

# Logistic Regression
vaex_logistic = vaex.ml.sklearn.Predictor(features=features_linear,
                                          target='survived',
                                          model=LogisticRegression(max_iter=1000, random_state=42),
                                          prediction_name='prediction_lr')

# Train the new models and apply the transformation to the train dataframe
for model in [vaex_svc, vaex_logistic]:
    model.fit(df_train)
    df_train = model.transform(df_train)

# Preview of the train DataFrame
df_train.head(5)
/home/jovan/miniconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:258: ConvergenceWarning: Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)
[27]:
# pclasssurvived name sex age sibsp parchticket farecabin embarked boat bodyhome_dest name_title name_num_wordsdeck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb deck_A deck_B deck_C deck_D deck_E deck_F deck_G deck_M family_size_1 family_size_2 family_size_3 family_size_4 family_size_5 family_size_6 family_size_7 family_size_8 family_size_11 name_title_Capt name_title_Col name_title_Countess name_title_Don name_title_Dona name_title_Dr name_title_Jonkheer name_title_Lady name_title_Major name_title_Master name_title_Miss name_title_Mlle name_title_Mme name_title_Mr name_title_Mrs name_title_Ms name_title_Rev standard_scaled_age standard_scaled_fare standard_scaled_fare_per_family_memberprediction_svc prediction_lr
0 3False Stoytcheff, Mr. Ilia male 19 0 0349205 7.8958M S -- nan-- Mr 3M 0 1 1 0 57 7.8958 1 1 0 0.578797 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.807704 -0.493719 -0.342804False False
1 1False Payne, Mr. Vivian Ponsonby male 23 0 012749 93.5 B24 S -- nanMontreal, PQ Mr 4B 0 1 1 0 23 93.5 1 1 1 0.578797 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.492921 1.19613 1.99718 False True
2 3True Abbott, Mrs. Stanton (Rosa Hunt)female 35 1 1C.A. 267320.25 M S A nanEast Providence, RI Mrs 5M 0 1 3 0 105 6.75 0 1 0 0.145177 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0.45143 -0.249845 -0.374124True True
3 2True Hocking, Miss. Ellen "Nellie" female 20 2 129105 23 M S 4 nanCornwall / Akron, OHMiss 4M 0 1 4 0 40 5.75 0 1 0 0.201528 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 -0.729008 -0.195559 -0.401459True True
4 3False Nilsson, Mr. August Ferdinand male 21 0 0350410 7.8542M S -- nan-- Mr 4M 0 1 1 0 63 7.8542 1 1 0 0.578797 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.494541 -0.343941False False

Ensemble

Just as before, the predictions from the SVC and the LogisticRegression classifiers are added as virtual columns in the training dataset. This is quite powerful, since now we can easily use them to create an ensemble! For example, let’s do a weighted mean.

[28]:
# Weighed mean of the classes
prediction_final = (df_train.prediction_xgb.astype('int') * 0.3 +
                    df_train.prediction_svc.astype('int') * 0.5 +
                    df_train.prediction_xgb.astype('int') * 0.2)
# Get the predicted class
prediction_final = (prediction_final >= 0.5)
# Add the expression to the train DataFrame
df_train['prediction_final'] = prediction_final

# Preview
df_train[df_train.get_column_names(regex='^predict')]
[28]:
# prediction_xgb prediction_svc prediction_lr prediction_final
0 0 False False False
1 0 False True False
2 1 True True True
3 1 True True True
4 0 False False False
... ... ... ... ...
1,0420 False False False
1,0430 True True True
1,0441 True False True
1,0450 True True True
1,0460 False False False

Performance (part 2)

Applying the ensembler to the test set is just as easy as before. We just need to get the new state of the training DataFrame, and transfer it to the test DataFrame.

[29]:
# State transfer
state_new = df_train.state_get()
df_test.state_set(state_new)

# Preview
df_test.head(5)
[29]:
# pclasssurvived name sex age sibsp parchticket farecabin embarked boat bodyhome_dest name_title name_num_wordsdeck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb deck_A deck_B deck_C deck_D deck_E deck_F deck_G deck_M family_size_1 family_size_2 family_size_3 family_size_4 family_size_5 family_size_6 family_size_7 family_size_8 family_size_11 name_title_Capt name_title_Col name_title_Countess name_title_Don name_title_Dona name_title_Dr name_title_Jonkheer name_title_Lady name_title_Major name_title_Master name_title_Miss name_title_Mlle name_title_Mme name_title_Mr name_title_Mrs name_title_Ms name_title_Rev standard_scaled_age standard_scaled_fare standard_scaled_fare_per_family_memberprediction_svc prediction_lr prediction_final
0 3False O'Connor, Mr. Patrick male 28.032 0 0366713 7.75 M Q -- nan-- Mr 3M 0 1 1 0 84.096 7.75 1 2 0 0.578797 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.096924 -0.496597 -0.346789False False False
1 3False Canavan, Mr. Patrick male 21 0 0364858 7.75 M Q -- nanIreland Philadelphia, PAMr 3M 0 1 1 0 63 7.75 1 2 0 0.578797 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.496597 -0.346789False False False
2 1False Ovies y Rodriguez, Mr. Servando male 28.5 0 0PC 17562 27.7208D43 C -- 189?Havana, Cuba Mr 5D 0 1 1 0 28.5 27.7208 1 0 4 0.578797 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.0600935 -0.102369 0.19911 False False True
3 3False Windelov, Mr. Einar male 21 0 0SOTON/OQ 3101317 7.25 M S -- nan-- Mr 3M 0 1 1 0 63 7.25 1 1 0 0.578797 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.506468 -0.360456False False False
4 2True Shelley, Mrs. William (Imanita Parrish Hall)female25 0 1230433 26 M S 12 nanDeer Lodge, MT Mrs 6M 0 1 2 0 50 13 0 1 0 0.145177 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 -0.335529 -0.136338 -0.203281True True True

Finally, let’s check the performance of all the individual models as well as on the ensembler, on the test set.

[30]:
pred_columns = df_train.get_column_names(regex='^prediction_')
for i in pred_columns:
    print(i)
    binary_metrics(y_true=df_test.survived.values, y_pred=df_test[i].values)
    print(' ')
prediction_xgb
Accuracy: 0.786
f1 score: 0.728
roc-auc: 0.773

prediction_svc
Accuracy: 0.802
f1 score: 0.743
roc-auc: 0.786

prediction_lr
Accuracy: 0.779
f1 score: 0.713
roc-auc: 0.762

prediction_final
Accuracy: 0.809
f1 score: 0.771
roc-auc: 0.804

We see that our ensembler is doing a better job than any idividual model, as expected.

Thanks you for going over this example. Feel free to copy, modify, and in general play around with this notebook.

Performance notes

In most cases, minimizing memory usage is Vaex’ first priority, and performance comes seconds. This allows Vaex to work with very large datasets, without shooting yourself in the foot.

However, this sometimes comes at the cost of performance.

Virtual columns

When we add a new column to a dataframe based on existing, Vaex will create a virtual column, e.g.:

[18]:
import vaex
import numpy as np
x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

In this dataframe, x uses memory, while y does not, it will be evaluate in chunks when needed. To demonstate the performance implications, let us compute with the column, to force the evaluation.

[21]:
%%time
df.x.mean()
CPU times: user 2.74 s, sys: 12.3 ms, total: 2.75 s
Wall time: 71.2 ms
[21]:
array(49999999.5)
[22]:
%%time
df.y.mean()
CPU times: user 3.88 s, sys: 635 ms, total: 4.52 s
Wall time: 304 ms
[22]:
array(-17.42068049)

From this, we can see that a similar computation (the mean), with a virtual column can be much slower, a penalty we pay for saving memory.

Materializing the columns

We can ask Vaex to materialize a column, or all virtual column using df.materialize

[23]:
df_mat = df.materialize()
[24]:
%%time
df_mat.x.mean()
CPU times: user 2.54 s, sys: 14 ms, total: 2.56 s
Wall time: 68.1 ms
[24]:
array(49999999.5)
[25]:
%%time
df_mat.y.mean()
CPU times: user 2.64 s, sys: 18.7 ms, total: 2.66 s
Wall time: 68.1 ms
[25]:
array(-17.42068049)

We now get equal performance for both columns

Consideration in backends with multiple workers

As often is the case with web frameworks in Python, we use multiple workers, e.g. using gunicorn. If all workers would materialize, it would waste a lot of memory, there are two solutions to this issue:

Save to disk

Export the dataframe to disk in hdf5 or arrow format as a pre-process step, and let all workers access the same file. Due to memory mapping, each worker will share the same memory.

e.g.

df.export('materialized-data.hdf5', progress=True)

Materialize a single time

Gunicorn has the following command line flag:

--preload             Load application code before the worker processes are forked. [False]

This will let gunicorn first run you app (a single time), allowing you to do the materialize step. After your script run, it will fork, and all workers will share the same memory.

Tip:

A good ida could be to mix the two, and use use Vaex’ df.fingerprint method to cache the file to disk.

E.g.

import vaex
import numpy as np
import os

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

filename = "vaex-cache-" + df.fingerprint() + ".hdf5"
if not os.path.exists(filename):
    df.export(filename, progress=True)
else:
    df = vaex.open(filename)

In case the virtual columns change, rerunning will create a new cache file, and changing back will use the previously generated cache file. This is especially useful during development.

In this case, it is still important to let gunicorn run a single process first (using the --preload flag), to avoid multiple workers doing the same work.

[28]:

[ ]:

Progress Bars

Basic progress bars

Progress bars are an excellent way to get an idea of how long a certain computation might take. Most of the methods responsible for computations or aggregations in Vaex support the display of progressbars. Displaying progress bars is as easy as:

[1]:
import vaex

df = vaex.datasets.taxi()
df.total_amount.mean(progress=True)
mean [########################################] 100.00% elapsed time  :     0.09s =  0.0m =  0.0h

[1]:
array(11.6269824)

If you are in the Jupyter notebook, you can pass progress='widget' to get a nicer looking progress bar, provided by ipywidgets:

[2]:
df.payment_type.unique(progress='widget')
[2]:
['CRD', 'CSH']

Rich based progress bars

Using Rich based progress bars we can take this idea to the next level. With Rich one gets to see a tree structure of progress bars that give the user an idea of what Vaex does internally, and how long each step takes. Each leaf in this tree is a Task, while the nodes are used to group tasks logically. For instance, in the following example the last node named ‘mean’ uses the mean aggregation, which creates two tasks: sum and count agregations.

[3]:
with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count')
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'))
    result_3 = df.tip_amount.mean()

In the last column (between brackets) we also see how many passes over the data Vaex had to do to compute all results. The last two tasks are done together in the 5th pass.

If we want to do all computations in a single pass over the data for performance reason, we can use Vaex’ async way, by adding the delayed argument (see Async programming with Vaex for more details).

[4]:
with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    result_3 = df.tip_amount.mean(delay=True)
    df.execute()
result_1 = result_1.get()
result_2 = result_2.get()
result_3 = result_3.get()

We see that all computations are done in a single pass over the data, which is slightly faster in this case because we are not IO bound. On slower disks, or slower formats (e.g. parquet) this difference will be larger.

Combining this with the caching feature, we can clearly see the effect on later calculations, and the efficiency of Vaex:

[5]:
vaex.cache.disk(clear=True)  # turn on cache, and delete all cache entries

with vaex.progress.tree('rich', title="Warm up cache"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    df.execute()


with vaex.progress.tree('rich', title="My Vaex computations"):
    result_1 = df.groupby('passenger_count', agg='count', delay=True)
    result_2 = df.groupby('vendor_id', agg=vaex.agg.sum('tip_amount'), delay=True)
    result_3 = df.tip_amount.mean(delay=True)
    df.execute()
vaex.cache.off();

Vaex server

Why

There are various cases where the calculations and/or aggregations need to happen on a different computer than where the (aggregated) data is needed. For instance, when making a dashboard, the dashboard server might not be powerful enough for the calculations. Another example is where the client lives in a different process, such as a browser.

Starting the dataframe server

Use our server first

You can skip running your own server and first try out using https://dataframe.vaex.io

The vaex (web) server can be started from the command line like:

$ vaex server --port 8082 /data/taxi/yellow_taxi_2012.hdf5 gaia=/data/gaia/gaia-edr3-x-ps1.hdf5
INFO:MainThread:vaex.server:yellow_taxi_2012:  http://0.0.0.0:8082/dataset/yellow_taxi_2012 for REST or ws://0.0.0.0:8082/yellow_taxi_2012 for websocket
INFO:MainThread:vaex.server:gaia:  http://0.0.0.0:8082/dataset/gaia for REST or ws://0.0.0.0:8082/gaia for websocket
INFO:     Started server process [617048]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8082 (Press CTRL+C to quit)

Pass files on the command line, or query help by passing the --help flag.

Python API

When the client is a Python program, the easiest API is the remote dataframe in the vaex packages itself. This does not use the REST API, but communicates over a websocket for low latency bi-directional communication.

import vaex
# the data is kept remote
df = vaex.open('vaex+wss://dataframe.vaex.io/example')
# only the result of the aggregations are send over the wire
df.x.mean()

This means you can use almost all features of a normal (local) Vaex dataframe, without having to download the data.

REST API

When the client is non-Python, or when you want to avoid the vaex dependency, the REST API can be used.

A Vaex server is running at dataframe.vaex.io and it’s API documentation can be browsed at https://dataframe.vaex.io/docs

Some endpoints can be easily queries using curl

$ curl -i https://dataframe.vaex.io/histogram/example/x\?shape\=16
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Thu, 01 Apr 2021 11:23:16 GMT
Content-Type: application/json
Content-Length: 430
Connection: keep-alive
x-process-time: 0.03632664680480957
x-data-passes: 2

{"dataset_id":"example","centers":[-71.61332178115845,-58.57391309738159,-45.534504413604736,-32.49509572982788,-19.455687046051025,-6.41627836227417,6.6231303215026855,19.66253900527954,32.7019476890564,45.74135637283325,58.78076505661011,71.82017374038696,84.85958242416382,97.89899110794067,110.93839979171753,123.97780847549438],"values":[3.0,0.0,3.0,917.0,13706.0,154273.0,147171.0,12963.0,960.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0]}

While the POST method might be more convenient from a Javascript client or using a HTTP Library.

Python using requests

Requests is an easy to use HTTP library.

import requests
data = {
    'dataset_id': 'gaia-dr2',
    'expression_x': 'l',
    'expression_y': 'b',
    'filter': None,
    'virtual_columns': [],
    'min_x': 0,
    'max_x': 360,
    'min_y': -90,
    'max_y': 90,
    'shape_x': 512,
    'shape_y': 256,
}
response = requests.post('https://dataframe.vaex.io/heatmap', json=data)
response.json()
assert response.status_code == 200, 'oops, something went wrong'
{'dataset_id': 'gaia-dr2',
 'centers_x': [22.5, 67.5, 112.5, 157.5, 202.5, 247.5, 292.5, 337.5],
 'centers_y': [-67.5, -22.5, 22.5, 67.5],
 'values': [[3508786.0,
   2711710.0,
   2287021.0,
   2042114.0,
   2009057.0,
   2448207.0,
   3716951.0,
   3644323.0],
  [250883466.0,
   100064757.0,
   49538929.0,
   28273970.0,
   30521201.0,
   53214391.0,
   159460735.0,
   251170124.0],
  [166984543.0,
   110774989.0,
   43475771.0,
   31343345.0,
   31584354.0,
   44061582.0,
   108436851.0,
   189699927.0],
  [3388522.0,
   2848641.0,
   2221241.0,
   1997993.0,
   1941090.0,
   2215271.0,
   3061986.0,
   3387287.0]]}
var inputData = {
    dataset_id: 'gaia-dr2',
    expression_x: 'l',
    expression_y: 'b',
    filter: null,
    virtual_columns: [],
    min_x: 0,
    max_x: 360,
    min_y: -90,
    max_y: 90,
    shape: [512, 256],
};
var result = await fetch("https://dataframe.vaex.io/heatmap", {method: 'POST', body: JSON.stringify(inputData)})
var data = await result.json();
console.log(data);
{dataset_id: "gaia-dr2", centers_x: Array(512), centers_y: Array(256), values: Array(256)}
centers_x: (512) [0.3515625, 1.0546875, 1.7578125, 2.4609375, 3.1640625, 3.8671875, 4.5703125, 5.2734375, 5.9765625, 6.6796875, 7.3828125, 8.0859375, 8.7890625, 9.4921875, 10.1953125, 10.8984375, 11.6015625, 12.3046875, 13.0078125, 13.7109375, 14.4140625, 15.1171875, 15.8203125, 16.5234375, 17.2265625, 17.9296875, 18.6328125, 19.3359375, 20.0390625, 20.7421875, 21.4453125, 22.1484375, 22.8515625, 23.5546875, 24.2578125, 24.9609375, 25.6640625, 26.3671875, 27.0703125, 27.7734375, 28.4765625, 29.1796875, 29.8828125, 30.5859375, 31.2890625, 31.9921875, 32.6953125, 33.3984375, 34.1015625, 34.8046875, 35.5078125, 36.2109375, 36.9140625, 37.6171875, 38.3203125, 39.0234375, 39.7265625, 40.4296875, 41.1328125, 41.8359375, 42.5390625, 43.2421875, 43.9453125, 44.6484375, 45.3515625, 46.0546875, 46.7578125, 47.4609375, 48.1640625, 48.8671875, 49.5703125, 50.2734375, 50.9765625, 51.6796875, 52.3828125, 53.0859375, 53.7890625, 54.4921875, 55.1953125, 55.8984375, 56.6015625, 57.3046875, 58.0078125, 58.7109375, 59.4140625, 60.1171875, 60.8203125, 61.5234375, 62.2265625, 62.9296875, 63.6328125, 64.3359375, 65.0390625, 65.7421875, 66.4453125, 67.1484375, 67.8515625, 68.5546875, 69.2578125, 69.9609375, …]
centers_y: (256) [-89.6484375, -88.9453125, -88.2421875, -87.5390625, -86.8359375, -86.1328125, -85.4296875, -84.7265625, -84.0234375, -83.3203125, -82.6171875, -81.9140625, -81.2109375, -80.5078125, -79.8046875, -79.1015625, -78.3984375, -77.6953125, -76.9921875, -76.2890625, -75.5859375, -74.8828125, -74.1796875, -73.4765625, -72.7734375, -72.0703125, -71.3671875, -70.6640625, -69.9609375, -69.2578125, -68.5546875, -67.8515625, -67.1484375, -66.4453125, -65.7421875, -65.0390625, -64.3359375, -63.6328125, -62.9296875, -62.2265625, -61.5234375, -60.8203125, -60.1171875, -59.4140625, -58.7109375, -58.0078125, -57.3046875, -56.6015625, -55.8984375, -55.1953125, -54.4921875, -53.7890625, -53.0859375, -52.3828125, -51.6796875, -50.9765625, -50.2734375, -49.5703125, -48.8671875, -48.1640625, -47.4609375, -46.7578125, -46.0546875, -45.3515625, -44.6484375, -43.9453125, -43.2421875, -42.5390625, -41.8359375, -41.1328125, -40.4296875, -39.7265625, -39.0234375, -38.3203125, -37.6171875, -36.9140625, -36.2109375, -35.5078125, -34.8046875, -34.1015625, -33.3984375, -32.6953125, -31.9921875, -31.2890625, -30.5859375, -29.8828125, -29.1796875, -28.4765625, -27.7734375, -27.0703125, -26.3671875, -25.6640625, -24.9609375, -24.2578125, -23.5546875, -22.8515625, -22.1484375, -21.4453125, -20.7421875, -20.0390625, …]
dataset_id: "gaia-dr2"
values: (256) [ …]
__proto__: Object

Example using plotly.js

Combining the previous with the plotly.js library we can make an interactive plot:

Sky map

First, make sure we have a div

<div id="plotlyHeatmap"></div>

Then load the data, and plot it using plotly.js:

var skyMapInput = {
    dataset_id: 'gaia-dr2',
    expression_x: 'l',
    expression_y: 'b',
    virtual_columns: {
        distance: "1/parallax"
    },
    filter: this.filter,
    min_x: 0,
    max_x: 360,
    min_y: -90,
    max_y: 90,
    shape: [512, 256],
};

async function loadData(heatmapInput) {
    const result = await fetch("https://dataframe-dev.vaex.io/heatmap", {method: 'POST', body: JSON.stringify(heatmapInput)})
    const data = await result.json();
    return data;
}

function plotData(elementId, data, log, xaxis, yaxis) {
    const trace_data = {
        x: data.centers_x,
        y: data.centers_y,
        z: log ? data.values.map((ar1d) => ar1d.map(Math.log1p)) : data.values,
        type: 'heatmap',
        colorscale: 'plasma',
        transpose: true,
    };
    var layout = {
        xaxis: {
            title: {
                text: data.expression_x,
            },
            ...xaxis
        },
        yaxis: {
            title: {
                text: data.expression_y,
            },
            ...yaxis
        }
    };
    Plotly.react(elementId, [trace_data], layout);
}

async function plot(elementId, heatmapInput, xaxis, yaxis) {
    const heatmapOutput = await loadData(heatmapInput);
    await plotData(elementId, heatmapOutput, true, xaxis, yaxis);
}

plot('plotlyHeatmap', skyMapInput);

Adding an event handler, will refine the data when we zoom in:

function addZoomHandler(elementId, heatmapInput) {
    document.getElementById(elementId).on('plotly_relayout', async (e) => {
        // mutate input data
        heatmapInput.min_x = e["xaxis.range[0]"]
        heatmapInput.max_x = e["xaxis.range[1]"]
        heatmapInput.min_y = e["yaxis.range[0]"]
        heatmapInput.max_y = e["yaxis.range[1]"]
        // and plot again
        plot(elementId, heatmapInput);
    })
}

CMD

We can now easily add a second heatmap

<div id="plotlyHeatmapCMD"></div>

And plot a different heatmap (a color-magnitude diagram) on this div.

var cmdInput = {
    dataset_id: 'gaia-dr2',
    expression_x: 'phot_bp_mean_mag-phot_rp_mean_mag',
    expression_y: 'M_g',
    virtual_columns: {
        distance: "1/parallax",
        M_g: "phot_g_mean_mag-(5*log10(distance)+10)"
    },
    filter: '((pmra**2+pmdec**2)<100)&(parallax_over_error>10)&(abs(b)>20)',
    min_x: -1,
    max_x: 5,
    min_y: 15,
    max_y: -5,
    shape_x: 256,
    shape_y: 256,
};

async () => {
    await plot('plotlyHeatmapCMD', cmdInput);
    addZoomHandler('plotlyHeatmapCMD', cmdInput);
}

Configuration

All settings in Vaex can be configured in a uniform way, based on Pydantic. From a Python runtime, configuration of settings can be done via the vaex.settings module.

import vaex
vaex.settings.main.thread_count = 10
vaex.settings.display.max_columns = 50

Via environmental variables:

$ VAEX_NUM_THREADS=10 VAEX_DISPLAY_MAX_COLUMNS=50 python myservice.py

Otherwise, values are obtained from a .env file using dotenv from the current working directory.

VAEX_NUM_THREADS=22
VAEX_CHUNK_SIZE_MIN=2048

Lastly, a global yaml file from $VAEX_PATH_HOME/.vaex/main.yaml is loaded (with last priority).

thread_count: 33
display:
  max_columns: 44
  max_rows: 20

If we now run vaex settings yaml, we see the effective settings as yaml output:

$ VAEX_NUM_THREADS=10 VAEX_DISPLAY_MAX_COLUMNS=50 vaex settings yaml
...
chunk:
  size: null
  size_min: 2048
  size_max: 1048576
display:
  max_columns: 50
  max_rows: 20
thread_count: 10
...

Developers

When updating vaex/settings.py, run the vaex settings watch to generate this documentation below automatically when saving the file.

Schema

A JSON schema can be generated using

$ vaex settings schema > vaex-settings.schema.json

Settings

General settings for vaex

aliases

Aliases to be used for vaex.open

Environmental variable: VAEX_ALIASES

Python settings vaex.settings.main.aliases

async

How to run async code in the local executor

Environmental variable: VAEX_ASYNC

Example use:

$ VAEX_ASYNC=nest python myscript.py

Python settings vaex.settings.main.async_

Example use: vaex.settings.main.async_ = 'nest'

home

Home directory for vaex, which defaults to $HOME/.vaex, If both $VAEX_HOME and $HOME are not defined, the current working directory is used. (Note that this setting cannot be configured from the vaex home directory itself).

Environmental variable: VAEX_HOME

Example use:

$ VAEX_HOME=/home/docs/.vaex python myscript.py

Python settings vaex.settings.main.home

Example use: vaex.settings.main.home = '/home/docs/.vaex'

mmap

Experimental to turn off, will avoid using memory mapping if set to False

Environmental variable: VAEX_MMAP

Example use:

$ VAEX_MMAP=True python myscript.py

Python settings vaex.settings.main.mmap

Example use: vaex.settings.main.mmap = True

process_count

Number of processes to use for multiprocessing (e.g. apply), defaults to thread_count setting

Environmental variable: VAEX_PROCESS_COUNT

Example use:

$ VAEX_PROCESS_COUNT=2 python myscript.py

Python settings vaex.settings.main.process_count

Example use: vaex.settings.main.process_count = 2

thread_count

Number of threads to use for computations, defaults to multiprocessing.cpu_count()

Environmental variable: VAEX_NUM_THREADS

Example use:

$ VAEX_NUM_THREADS=2 python myscript.py

Python settings vaex.settings.main.thread_count

Example use: vaex.settings.main.thread_count = 2

thread_count_io

Number of threads to use for IO, defaults to thread_count_io + 1

Environmental variable: VAEX_NUM_THREADS_IO

Example use:

$ VAEX_NUM_THREADS_IO=2 python myscript.py

Python settings vaex.settings.main.thread_count_io

Example use: vaex.settings.main.thread_count_io = 2

path_lock

Directory to store lock files for vaex, which defaults to ${VAEX_HOME}/lock/, Due to possible race conditions lock files cannot be removed while processes using Vaex are running (on Unix systems).

Environmental variable: VAEX_LOCK

Example use:

$ VAEX_LOCK=/home/docs/.vaex/lock python myscript.py

Python settings vaex.settings.main.path_lock

Example use: vaex.settings.main.path_lock = '/home/docs/.vaex/lock'

Cache

Setting for caching of computation or task results, see the API for more details.

type

Type of cache, e.g. ‘memory_infinite’, ‘memory’, ‘disk’, ‘redis’, or a multilevel cache, e.g. ‘memory,disk’

Environmental variable: VAEX_CACHE

Python settings vaex.settings.cache.type

disk_size_limit

Maximum size for cache on disk, e.g. 10GB, 500MB

Environmental variable: VAEX_CACHE_DISK_SIZE_LIMIT

Example use:

$ VAEX_CACHE_DISK_SIZE_LIMIT=10GB python myscript.py

Python settings vaex.settings.cache.disk_size_limit

Example use: vaex.settings.cache.disk_size_limit = '10GB'

memory_size_limit

Maximum size for cache in memory, e.g. 1GB, 500MB

Environmental variable: VAEX_CACHE_MEMORY_SIZE_LIMIT

Example use:

$ VAEX_CACHE_MEMORY_SIZE_LIMIT=1GB python myscript.py

Python settings vaex.settings.cache.memory_size_limit

Example use: vaex.settings.cache.memory_size_limit = '1GB'

path

Storage location for cache results. Defaults to ${VAEX_HOME}/cache

Environmental variable: VAEX_CACHE_PATH

Example use:

$ VAEX_CACHE_PATH=/home/docs/.vaex/cache python myscript.py

Python settings vaex.settings.cache.path

Example use: vaex.settings.cache.path = '/home/docs/.vaex/cache'

Chunk

Configure how a dataset is broken down in smaller chunks. The executor dynamically adjusts the chunk size based on size_min and size_max and the number of threads when size is not set.

size

When set, fixes the number of chunks, e.g. do not dynamically adjust between min and max

Environmental variable: VAEX_CHUNK_SIZE

Python settings vaex.settings.main.chunk.size

size_min

Minimum chunk size

Environmental variable: VAEX_CHUNK_SIZE_MIN

Example use:

$ VAEX_CHUNK_SIZE_MIN=1024 python myscript.py

Python settings vaex.settings.main.chunk.size_min

Example use: vaex.settings.main.chunk.size_min = 1024

size_max

Maximum chunk size

Environmental variable: VAEX_CHUNK_SIZE_MAX

Example use:

$ VAEX_CHUNK_SIZE_MAX=1048576 python myscript.py

Python settings vaex.settings.main.chunk.size_max

Example use: vaex.settings.main.chunk.size_max = 1048576

Data

Data configuration

path

Storage location for data files, like vaex.example(). Defaults to ${VAEX_HOME}/data/

Environmental variable: VAEX_DATA_PATH

Example use:

$ VAEX_DATA_PATH=/home/docs/.vaex/data python myscript.py

Python settings vaex.settings.data.path

Example use: vaex.settings.data.path = '/home/docs/.vaex/data'

Display

How a dataframe displays

max_columns

How many column to display when printing out a dataframe

Environmental variable: VAEX_DISPLAY_MAX_COLUMNS

Example use:

$ VAEX_DISPLAY_MAX_COLUMNS=200 python myscript.py

Python settings vaex.settings.display.max_columns

Example use: vaex.settings.display.max_columns = 200

max_rows

How many rows to print out before showing the first and last rows

Environmental variable: VAEX_DISPLAY_MAX_ROWS

Example use:

$ VAEX_DISPLAY_MAX_ROWS=10 python myscript.py

Python settings vaex.settings.display.max_rows

Example use: vaex.settings.display.max_rows = 10

FileSystem

Filesystem configuration

path

Storage location for caching files from remote file systems. Defaults to ${VAEX_HOME}/file-cache/

Environmental variable: VAEX_FS_PATH

Example use:

$ VAEX_FS_PATH=/home/docs/.vaex/file-cache python myscript.py

Python settings vaex.settings.fs.path

Example use: vaex.settings.fs.path = '/home/docs/.vaex/file-cache'

MemoryTracker

Memory tracking/protection when using vaex in a service

type

Which memory tracker to use when executing tasks

Environmental variable: VAEX_MEMORY_TRACKER

Example use:

$ VAEX_MEMORY_TRACKER=default python myscript.py

Python settings vaex.settings.main.memory_tracker.type

Example use: vaex.settings.main.memory_tracker.type = 'default'

max

How much memory the executor can use maximally (only used for type=’limit’)

Environmental variable: VAEX_MEMORY_TRACKER_MAX

Python settings vaex.settings.main.memory_tracker.max

TaskTracker

task tracking/protection when using vaex in a service

type

Comma seperated string of trackers to run while executing tasks

Environmental variable: VAEX_TASK_TRACKER

Example use:

$ VAEX_TASK_TRACKER= python myscript.py

Python settings vaex.settings.main.task_tracker.type

Logging

Configure logging for Vaex. By default Vaex sets up logging, which is useful when running a script. When Vaex is used in applications or services that already configure logging, set the environomental variables VAEX_LOGGING_SETUP to false.

See the API docs for more details.

Note that settings vaex.settings.main.logging.info etc at runtime, has no direct effect, since logging is already configured. When needed, call vaex.logging.reset() and vaex.logging.setup() to reconfigure logging.

setup

Setup logging for Vaex at import time.

Environmental variable: VAEX_LOGGING_SETUP

Example use:

$ VAEX_LOGGING_SETUP=True python myscript.py

Python settings vaex.settings.main.logging.setup

Example use: vaex.settings.main.logging.setup = True

rich

Use rich logger (colored fancy output).

Environmental variable: VAEX_LOGGING_RICH

Example use:

$ VAEX_LOGGING_RICH=True python myscript.py

Python settings vaex.settings.main.logging.rich

Example use: vaex.settings.main.logging.rich = True

debug

Comma seperated list of loggers to set to the debug level (e.g. ‘vaex.settings,vaex.cache’), or a ‘1’ to set the root logger (‘vaex’)

Environmental variable: VAEX_LOGGING_DEBUG

Example use:

$ VAEX_LOGGING_DEBUG= python myscript.py

Python settings vaex.settings.main.logging.debug

info

Comma seperated list of loggers to set to the info level (e.g. ‘vaex.settings,vaex.cache’), or a ‘1’ to set the root logger (‘vaex’)

Environmental variable: VAEX_LOGGING_INFO

Example use:

$ VAEX_LOGGING_INFO= python myscript.py

Python settings vaex.settings.main.logging.info

warning

Comma seperated list of loggers to set to the warning level (e.g. ‘vaex.settings,vaex.cache’), or a ‘1’ to set the root logger (‘vaex’)

Environmental variable: VAEX_LOGGING_WARNING

Example use:

$ VAEX_LOGGING_WARNING=vaex python myscript.py

Python settings vaex.settings.main.logging.warning

Example use: vaex.settings.main.logging.warning = 'vaex'

error

Comma seperated list of loggers to set to the error level (e.g. ‘vaex.settings,vaex.cache’), or a ‘1’ to set the root logger (‘vaex’)

Environmental variable: VAEX_LOGGING_ERROR

Example use:

$ VAEX_LOGGING_ERROR= python myscript.py

Python settings vaex.settings.main.logging.error

Progress

Data configuration

type

Default progressbar to show: ‘simple’, ‘rich’ or ‘widget’

Environmental variable: VAEX_PROGRESS_TYPE

Example use:

$ VAEX_PROGRESS_TYPE=simple python myscript.py

Python settings vaex.settings.main.progress.type

Example use: vaex.settings.main.progress.type = 'simple'

force

Force showing a progress bar of this type, even when no progress bar was requested from user code

Environmental variable: VAEX_PROGRESS

Python settings vaex.settings.main.progress.force

Settings

Configuration options for the FastAPI server

add_example

Add example dataset

Environmental variable: VAEX_SERVER_ADD_EXAMPLE

Example use:

$ VAEX_SERVER_ADD_EXAMPLE=True python myscript.py

Python settings vaex.settings.server.add_example

Example use: vaex.settings.server.add_example = True

graphql

Add graphql endpoint

Environmental variable: VAEX_SERVER_GRAPHQL

Example use:

$ VAEX_SERVER_GRAPHQL=False python myscript.py

Python settings vaex.settings.server.graphql

Example use: vaex.settings.server.graphql = False

files

Mapping of name to path

Environmental variable: VAEX_SERVER_FILES

Python settings vaex.settings.server.files

API documentation for vaex library

Quick lists

Opening/reading in your data.

vaex.open(path[, convert, progress, ...])

Open a DataFrame from file given by path.

vaex.open_many(filenames)

Open a list of filenames, and return a DataFrame with all DataFrames concatenated.

vaex.from_arrays(**arrays)

Create an in memory DataFrame from numpy arrays.

vaex.from_arrow_dataset(arrow_dataset)

Create a DataFrame from an Apache Arrow dataset.

vaex.from_arrow_table(table)

Creates a vaex DataFrame from an arrow Table.

vaex.from_ascii(path[, seperator, names, ...])

Create an in memory DataFrame from an ascii file (whitespace seperated by default).

vaex.from_astropy_table(table)

Create a vaex DataFrame from an Astropy Table.

vaex.from_csv(filename_or_buffer[, ...])

Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.

vaex.from_csv_arrow(file[, read_options, ...])

Fast CSV reader using Apache Arrow.

vaex.from_dataset(dataset)

Create a Vaex DataFrame from a Vaex Dataset

vaex.from_dict(data)

Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values

vaex.from_items(*items)

Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).

vaex.from_json(path_or_buffer[, orient, ...])

A method to read a JSON file using pandas, and convert to a DataFrame directly.

vaex.from_pandas(df[, name, copy_index, ...])

Create an in memory DataFrame from a pandas DataFrame.

vaex.from_records(records[, array_type, ...])

Create a dataframe from a list of dict.

Visualizations.

vaex.viz.DataFrameAccessorViz.heatmap([x, ...])

Viz data in a 2d histogram/heatmap.

vaex.viz.DataFrameAccessorViz.histogram([x, ...])

Plot a histogram.

vaex.viz.DataFrameAccessorViz.scatter(x, y)

Viz (small amounts) of data in 2d using a scatter plot

Statistics.

vaex.dataframe.DataFrame.correlation(x[, y, ...])

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.count([expression, ...])

Count the number of non-NaN values (or all, if expression is None or "*").

vaex.dataframe.DataFrame.cov(x[, y, binby, ...])

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.max(expression[, ...])

Calculate the maximum for given expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.mean(expression[, ...])

Calculate the mean for expression, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.median_approx(...)

Calculate the median, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.min(expression[, ...])

Calculate the minimum for given expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.minmax(expression)

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.mode(expression[, ...])

Calculate/estimate the mode.

vaex.dataframe.DataFrame.mutual_information(x)

Estimate the mutual information between and x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.std(expression[, ...])

Calculate the standard deviation for the given expression, possible on a grid defined by binby

vaex.dataframe.DataFrame.unique(expression)

Returns all unique values.

vaex.dataframe.DataFrame.var(expression[, ...])

Calculate the sample variance for the given expression, possible on a grid defined by binby

vaex-core

Vaex is a library for dealing with larger than memory DataFrames (out of core).

The most important class (datastructure) in vaex is the DataFrame. A DataFrame is obtained by either opening the example dataset:

>>> import vaex
>>> df = vaex.example()

Or using open() to open a file.

>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df2 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")

Or connecting to a remove server:

>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")

A few strong features of vaex are:

  • Performance: works with huge tabular data, process over a billion (> 109) rows/second.

  • Expression system / Virtual columns: compute on the fly, without wasting ram.

  • Memory efficient: no memory copies when doing filtering/selections/subsets.

  • Visualization: directly supported, a one-liner is often enough.

  • User friendly API: you will only need to deal with a DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.

  • Very fast statistics on N dimensional grids such as histograms, running mean, heatmaps.

Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.

vaex.concat(dfs, resolver='flexible') vaex.dataframe.DataFrame[source]

Concatenate a list of DataFrames.

Parameters

resolver – How to resolve schema conflicts, see DataFrame.concat().

vaex.delayed(f)[source]

Decorator to transparantly accept delayed computation.

Example:

>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
>>>                   shape=4, delay=True)
>>> @vaex.delayed
>>> def total_sum(sums):
>>>     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()
See the tutorial for a more complete example https://docs.vaex.io/en/latest/tutorial.html#Parallel-computations
vaex.example()[source]

Result of an N-body simulation of the accretion of 33 satellite galaxies into a Milky Way dark matter halo.

Data was greated by Helmi & de Zeeuw 2000. The data contains the position (x, y, z), velocitie (vx, vy, vz), the energy (E), the angular momentum (L, Lz) and iron content (FeH) of the particles.

Return type

DataFrame

vaex.from_arrays(**arrays) vaex.dataframe.DataFrameLocal[source]

Create an in memory DataFrame from numpy arrays.

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters

arrays – keyword arguments with arrays

Return type

DataFrame

vaex.from_arrow_dataset(arrow_dataset) vaex.dataframe.DataFrame[source]

Create a DataFrame from an Apache Arrow dataset.

vaex.from_arrow_table(table) vaex.dataframe.DataFrame[source]

Creates a vaex DataFrame from an arrow Table.

Parameters

as_numpy – Will lazily cast columns to a NumPy ndarray.

Return type

DataFrame

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory DataFrame from an ascii file (whitespace seperated by default).

>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters
  • path – file path

  • seperator – value seperator, by default whitespace, use “,” for comma seperated values.

  • names – If True, the first line is used for the column names, otherwise provide a list of strings with names

  • skip_lines – skip lines at the start of the file

  • skip_after – skip lines at the end of the file

  • kwargs

Return type

DataFrame

vaex.from_astropy_table(table)[source]

Create a vaex DataFrame from an Astropy Table.

vaex.from_csv(filename_or_buffer, copy_index=False, chunk_size=None, convert=False, fs_options={}, progress=None, fs=None, **kwargs)[source]

Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.

Parameters
  • filename_or_buffer (str or file) – CSV file path or file-like

  • copy_index (bool) – copy index when source is read via Pandas

  • chunk_size (int) –

    if the CSV file is too big to fit in the memory this parameter can be used to read CSV file in chunks. For example:

    >>> import vaex
    >>> for i, df in enumerate(vaex.read_csv('taxi.csv', chunk_size=100_000)):
    >>>     df = df[df.passenger_count < 6]
    >>>     df.export_hdf5(f'taxi_{i:02}.hdf5')
    

  • convert (bool or str) – convert files to an hdf5 file for optimization, can also be a path. The CSV file will be read in chunks: either using the provided chunk_size argument, or a default size. Each chunk will be saved as a separate hdf5 file, then all of them will be combined into one hdf5 file. So for a big CSV file you will need at least double of extra space on the disk. Default chunk_size for converting is 5 million rows, which corresponds to around 1Gb memory on an example of NYC Taxi dataset.

  • progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • kwargs – extra keyword arguments, currently passed to Pandas read_csv function, but the implementation might change in future versions.

Returns

DataFrame

vaex.from_dataset(dataset: vaex.dataset.Dataset) vaex.dataframe.DataFrame[source]

Create a Vaex DataFrame from a Vaex Dataset

vaex.from_dict(data)[source]

Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values

Example

>>> data = {'A':[1,2,3],'B':['a','b','c']}
>>> vaex.from_dict(data)
  #    A    B
  0    1   'a'
  1    2   'b'
  2    3   'c'
Parameters

data – A dict of {column:[value, value,…]}

Return type

DataFrame

vaex.from_items(*items)[source]

Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters

items – list of [(name, numpy array), …]

Return type

DataFrame

vaex.from_json(path_or_buffer, orient=None, precise_float=False, lines=False, copy_index=False, **kwargs)[source]

A method to read a JSON file using pandas, and convert to a DataFrame directly.

Parameters
  • path_or_buffer (str) – a valid JSON string or file-like, default: None The string could be a URL. Valid URL schemes include http, ftp, s3, gcs, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

  • orient (str) – Indication of expected JSON string format. Allowed values are split, records, index, columns, and values.

  • precise_float (bool) – Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality

  • lines (bool) – Read the file as a json object per line.

Return type

DataFrame

vaex.from_pandas(df, name='pandas', copy_index=False, index_name='index')[source]

Create an in memory DataFrame from a pandas DataFrame.

Param

pandas.DataFrame df: Pandas DataFrame

Param

name: unique for the DataFrame

>>> import vaex, pandas as pd
>>> df_pandas = pd.from_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
Return type

DataFrame

vaex.from_records(records: List[Dict], array_type='arrow', defaults={}) vaex.dataframe.DataFrame[source]

Create a dataframe from a list of dict.

Warning

This is for convenience only, for performance pass arrays to from_arrays() for instance.

Parameters
  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

  • defaults (dict) – default values if a record has a missing entry

vaex.open(path, convert=False, progress=None, shuffle=False, fs_options={}, fs=None, *args, **kwargs)[source]

Open a DataFrame from file given by path.

Example:

>>> df = vaex.open('sometable.hdf5')
>>> df = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters
  • path (str or list) – local or absolute path to file, or glob string, or list of paths

  • convert – Uses dataframe.export when convert is a path. If True, convert=path+'.hdf5' The conversion is skipped if the input file or conversion argument did not change.

  • progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • shuffle (bool) – shuffle converted DataFrame or not

  • fs_options (dict) – Extra arguments passed to an optional file system if needed. See below

  • group – (optional) Specify the group to be read from and HDF5 file. By default this is set to “/table”.

  • fs – Apache Arrow FileSystem object, or FSSpec FileSystem object, if specified, fs_options should be empty.

  • args – extra arguments for file readers that need it

  • kwargs – extra keyword arguments

Returns

return a DataFrame on success, otherwise None

Return type

DataFrame

Note: From version 4.14.0 vaex.open() will lazily read CSV files. If you prefer to read the entire CSV file into memory, use vaex.from_csv() or vaex.from_csv_arrow() instead.

Cloud storage support:

Vaex supports streaming of HDF5 files from Amazon AWS S3 and Google Cloud Storage. Files are by default cached in $HOME/.vaex/file-cache/(s3|gs) such that successive access is as fast as native disk access.

Amazon AWS S3 options:

The following common fs_options are used for S3 access:

  • anon: Use anonymous access or not (false by default). (Allowed values are: true,True,1,false,False,0)

  • anonymous - Alias for anon

  • cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0)

  • access_key - AWS access key, if not provided will use the standard env vars, or the ~/.aws/credentials file

  • secret_key - AWS secret key, similar to access_key

  • profile - If multiple profiles are present in ~/.aws/credentials, pick this one instead of ‘default’, see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

  • region - AWS Region, e.g. ‘us-east-1`, will be determined automatically if not provided.

  • endpoint_override - URL/ip to connect to, instead of AWS, e.g. ‘localhost:9000’ for minio

All fs_options can also be encoded in the file path as a query string.

Examples:

>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5', fs_options={'anonymous': True})
>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5', fs_options={'access_key': my_key, 'secret_key': my_secret_key})
>>> df = vaex.open(f's3://mybucket/path/to/file.hdf5?access_key={my_key}&secret_key={my_secret_key}')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5?profile=myproject')

Google Cloud Storage options:

The following fs_options are used for GCP access:

Examples:

>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5', fs_options={'token': None})
>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5?token=anon')
>>> df = vaex.open('gs://vaex-data/testing/xys.hdf5?token=anon&cache=False')
vaex.open_many(filenames)[source]

Open a list of filenames, and return a DataFrame with all DataFrames concatenated.

The filenames can be of any format that is supported by vaex.open(), namely hdf5, arrow, parquet, csv, etc.

Parameters

filenames (list[str]) – list of filenames/paths

Return type

DataFrame

vaex.register_function(scope=None, as_property=False, name=None, on_expression=True, df_accessor=None, multiprocessing=False)[source]

Decorator to register a new function with vaex.

If on_expression is True, the function will be available as a method on an Expression, where the first argument will be the expression itself.

If df_accessor is given, it is added as a method to that dataframe accessor (see e.g. vaex/geo.py)

Example:

>>> import vaex
>>> df = vaex.example()
>>> @vaex.register_function()
>>> def invert(x):
>>>     return 1/x
>>> df.x.invert()
>>> import numpy as np
>>> df = vaex.from_arrays(departure=np.arange('2015-01-01', '2015-12-05', dtype='datetime64'))
>>> @vaex.register_function(as_property=True, scope='dt')
>>> def dt_relative_day(x):
>>>     return vaex.functions.dt_dayofyear(x)/365.
>>> df.departure.dt.relative_day
vaex.vconstant(value, length, dtype=None, chunk_size=1024)[source]

Creates a virtual column with constant values, which uses 0 memory.

Parameters
  • value – The value with which to fill the column

  • length – The length of the column, i.e. the number of rows it should contain.

  • dtype – The preferred dtype for the column.

  • chunk_size – Could be used to optimize the performance (evaluation) of this column.

vaex.vrange(start, stop, step=1, dtype='f8')[source]

Creates a virtual column which is the equivalent of numpy.arange, but uses 0 memory

Parameters
  • start (int) – Start of interval. The interval includes this value.

  • stop (int) – End of interval. The interval does not include this value,

  • step (int) – Spacing between values.

Dtype

The preferred dtype for the column.

Aggregation and statistics

class vaex.stat.Expression[source]

Bases: object

Describes an expression for a statistic

calculate(ds, binby=[], shape=256, limits=None, selection=None)[source]

Calculate the statistic for a Dataset

vaex.stat.correlation(x, y)[source]

Creates a standard deviation statistic

vaex.stat.count(expression='*')[source]

Creates a count statistic

vaex.stat.covar(x, y)[source]

Creates a standard deviation statistic

vaex.stat.mean(expression)[source]

Creates a mean statistic

vaex.stat.std(expression)[source]

Creates a standard deviation statistic

vaex.stat.sum(expression)[source]

Creates a sum statistic

class vaex.agg.AggregatorDescriptorKurtosis(name, expression, short_name='kurtosis', selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorMean(name, expressions, short_name='mean', selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorMulti(name, expressions, short_name, selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptor

Uses multiple operations/aggregation to calculate the final aggretation

class vaex.agg.AggregatorDescriptorSkew(name, expression, short_name='skew', selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorStd(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorVar

class vaex.agg.AggregatorDescriptorVar(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorMulti

vaex.agg.all(expression=None, selection=None)[source]

Aggregator that returns True when all of the values in the group are True, or when all of the data in the group is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.

Parameters
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.any(expression=None, selection=None)[source]

Aggregator that returns True when any of the values in the group are True, or when there is any data in the group that is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.

Parameters
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.count(expression='*', selection=None, edges=False)[source]

Creates a count aggregation

vaex.agg.first(expression, order_expression=None, selection=None, edges=False)[source]

Creates a first aggregation.

Parameters
  • expression – {expression_one}.

  • order_expression – Order the values in the bins by this expression.

  • selection – {selection1}

  • edges – {edges}

vaex.agg.kurtosis(expression, selection=None, edges=False)[source]

Create a kurtosis aggregation.

vaex.agg.last(expression, order_expression=None, selection=None, edges=False)[source]

Creates a first aggregation.

Parameters
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y .

  • order_expression – Order the values in the bins by this expression.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1

class vaex.agg.list(expression, selection=None, dropna=False, dropnan=False, dropmissing=False, edges=False)[source]

Bases: vaex.agg.AggregatorDescriptorBasic

Aggregator that returns a list of values belonging to the specified expression.

Parameters
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • dropmissing – Drop rows with missing values

  • dropnan – Drop rows with NaN values

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1

vaex.agg.max(expression, selection=None, edges=False)[source]

Creates a max aggregation

vaex.agg.mean(expression, selection=None, edges=False)[source]

Creates a mean aggregation

vaex.agg.min(expression, selection=None, edges=False)[source]

Creates a min aggregation

vaex.agg.nunique(expression, dropna=False, dropnan=False, dropmissing=False, selection=None, edges=False)[source]

Aggregator that calculates the number of unique items per bin.

Parameters
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • dropmissing – Drop rows with missing values

  • dropnan – Drop rows with NaN values

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.skew(expression, selection=None, edges=False)[source]

Create a skew aggregation.

vaex.agg.std(expression, ddof=0, selection=None, edges=False)[source]

Creates a standard deviation aggregation

vaex.agg.sum(expression, selection=None, edges=False)[source]

Creates a sum aggregation

vaex.agg.var(expression, ddof=0, selection=None, edges=False)[source]

Creates a variance aggregation

Caching

(Currently experimental, use at own risk) Vaex can cache task results, such as aggregations, or the internal hashmaps used for groupby to make recurring calculations much faster, at the cost of calculating cache keys and storing/retrieving the cached values.

Internally, Vaex calculates fingerprints (such as hashes of data, or file paths and mtimes) to create cache keys that are similar across processes, such that a restart of a process will most likely result in similar hash keys.

Caches can turned on globally, or used as a context manager:

>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite()  # cache on globally
<cache restore context manager>
>>> vaex.cache.is_on()
True
>>> vaex.cache.off() # cache off globally
<cache restore context manager>
>>> vaex.cache.is_on()
False
>>> with vaex.cache.memory_infinite():
...     df.x.sum()  # calculated without cache
array(-20884.64307324)
>>> vaex.cache.is_on()
False

The functions vaex.cache.set() and vaex.cache.get() simply look up the values in a global dict (vaex.cache.cache), but can be set for more complex behaviour.

A good library to use for in-memory caching is cachetools (https://pypi.org/project/cachetools/)

>>> import vaex
>>> import cachetools
>>> df = vaex.example()
>>> vaex.cache.cache = cachetools.LRUCache(1_000_000_000)  # 1gb cache

Configure using environment variables

See Configuration for more configuration options.

Especially when using the vaex server it can be useful to turn on caching externally using enviroment variables.

$ VAEX_CACHE=disk VAEX_CACHE_DISK_SIZE_LIMIT=”10GB” python -m vaex.server

Will enable caching using vaex.cache.disk() and configure it to use at max 10 GB of disk space.

When using Vaex in combination with Flask or Plotly Dash, and using gunicorn for scaling, it can be useful to use a multilevel cache, where the first cache is small but low latency (and private for each progress), and a second higher latency disk cache that is shared among all processes.

$ VAEX_CACHE=”memory,disk” VAEX_CACHE_DISK_SIZE_LIMIT=”10GB” VAEX_CACHE_MEMORY_SIZE_LIMIT=”1GB” gunicorn -w 16 app:server

vaex.cache.disk(clear=False, size_limit='10GB', eviction_policy='least-recently-stored')[source]

Stored cached values using the diskcache library.

See configuration details at configuration of cache. and configuration of paths

Parameters
vaex.cache.get(key, default=None, type=None)[source]

Looks up the cache value for the key, or returns the default

Will return None if the cache is turned off.

Parameters
  • key (str) – Cache key.

  • default – Return when cache is on, but key not in cache

  • type – Currently unused.

vaex.cache.is_on()[source]

Returns True when caching is enabled

vaex.cache.memory(maxsize='1GB', classname='LRUCache', clear=False)[source]

Sets a memory cache using cachetools (https://cachetools.readthedocs.io/).

Calling multiple times with clear=False will keep the current cache (useful in notebook usage).

Parameters
  • maxsize (int or str) – Max size of cache in bytes (or use a string like ‘128MB’)

  • classname (str) – classname in the cachetools library used for the cache (e.g. LRUCache, MRUCache).

  • clear (bool) – If False, will always set a new cache, when true, it will keep the cache when it is of the same type.

vaex.cache.memory_infinite(clear=False)[source]

Sets a dict a cache, creating an infinite cache.

Calling multiple times with clear=False will keep the current cache (useful in notebook usage)

vaex.cache.off()[source]

Turns off caching, or temporary when used as context manager

>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite()  # cache on
<cache restore context manager>
>>> with vaex.cache.off():
...     df.x.sum()  # calculated without cache
array(-20884.64307324)
>>> df.x.sum()  # calculated with cache
array(-20884.64307324)
>>> vaex.cache.off()  # cache off
<cache restore context manager>
>>> df.x.sum()  # calculated without cache
array(-20884.64307324)
vaex.cache.redis(client=None)[source]

Uses Redis for caching.

Parameters

client – Redis client, if None, will call redis.Redis()

vaex.cache.set(key, value, type=None, duration_wallclock=None)[source]

Set a cache value

Useful to more advanced strategies, where we want to have different behaviour based on the type and costs. Implementations can set this function override the default behaviour:

>>> import vaex
>>> vaex.cache.memory_infinite()  
>>> def my_smart_cache_setter(key, value, type=None, duration_wallclock=None):
...     if duration_wallclock >= 0.1:  # skip fast calculations
...         vaex.cache.cache[key] = value
...
>>> vaex.cache.set = my_smart_cache_setter
Parameters
  • key (str) – key for caching

  • type – Currently unused.

  • duration_wallclock (float) – Time spend on calculating the result (in wallclock time).

  • value – Any value, typically needs to be pickleable (unless stored in memory)

DataFrame class

class vaex.dataframe.DataFrame(name=None, executor=None)[source]

Bases: object

All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.

Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.

All DataFrames have multiple ‘selection’, and all calculations are done on the whole DataFrame (default) or for the selection. The following example shows how to use the selection.

>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])
__dataframe__(nan_as_null: bool = False, allow_copy: bool = True)[source]
__delitem__(item)[source]

Alias of df.drop(item, inplace=True)

__getitem__(item)[source]

Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.

Example:

>>> df['Lz']  # the expression 'Lz
>>> df['Lz/2'] # the expression 'Lz/2'
>>> df[["Lz", "E"]] # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
__init__(name=None, executor=None)[source]
__iter__()[source]

Iterator over the column names.

__len__()[source]

Returns the number of rows in the DataFrame (filtering applied).

__repr__()[source]

Return repr(self).

__setitem__(name, value)[source]

Convenient way to add a virtual column / expression to this DataFrame.

Example:

>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
__str__()[source]

Return str(self).

__weakref__

list of weak references to the object (if defined)

add_column(name, f_or_array, dtype=None)[source]

Add an in memory array as a column.

add_variable(name, expression, overwrite=True, unique=True)[source]

Add a variable to a DataFrame.

A variable may refer to other variables, and virtual columns and expression may refer to variables.

Example

>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
Param

str name: name of virtual varible

Param

expression: expression for the variable

add_virtual_column(name, expression, unique=False)[source]

Add a virtual column to the DataFrame.

Example:

>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> df.select("r < 10")
Param

str name: name of virtual column

Param

expression: expression for the column

Parameters

unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2

apply(f, arguments=None, vectorize=False, multiprocessing=True)[source]

Apply a function on a per row basis across the entire DataFrame.

Example:

>>> import vaex
>>> df = vaex.example()
>>> def func(x, y):
...     return (x+y)/(x-y)
...
>>> df.apply(func, arguments=[df.x, df.y])
Expression = lambda_function(x, y)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
     0  -0.460789
     1    3.90038
     2  -0.642851
     3   0.685768
     4  -0.543357
Parameters
  • f – The function to be applied

  • arguments – List of arguments to be passed on to the function f.

  • vectorize – Call f with arrays instead of a scalars (for better performance).

  • multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).

Returns

A function that is lazily evaluated.

byte_size(selection=False, virtual=False)[source]

Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.

cat(i1, i2, format='html')[source]

Display the DataFrame from row i1 till i2

For format, see https://pypi.org/project/tabulate/

Parameters
  • i1 (int) – Start row

  • i2 (int) – End row.

  • format (str) – Format to use, e.g. ‘html’, ‘plain’, ‘latex’

close()[source]

Close any possible open file handles or other resources, the DataFrame will not be in a usable state afterwards.

property col

Gives direct access to the columns only (useful for tab completion).

Convenient when working with ipython in combination with small DataFrames, since this gives tab-completion.

Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.

Example

>>> ds = vaex.example()
>>> df.plot(df.col.x, df.col.y)
column_count(hidden=False)[source]

Returns the number of columns (including virtual columns).

Parameters

hidden (bool) – If True, include hidden columns in the tally

Returns

Number of columns in the DataFrame

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension.

Parameters
  • expressions_list – list of list of expressions, where the inner list defines the subspace

  • dimensions – if given, generates a subspace with all possible combinations for that dimension

  • exclude – list of

correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None, array_type=None)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.

The x and y arguments can be single expressions of lists of expressions. - If x and y are single expression, it computes the correlation between x and y; - If x is a list of expressions and y is a single expression, it computes the correlation between each expression in x and the expression in y; - If x is a list of expressions and y is None, it computes the correlation matrix amongst all expressions in x; - If x is a list of tuples of length 2, it computes the correlation for the specified dimension pairs; - If x and y are lists of expressions, it computes the correlation matrix defined by the two expression lists.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
>>> df.correlation(x=['x', 'y', 'z'])
array([[ 1.        , -0.06668907, -0.02709719],
       [-0.06668907,  1.        ,  0.03450365],
       [-0.02709719,  0.03450365,  1.        ]])
>>> df.correlation(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[-0.01116315, -0.00369268],
       [-0.0059848 ,  0.02472491],
       [ 0.01428211, -0.05900035]])
Parameters
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”).

Example:

>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.

Either x and y are expressions, e.g.:

>>> df.cov("x", "y")

Or only the x argument is given with a list of expressions, e.g.:

>>> df.cov(["x, "y, "z"])

Example:

>>> df.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
[ -3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
[ -3.8123135 ,  60.62257881,   1.21381057],
[ -0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
[ -3.02004780e-02,   9.99288215e+00]],
[[  8.43996546e+01,  -6.51984181e+00],
[ -6.51984181e+00,   9.68938284e+01]]])
Parameters
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • y – if previous argument is not a list, this argument should be given

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)

covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.

Example:

>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")/(df.std("x**2+y**2+z**2") * df.std("-log(-E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

data_type(expression, array_type=None, internal=False, axis=0)[source]

Return the datatype for the given expression, if not a column, the first row will be evaluated to get the data type.

Example:

>>> df = vaex.from_scalars(x=1, s='Hi')
Parameters
  • array_type (str) – ‘numpy’, ‘arrow’ or None, to indicate if the data type should be converted

  • axis (int) – If a nested type (like list), it will return the value_type of the nested type, axis levels deep.

delete_variable(name)[source]

Deletes a variable from a DataFrame.

delete_virtual_column(name)[source]

Deletes a virtual column from a DataFrame.

describe(strings=True, virtual=True, selection=None)[source]

Give a description of the DataFrame.

>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                 x          y          z
dtype      float64    float64    float64
count       330000     330000     330000
missing          0          0          0
mean    -0.0671315 -0.0535899  0.0169582
std        7.31746    7.78605    5.05521
min       -128.294   -71.5524   -44.3342
max        271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                   x         y          z
dtype        float64   float64    float64
count         164060    164060     164060
missing       165940    165940     165940
mean         5.13572 -0.486786 -0.0868073
std          5.18701   7.61621    5.02831
min      1.51635e-05  -71.5524   -44.3342
max          271.366   78.0724    40.2191
Parameters
  • strings (bool) – Describe string columns or not

  • virtual (bool) – Describe virtual columns or not

  • selection – Optional selection to use.

Returns

Pandas dataframe

diff(periods=1, column=None, fill_value=None, trim=False, inplace=False, reverse=False)[source]

Calculate the difference between the current row and the row offset by periods

Parameters
  • periods (int) – Which row to take the difference with

  • column (str or list[str]) – Column or list of columns to use (default is all).

  • fill_value – Value to use instead of missing values.

  • trim (bool) – Do not include rows that would otherwise have missing values

  • reverse (bool) – When true, calculate row[periods] - row[current]

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

drop(columns, inplace=False, check=True)[source]

Drop columns (or a single column).

Parameters
  • columns – List of columns or a single column name

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

  • check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.

drop_filter(inplace=False)[source]

Removes all filters from the DataFrame

dropinf(column_names=None, how='any')[source]

Create a shallow copy of a DataFrame, with filtering set using isinf.

Parameters
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are inf. If “all”, then drop rows where all of the columns are inf.

Return type

DataFrame

dropmissing(column_names=None, how='any')[source]

Create a shallow copy of a DataFrame, with filtering set using ismissing.

Parameters
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are missing. If “all”, then drop rows where all of the columns are missing.

Return type

DataFrame

dropna(column_names=None, how='any')[source]

Create a shallow copy of a DataFrame, with filtering set using isna.

Parameters
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are na. If “all”, then drop rows where all of the columns are na.

Return type

DataFrame

dropnan(column_names=None, how='any')[source]

Create a shallow copy of a DataFrame, with filtering set using isnan.

Parameters
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are nan. If “all”, then drop rows where all of the columns are nan.

Return type

DataFrame

property dtypes

Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).

evaluate(expression, i1=None, i2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, progress=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2

Parameters
  • expression (str) – Name/expression to evaluate

  • i1 (int) – Start row index, default is the start (0)

  • i2 (int) – End row index, default is the length of the DataFrame

  • out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)

  • progress – {progress}

  • selection – selection to apply

Returns

evaluate_iterator(expression, s1=None, s2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, prefetch=True, progress=None)[source]

Generator to efficiently evaluate expressions in chunks (number of rows).

See DataFrame.evaluate() for other arguments.

Example:

>>> import vaex
>>> df = vaex.example()
>>> for i1, i2, chunk in df.evaluate_iterator(df.x, chunk_size=100_000):
...     print(f"Total of {i1} to {i2} = {chunk.sum()}")
...
Total of 0 to 100000 = -7460.610158279056
Total of 100000 to 200000 = -4964.85827154921
Total of 200000 to 300000 = -7303.271340043915
Total of 300000 to 330000 = -2424.65234724951
Parameters
  • progress – {progress}

  • prefetch – Prefetch/compute the next chunk in parallel while the current value is yielded/returned.

evaluate_variable(name)[source]

Evaluates the variable given by name.

execute()[source]

Execute all delayed jobs.

async execute_async()[source]

Async version of execute

extract()[source]

Return a DataFrame containing only the filtered rows.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).

If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()

Return type

DataFrame

fillna(value, column_names=None, prefix='__original_', inplace=False)[source]

Return a DataFrame, where missing values/NaN are filled with ‘value’.

The original columns will be renamed, and by default they will be hidden columns. No data is lost.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filtering will be ignored (since they may change), you may want to consider running extract() first.

Example:

>>> import vaex
>>> import numpy as np
>>> x = np.array([3, 1, np.nan, 10, np.nan])
>>> df = vaex.from_arrays(x=x)
>>> df_filled = df.fillna(value=-1, column_names=['x'])
>>> df_filled
  #    x
  0    3
  1    1
  2   -1
  3   10
  4   -1
Parameters
  • value (float) – The value to use for filling nan or masked values.

  • fill_na (bool) – If True, fill np.nan values with value.

  • fill_masked (bool) – If True, fill masked values with values.

  • column_names (list) – List of column names in which to fill missing values.

  • prefix (str) – The prefix to give the original columns.

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

filter(expression, mode='and')[source]

General version of df[<boolean expression>] to modify the filter applied to the DataFrame.

See DataFrame.select() for usage of selection.

Note that using df = df[<boolean expression>], one can only narrow the filter (i.e. only less rows can be selected). Using the filter method, and a different boolean mode (e.g. “or”) one can actually cause more rows to be selected. This differs greatly from numpy and pandas for instance, which can only narrow the filter.

Example:

>>> import vaex
>>> import numpy as np
>>> x = np.arange(10)
>>> df = vaex.from_arrays(x=x, y=x**2)
>>> df
#    x    y
0    0    0
1    1    1
2    2    4
3    3    9
4    4   16
5    5   25
6    6   36
7    7   49
8    8   64
9    9   81
>>> dff = df[df.x<=2]
>>> dff
#    x    y
0    0    0
1    1    1
2    2    4
>>> dff = dff.filter(dff.x >=7, mode="or")
>>> dff
#    x    y
0    0    0
1    1    1
2    2    4
3    7   49
4    8   64
5    9   81
fingerprint(dependencies=None, treeshake=False)[source]

Id that uniquely identifies a dataframe (cross runtime).

Parameters
  • dependencies (set[str]) – set of column, virtual column, function or selection names to be used.

  • treeshake (bool) – Get rid of unused variables before calculating the fingerprint.

first(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]

Return the first element of a binned expression, where the values each bin are sorted by order_expression.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.first(df.x, df.y, shape=8)
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
array([-4.81883764, 11.65378   ,  9.70084476, -7.3025589 ,  4.84954977,
        8.47446537, -5.73602629, 10.18783   ])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • order_expression – Order the values in the bins by this expression.

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Ndarray containing the first elements.

Return type

numpy.array

get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows.

get_column_names(virtual=True, strings=True, hidden=False, regex=None, dtype=None)[source]

Return a list of column names

Example:

>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
>>> df.get_column_names(dtype='string')
['s']
Parameters
  • virtual – If False, skip virtual columns

  • hidden – If False, skip hidden columns

  • strings – If False, skip string columns

  • regex – Only return column names matching the (optional) regular expression

  • dtype – Only return column names with the given dtype. Can be a single or a list of dtypes.

Return type

list of str

get_current_row()[source]

Individual rows can be ‘picked’, this is the index (integer) of the current row, or None there is nothing picked.

get_names(hidden=False)[source]

Return a list of column names and variable names.

get_private_dir(create=False)[source]

Each DataFrame has a directory where files are stored for metadata etc.

Example

>>> import vaex
>>> ds = vaex.example()
>>> vaex.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters

create (bool) – is True, it will create the directory if it does not exist

get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm).

get_variable(name)[source]

Returns the variable given by name, it will not evaluate it.

For evaluation, see DataFrame.evaluate_variable(), see also DataFrame.set_variable()

has_current_row()[source]

Returns True/False if there currently is a picked row.

has_selection(name='default')[source]

Returns True if there is a selection with the given name.

head(n=10)[source]

Return a shallow copy a DataFrame with the first n rows.

head_and_tail_print(n=5)[source]

Display the first and last n elements of a DataFrame.

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]

Count non missing value for expression on an array which represents healpix data.

Parameters
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows

  • healpix_expression – {healpix_max_level}

  • healpix_max_level – {healpix_max_level}

  • healpix_level – {healpix_level}

  • binby – {binby}, these dimension follow the first healpix dimension.

  • limits – {limits}

  • shape – {shape}

  • selection – {selection}

  • delay – {delay}

  • progress – {progress}

Returns

is_category(column)[source]

Returns true if column is a category.

is_local()[source]

Returns True if the DataFrame is local, False when a DataFrame is remote.

is_masked(column)[source]

Return if a column is a masked (numpy.ma) column.

kurtosis(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the kurtosis for the given expression, possible on a grid defined by binby.

Example:

>>> df.kurtosis('vz')
0.33414303
>>> df.kurtosis("vz", binby=["E"], shape=4)
array([0.35286113, 0.14455428, 0.52955107, 5.06716345])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

last(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]

Return the last element of a binned expression, where the values each bin are sorted by order_expression.

Parameters
  • expression – The value to be placed in the bin.

  • order_expression – Order the values in the bins by this expression.

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan and 0, smaller than at 1, and larger at -1

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Ndarray containing the first elements.

Return type

numpy.array

length_original()[source]

the full length of the DataFrame, independent what active_fraction is, or filtering. This is the real length of the underlying ndarrays.

length_unfiltered()[source]

The length of the arrays that should be considered (respecting active range), but without filtering.

limits(expression, value=None, square=False, selection=None, delay=False, progress=None, shape=None)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘minmax’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.limits("x")
array([-128.293991,  271.365997])
>>> df.limits("x", "99.7%")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], "99.7%")
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • value – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage(expression, percentage=99.73, square=False, selection=False, progress=None, delay=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

Example:

>>> df.limits_percentage("x", 90)
array([-12.35081376,  12.14858052]
>>> df.percentile_approx("x", 5), df.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))

NOTE: this value is approximated by calculating the cumulative distribution on a grid. NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code

Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • percentage (float) – Value between 0 and 100

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

Returns

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

materialize(column=None, inplace=False, virtual_column=None)[source]

Turn columns into native CPU format for optimal performance at cost of memory.

Warning

This may use of lot of memory, be mindfull.

Virtual columns will be evaluated immediately, and all real columns will be cached in memory when used for the first time.

Example for virtual column:

>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)

Example with parquet file >>> df = vaex.open(‘somewhatslow.parquet’) >>> df.x.sum() # slow >>> df = df.materialize() >>> df.x.sum() # slow, but will fill the cache >>> df.x.sum() # as fast as possible, will use memory

Parameters
  • column – string or list of strings with column names to materialize, all columns when None

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

  • virtual_column – for backward compatibility

max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the maximum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Example:

>>> df.mean("x")
-0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]

Calculate the median, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’

  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the minimum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.min("x")
array(-128.293991)
>>> df.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.

Example:

>>> df.minmax("x")
array([-128.293991,  271.365997])
>>> df.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
           [ -71.5523682,  146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
           [-5.99972439, -2.00002384],
           [-1.99991322,  1.99998057],
           [ 2.0000093 ,  5.99983597],
           [ 6.0004878 ,  9.99984646]])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]

Calculate/estimate the mode.

mutual_information(x, y=None, dimension=2, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]

Estimate the mutual information between and x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

The x and y arguments can be single expressions of lists of expressions: - If x and y are single expression, it computes the mutual information between x and y; - If x is a list of expressions and y is a single expression, it computes the mutual information between each expression in x and the expression in y; - If x is a list of expressions and y is None, it computes the mutual information matrix amongst all expressions in x; - If x is a list of tuples of length 2, it computes the mutual information for the specified dimension pairs; - If x and y are lists of expressions, it computes the mutual information matrix defined by the two expression lists.

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
>>> df.mutual_information(x=['x', 'y', 'z'])
array([[3.53535106, 0.06893436, 0.11656418],
       [0.06893436, 3.49414866, 0.14089177],
       [0.11656418, 0.14089177, 3.96144906]])
>>> df.mutual_information(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[0.32316291, 0.16110026],
       [0.36573065, 0.17802792],
       [0.35239151, 0.21677695]])
Parameters
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • sort – return mutual information in sorted (descending) order, and also return the correspond list of expressions when sorted is True

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic,

property nbytes

Alias for df.byte_size(), see DataFrame.byte_size().

nop(expression=None, progress=False, delay=False)[source]

Evaluates expression or a list of expressions, and drops the result. Usefull for benchmarking, since vaex is usually lazy.

Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

Returns

None

percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.

Example:

>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
           [-3.61036641],
           [-0.01296306],
           [ 3.56697863],
           [ 7.45838367]])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’

  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

plot2d_contour(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)

Plot conting contours on 2D grid.

Parameters
  • x – {expression}

  • y – {expression}

  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)

  • limits – {limits}

  • shape – {shape}

  • selection – {selection}

  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value

  • figsize – (x, y) tuple passed to plt.figure for setting the figure size

  • xlabel – label of the x-axis (defaults to param x)

  • ylabel – label of the y-axis (defaults to param y)

  • aspect – the aspect ratio of the figure

  • levels – the contour levels to be passed on plt.contour or plt.contourf

  • colorbar – plot a colorbar or not

  • colorbar_label – the label of the colourbar (defaults to param what)

  • colormap – matplotlib colormap to pass on to plt.contour or plt.contourf

  • colors – the colours of the contours

  • linewidths – the widths of the contours

  • linestyles – the style of the contour lines

  • vmin – instead of automatic normalization, scale the data between vmin and vmax

  • vmax – see vmin

  • grid – {grid}

  • show

plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at own risk, requires ipyvolume

plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]

Deprecated: use plot_widget

plot_widget(x, y, limits=None, f='identity', **kwargs)[source]

Deprecated: use df.widget.heatmap

propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]

Propagates uncertainties (full covariance matrix) for a set of virtual columns.

Covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off diagonals (covariance or correlation) by postfixes with “_correlation” or “_corr” for correlation or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)

Example

>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
Parameters
  • columns – list of columns for which to calculate the covariance matrix.

  • depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.

  • cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.

remove_virtual_meta()[source]

Removes the file with the virtual column etc, it does not change the current virtual columns etc.

rename(name, new_name, unique=False)[source]

Renames a column or variable, and rewrite expressions such that they refer to the new name

rolling(window, trim=False, column=None, fill_value=None, edge='right')[source]

Create a vaex.rolling.Rolling rolling window object

Parameters
  • window (int) – Size of the rolling window.

  • trim (bool) – Trim off begin or end of dataframe to avoid missing values

  • column (str or list[str]) – Column name or column names of columns affected (None for all)

  • fill_value (any) – Scalar value to use for data outside of existing rows.

  • edge (str) – Where the edge of the rolling window is for the current row.

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Returns a DataFrame with a random set of rows

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Provide either n or frac.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
Parameters
  • n (int) – number of samples to take (default 1 if frac is None)

  • frac (float) – fractional number of takes to take

  • replace (bool) – If true, a row may be drawn multiple times

  • weights (str or expression) – (unnormalized) probability that a row can be drawn

  • RandomState (int or) – Sets a seed or RandomState for reproducability. When None, a random seed it chosen.

Returns

Returns a new DataFrame with a shallow copy/view of the underlying data

Return type

DataFrame

schema()[source]

Similar to df.dtypes, but returns a dict

schema_arrow(reduce_large=False)[source]

Similar to schema(), but returns an arrow schema

Parameters

reduce_large (bool) – change large_string to normal string

select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.

Selections are recorded in a history tree, per name, undo/redo can be done for them separately.

Parameters
  • boolean_expression (str) – Any valid column expression, with comparison operators

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str) – history tree or selection ‘slot’ to use

  • executor

Returns

select_box(spaces, limits, mode='replace', name='default')[source]

Select a n-dimensional rectangular box bounded by limits.

The following examples are equivalent:

>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters
  • spaces – list of expressions

  • limits – sequence of shape [(x1, x2), (y1, y2)]

  • mode

  • name

Returns

select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]

Select a circular region centred on xc, yc, with a radius of r.

Example:

>>> df.select_circle('x','y',2,3,1)
Parameters
  • x – expression for the x space

  • y – expression for the y space

  • xc – location of the centre of the circle in x

  • yc – location of the centre of the circle in y

  • r – the radius of the circle

  • name – name of the selection

  • mode

Returns

select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]

Select an elliptical region centred on xc, yc, with a certain width, height and angle.

Example:

>>> df.select_ellipse('x','y', 2, -1, 5,1, 30, name='my_ellipse')
Parameters
  • x – expression for the x space

  • y – expression for the y space

  • xc – location of the centre of the ellipse in x

  • yc – location of the centre of the ellipse in y

  • width – the width of the ellipse (diameter)

  • height – the width of the ellipse (diameter)

  • angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis

  • name – name of the selection

  • mode

Returns

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters
  • name (str) –

  • executor

Returns

select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

For performance reasons, a lasso selection is handled differently.

Parameters
  • expression_x (str) – Name/expression for the x coordinate

  • expression_y (str) – Name/expression for the y coordinate

  • xsequence – list of x numbers defining the lasso, together with y

  • ysequence

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str) –

  • executor

Returns

select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]

Create a selection that selects rows having non missing values for all columns in column_names.

The name reflects Pandas, no rows are really dropped, but a mask is kept to keep track of the selection

Parameters
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)

  • drop_masked – drop rows when there is a masked value in any of the columns

  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str) – history tree or selection ‘slot’ to use

Returns

select_nothing(name='default')[source]

Select nothing.

select_rectangle(x, y, limits, mode='replace', name='default')[source]

Select a 2d rectangular box in the space given by x and y, bounded by limits.

Example:

>>> df.select_box('x', 'y', [(0, 10), (0, 1)])
Parameters
  • x – expression for the x space

  • y – expression fo the y space

  • limits – sequence of shape [(x1, x2), (y1, y2)]

  • mode

selected_length()[source]

Returns the number of rows that are selected.

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_redo(name='default', executor=None)[source]

Redo selection, for the name.

selection_undo(name='default', executor=None)[source]

Undo selection, for the name.

set_active_fraction(value)[source]

Sets the active_fraction, set picked row to None, and remove selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active_fraction, set picked row to None, and remove selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_current_row(value)[source]

Set the current row, and emit the signal signal_pick.

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters
  • selection – Selection object

  • name – selection ‘slot’

  • executor

Returns

set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value.

Example

>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
Parameters
  • name – Name of the variable

  • write – write variable to meta file

  • expression – value or expression

shift(periods, column=None, fill_value=None, trim=False, inplace=False)[source]

Shift a column or multiple columns by periods amounts of rows.

Parameters
  • periods (int) – Shift column forward (when positive) or backwards (when negative)

  • column (str or list[str]) – Column or list of columns to shift (default is all).

  • fill_value – Value to use instead of missing values.

  • trim (bool) – Do not include rows that would otherwise have missing values

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

shuffle(random_state=None)[source]

Shuffle order of rows (equivalent to df.sample(frac=1))

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c']), x=np.arange(1,4))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
>>> df.shuffle(random_state=42)
  #  s      x
  0  a      1
  1  b      2
  2  c      3
Parameters

RandomState (int or) – Sets a seed or RandomState for reproducability. When None, a random seed it chosen.

Returns

Returns a new DataFrame with a shallow copy/view of the underlying data

Return type

DataFrame

skew(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the skew for the given expression, possible on a grid defined by binby.

Example:

>>> df.skew("vz")
0.02116528
>>> df.skew("vz", binby=["E"], shape=4)
array([-0.069976  , -0.01003445,  0.05624177, -2.2444322 ])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

sort(by, ascending=True)[source]

Return a sorted DataFrame, sorted by the expression ‘by’.

Both ‘by’ and ‘ascending’ arguments can be lists. Note that missing/nan/NA values will always be pushed to the end, no matter the sorting order.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filtering will be ignored (since they may change), you may want to consider running extract() first.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
Parameters
  • by (str or expression or list of str/expressions) – expression to sort by.

  • ascending (bool or list of bools) – ascending (default, True) or descending (False).

split(into=None)[source]

Returns a list containing ordered subsets of the DataFrame.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split(into=0.3):
...     print(dfs.x.values)
...
[0 1 3]
[3 4 5 6 7 8 9]
>>> for split in df.split(into=[0.2, 0.3, 0.5]):
...     print(dfs.x.values)
[0 1]
[2 3 4]
[5 6 7 8 9]
Parameters

into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.

split_random(into, random_state=None)[source]

Returns a list containing random portions of the DataFrame.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, import numpy as np
>>> np.random.seed(111)
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split_random(into=0.3, random_state=42):
...     print(dfs.x.values)
...
[8 1 5]
[0 7 2 9 4 3 6]
>>> for split in df.split_random(into=[0.2, 0.3, 0.5], random_state=42):
...     print(dfs.x.values)
[8 1]
[5 0 7]
[2 9 4 3 6]
Parameters
  • into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.

  • RandomState (int or) – Sets a seed or RandomState for reproducability. When None, a random seed it chosen.

Returns

A list of DataFrames.

Return type

list

state_load(file, use_active_range=False, keep_columns=None, set_filter=True, trusted=True, fs_options=None, fs=None)[source]

Load a state previously stored by DataFrame.state_write(), see also DataFrame.state_set().

Parameters
  • file (str) – filename (ending in .json or .yaml)

  • use_active_range (bool) – Whether to use the active range or not.

  • keep_columns (list) – List of columns that should be kept if the state to be set contains less columns.

  • set_filter (bool) – Set the filter from the state (default), or leave the filter as it is it.

  • fs_options (dict) – arguments to pass the the file system handler (s3fs or gcsfs)

  • fs – ‘Pass a file system object directly, see vaex.open()

state_write(file, fs_options=None, fs=None)[source]

Write the internal state to a json or yaml file (see DataFrame.state_get())

Example

>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
"virtual_columns": {
    "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
},
"column_names": [
    "x",
    "y",
    "r"
],
"renamed_columns": [],
"variables": {
    "pi": 3.141592653589793,
    "e": 2.718281828459045,
    "km_in_au": 149597870.7,
    "seconds_per_year": 31557600
},
"functions": {},
"selections": {
    "__filter__": null
},
"ucds": {},
"units": {},
"descriptions": {},
"description": null,
"active_range": [
    0,
    1
]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
__filter__: null
ucds: {}
units: {}
variables:
pi: 3.141592653589793
e: 2.718281828459045
km_in_au: 149597870.7
seconds_per_year: 31557600
virtual_columns:
r: (((x ** 2) + (y ** 2)) ** 0.5)
Parameters
  • file (str) – filename (ending in .json or .yaml)

  • fs_options (dict) – arguments to pass the the file system handler (s3fs or gcsfs)

  • fs – ‘Pass a file system object directly, see vaex.open()

std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]

Calculate the standard deviation for the given expression, possible on a grid defined by binby

>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]

Calculate the sum for the given expression, possible on a grid defined by binby

Example:

>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
                 1.40008776e+08])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

tail(n=10)[source]

Return a shallow copy a DataFrame with the last n rows.

take(indices, filtered=True, dropfilter=True)[source]

Returns a DataFrame containing only rows indexed by indices

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
 #  s      x
 0  a      1
 1  c      3
Parameters
  • indices – sequence (list or numpy array) with row numbers

  • filtered – (for internal use) The indices refer to the filtered data.

  • dropfilter – (for internal use) Drop the filter, set to False when indices refer to unfiltered, but may contain rows that still need to be filtered out.

Returns

DataFrame which is a shallow copy of the original data.

Return type

DataFrame

to_arrays(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]

Return a list of ndarrays

Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

list of arrays

to_arrow_table(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, reduce_large=False)[source]

Returns an arrow Table object containing the arrays corresponding to the evaluated data

Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • reduce_large (bool) – If possible, cast large_string to normal string

Returns

pyarrow.Table object or iterator of

to_astropy_table(column_names=None, selection=None, strings=True, virtual=True, index=None, parallel=True)[source]

Returns a astropy table object containing the ndarrays corresponding to the evaluated data

Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • index – if this column is given it is used for the index of the DataFrame

Returns

astropy.table.Table object

to_dask_array(chunks='auto')[source]

Lazily expose the DataFrame as a dask.array

Example

>>> df = vaex.example()
>>> A = df[['x', 'y', 'z']].to_dask_array()
>>> A
dask.array<vaex-df-1f048b40-10ec-11ea-9553, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
>>> A+1
dask.array<add, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
Parameters

chunks – How to chunk the array, similar to dask.array.from_array().

Returns

dask.array.Array object.

to_dict(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

dict

to_items(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]

Return a list of [(column_name, ndarray), …)] pairs where the ndarray corresponds to the evaluated data

Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

list of (name, ndarray) pairs or iterator of

to_pandas_df(column_names=None, selection=None, strings=True, virtual=True, index_name=None, parallel=True, chunk_size=None, array_type=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index is given, that column is used for the index of the dataframe.

Example

>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
Parameters
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • index_column – if this column is given it is used for the index of the DataFrame

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

pandas.DataFrame object or iterator of

to_records(index=None, selection=None, column_names=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type='python')[source]

Return a list of [{column_name: value}, …)] “records” where each dict is an evaluated row.

Parameters
  • index – an index to use to get the record of a specific row when provided

  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in lenght of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

list of [{column_name:value}, …] records

trim(inplace=False)[source]

Return a DataFrame, where all columns are ‘trimmed’ by the active range.

For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Parameters

inplace – If True, make modifications to self, otherwise return a new DataFrame

Return type

DataFrame

ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd.

Prefixed with a ^, it will only match the first part of the ucd.

Example

>>> df.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> df.ucd_find('pos.eq.ra', 'doesnotexist')
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')]
'dec'
>>> df.ucd_find('^meta.main')]
unique(expression, return_inverse=False, dropna=False, dropnan=False, dropmissing=False, progress=False, selection=None, axis=None, delay=False, limit=None, limit_raise=True, array_type='python')[source]

Returns all unique values.

Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • return_inverse – Return the inverse mapping from unique values to original values.

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • limit (int) – Limit the amount of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

unit(expression, default=None)[source]

Returns the unit (an astropy.unit.Units object) for the expression.

Example

>>> import vaex
>>> ds = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
Parameters
  • expression – Expression, which can be a column name

  • default – if no unit is known, it will return this

Returns

The resulting unit of the expression

Return type

astropy.units.Unit

validate_expression(expression)[source]

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]

Calculate the sample variance for the given expression, possible on a grid defined by binby

Example:

>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

DataFrameLocal class

class vaex.dataframe.DataFrameLocal(dataset=None, name=None)[source]

Bases: vaex.dataframe.DataFrame

Base class for DataFrames that work with local file/data

__array__(dtype=None, parallel=True)[source]

Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.

Note this returns the same result as:

>>> np.array(ds)

If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).

__call__(*expressions, **kwargs)[source]

The local implementation of DataFrame.__call__()

__init__(dataset=None, name=None)[source]
as_arrow()[source]

Lazily cast all columns to arrow, except object types.

as_numpy(strict=False)[source]

Lazily cast all numerical columns to numpy.

If strict is True, it will also cast non-numerical types.

binby(by=None, agg=None, sort=False, copy=True, delay=False, progress=None)[source]

Return a BinBy or DataArray object when agg is not None

The binby operation does not return a ‘flat’ DataFrame, instead it returns an N-d grid in the form of an xarray.

Parameters
  • agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names, and the values the operations, or the a list of aggregates. When not given, it will return the binby object.

  • copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.

  • delay (bool) – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

DataArray or BinBy object.

categorize(column, min_value=0, max_value=None, labels=None, inplace=False)[source]

Mark column as categorical.

This may help speed up calculations using integer columns between a range of [min_value, max_value].

If max_value is not given, the [min_value and max_value] are calcuated from the data.

Example:

>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> df
  #    year    weekday
  0    2012          0
  1    2015          4
  2    2019          6
>>> df.is_category('year')
True
Parameters
  • column – column to assume is categorical.

  • labels – labels to associate to the values between min_value and max_value

  • min_value – minimum integer value (if max_value is not given, this is calculated)

  • max_value – maximum integer value (if max_value is not given, this is calculated)

  • labels – Labels to associate to each value, list(range(min_value, max_value+1)) by default

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]

Compare two DataFrames and report their difference, use with care for large DataFrames

concat(*others, resolver='flexible') vaex.dataframe.DataFrame[source]

Concatenates multiple DataFrames, adding the rows of the other DataFrame to the current, returned in a new DataFrame.

In the case of resolver=’flexible’, when not all columns has the same names, the missing data is filled with missing values.

In the case of resolver=’strict’ all datasets need to have matching column names.

Parameters
  • others – The other DataFrames that are concatenated with this DataFrame

  • resolver (str) – How to resolve schema conflicts, ‘flexible’ or ‘strict’.

Returns

New DataFrame with the rows concatenated

copy(column_names=None, treeshake=False)[source]

Make a shallow copy of a DataFrame. One can also specify a subset of columns.

This is a fairly cheap operation, since no memory copies of the underlying data are made.

{note_copy}

Parameters
  • column_names (list) – A subset of columns to use for the DataFrame copy. If None, all the columns are copied.

  • treeshake (bool) – Get rid of variables not used.

Return type

DataFrame

property data

Gives direct access to the data as numpy arrays.

Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real columns (i.e. no virtual) columns can be accessed, for getting the data from virtual columns, use DataFrame.evaluate(…).

Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.

Example:

>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
export(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]

Exports the DataFrame to a file depending on the file extension.

E.g if the filename ends on .hdf5, df.export_hdf5 is called.

Parameters
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration, if supported.

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

Returns

export_arrow(to, progress=None, chunk_size=1048576, parallel=True, reduce_large=True, fs_options=None, fs=None, as_stream=True)[source]

Exports the DataFrame to a file of stream written with arrow

Parameters
  • to – filename, file object, or pyarrow.RecordBatchStreamWriter, py:data:pyarrow.RecordBatchFileWriter or pyarrow.parquet.ParquetWriter

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • reduce_large (bool) – If True, convert arrow large_string type to string type

  • as_stream (bool) – Write as an Arrow stream if true, else a file. see also https://arrow.apache.org/docs/format/Columnar.html?highlight=arrow1#ipc-file-format

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

Returns

export_csv(path, progress=None, chunk_size=1048576, parallel=True, backend='pandas', **kwargs)[source]

Exports the DataFrame to a CSV file.

Parameters
  • path (str) – path to the file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • backend (str) – Which backend to use, either ‘pandas’ or ‘arrow’. Arrow is considerably faster, but pandas is more flexible.

  • kwargs – additional keyword arguments are passed to the the backends. See DataFrameLocal.export_csv_pandas() and DataFrameLocal.export_csv_arrow() for more details.

export_csv_arrow(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]

Exports the DataFrame to a CSV file via PyArrow.

Parameters
  • to (str) – path to the file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

export_csv_pandas(path, progress=None, chunk_size=1048576, parallel=True, **kwargs)[source]

Exports the DataFrame to a CSV file via the Pandas.

Parameters
  • path (str) – Path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel – Evaluate the (virtual) columns in parallel

  • kwargs – Extra keyword arguments to be passed on pandas.DataFrame.to_csv()

export_feather(to, parallel=True, reduce_large=True, compression='lz4', fs_options=None, fs=None)[source]

Exports the DataFrame to an arrow file using the feather file format version 2

Feather is exactly represented as the Arrow IPC file format on disk, but also support compression.

see also https://arrow.apache.org/docs/python/feather.html

Parameters
  • to – filename or file object

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • reduce_large (bool) – If True, convert arrow large_string type to string type

  • compression – Can be one of ‘zstd’, ‘lz4’ or ‘uncompressed’

  • fs_options – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

Returns

export_fits(path, progress=None)[source]

Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format

Parameters
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

export_hdf5(path, byteorder='=', progress=None, chunk_size=1048576, parallel=True, column_count=1, writer_threads=0, group='/table', mode='w')[source]

Exports the DataFrame to a vaex hdf5 file

Parameters
  • path (str) – path for file

  • byteorder (str) – = for native, < for little endian and > for big endian

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • column_count (int) – How many columns to evaluate and export in parallel (>1 requires fast random access, like and SSD drive).

  • writer_threads (int) – Use threads for writing or not, only useful when column_count > 1.

  • group (str) – Write the data into a custom group in the hdf5 file.

  • mode (str) – If set to “w” (write), an existing file will be overwritten. If set to “a”, one can append additional data to the hdf5 file, but it needs to be in a different group.

Returns

export_json(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]

Exports the DataFrame to a CSV file.

Parameters
  • to – filename or file object

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel – Evaluate the (virtual) columns in parallel

  • fs_options – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

Returns

export_many(path, progress=None, chunk_size=1048576, parallel=True, max_workers=None, fs_options=None, fs=None, **export_kwargs)[source]

Export the DataFrame to multiple files of the same type in parallel.

The path will be formatted using the i parameter (which is the chunk index).

Example:

>>> import vaex
>>> df = vaex.open('my_big_dataset.hdf5')
>>> print(f'number of rows: {len(df):,}')
number of rows: 193,938,982
>>> df.export_many(path='my/destination/folder/chunk-{i:03}.arrow')
>>> df_single_chunk = vaex.open('my/destination/folder/chunk-00001.arrow')
>>> print(f'number of rows: {len(df_single_chunk):,}')
number of rows: 1,048,576
>>> df_all_chunks = vaex.open('my/destination/folder/chunk-*.arrow')
>>> print(f'number of rows: {len(df_all_chunks):,}')
number of rows: 193,938,982
Parameters
  • path (str) – Path for file, formatted by chunk index i (e.g. ‘chunk-{i:05}.parquet’)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • max_workers (int) – Number of workers/threads to use for writing in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

export_parquet(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]

Exports the DataFrame to a parquet file.

Note: This may require that all of the data fits into memory (memory mapped data is an exception).

Parameters
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

  • kwargs – Extra keyword arguments to be passed on to py:data:pyarrow.parquet.ParquetWriter.

Returns

export_partitioned(path, by, directory_format='{key}={value}', progress=None, chunk_size=1048576, parallel=True, fs_options={}, fs=None)[source]

Expertimental: export files using hive partitioning.

If no extension is found in the path, we assume parquet files. Otherwise you can specify the format like an format-string. Where {i} is a zero based index, {uuid} a unique id, and {subdir} the Hive key=value directory.

Example paths:
  • ‘/some/dir/{subdir}/{i}.parquet’

  • ‘/some/dir/{subdir}/fixed_name.parquet’

  • ‘/some/dir/{subdir}/{uuid}.parquet’

  • ‘/some/dir/{subdir}/{uuid}.parquet’

Parameters
  • path – directory where to write the files to.

  • str (str or list of) – Which column to partition by.

  • directory_format (str) – format string for directories, default ‘{key}={value}’ for Hive layout.

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

groupby(by=None, agg=None, sort=False, ascending=True, assume_sparse='auto', row_limit=None, copy=True, progress=None, delay=False)[source]

Return a GroupBy or DataFrame object when agg is not None

Examples:

>>> import vaex
>>> import numpy as np
>>> np.random.seed(42)
>>> x = np.random.randint(1, 5, 10)
>>> y = x**2
>>> df = vaex.from_arrays(x=x, y=y)
>>> df.groupby(df.x, agg='count')
#    x    y_count
0    3          4
1    4          2
2    1          3
3    2          1
>>> df.groupby(df.x, agg=[vaex.agg.count('y'), vaex.agg.mean('y')])
#    x    y_count    y_mean
0    3          4         9
1    4          2        16
2    1          3         1
3    2          1         4
>>> df.groupby(df.x, agg={'z': [vaex.agg.count('y'), vaex.agg.mean('y')]})
#    x    z_count    z_mean
0    3          4         9
1    4          2        16
2    1          3         1
3    2          1         4

Example using datetime:

>>> import vaex
>>> import numpy as np
>>> t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
>>> y = np.arange(len(t))
>>> df = vaex.from_arrays(t=t, y=y)
>>> df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'y' : 'sum'})
#  t                      y
0  2015-01-01 00:00:00   21
1  2015-01-08 00:00:00   70
2  2015-01-15 00:00:00  119
3  2015-01-22 00:00:00  168
4  2015-01-29 00:00:00   87
Parameters
  • agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names, and the values the operations, or the a list of aggregates. When not given, it will return the groupby object.

  • sort (bool) – Sort columns for which we group by.

  • ascending (bool or list of bools) – ascending (default, True) or descending (False).

  • assume_sparse (bool or str) – Assume that when grouping by multiple keys, that the existing pairs are sparse compared to the cartesian product. If ‘auto’, let vaex decide (e.g. a groupby with 10_000 rows but only 4*3=12 combinations does not matter much to compress into say 8 existing combinations, and will save another pass over the data)

  • row_limit (int) – Limits the resulting dataframe to the number of rows (default is not to check, only works when assume_sparse is True). Throws a vaex.RowLimitException when the condition is not met.

  • copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.

  • delay (bool) – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns

DataFrame or GroupBy object.

hashed(inplace=False) vaex.dataframe.DataFrame[source]

Return a DataFrame with a hashed dataset

is_local()[source]

The local implementation of DataFrame.evaluate(), always returns True.

join(other, on=None, left_on=None, right_on=None, lprefix='', rprefix='', lsuffix='', rsuffix='', how='left', allow_duplication=False, prime_growth=False, cardinality_other=None, inplace=False)[source]

Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on

If neither on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).

Note: The filters will be ignored when joining, the full DataFrame will be joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows) consider running DataFrame.extract() first.

Example:

>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
Parameters
  • other – Other DataFrame to join with (the right side)

  • on – default key for the left table (self)

  • left_on – key for the left table (self), overrides on

  • right_on – default key for the right table (other), overrides on

  • lprefix – prefix to add to the left column names in case of a name collision

  • rprefix – similar for the right

  • lsuffix – suffix to add to the left column names in case of a name collision

  • rsuffix – similar for the right

  • how – how to join, ‘left’ keeps all rows on the left, and adds columns (with possible missing values) ‘right’ is similar with self and other swapped. ‘inner’ will only return rows which overlap.

  • allow_duplication (bool) – Allow duplication of rows when the joined column contains non-unique values.

  • cardinality_other (int) – Number of unique elements (or estimate of) for the other table.

  • prime_growth (bool) – Growth strategy for the hashmaps used internally, can improve performance in some case (e.g. integers with low bits unused).

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

Returns

label_encode(column, values=None, inplace=False, lazy=False)

Deprecated: use ordinal_encode

Encode column as ordinal values and mark it as categorical.

The existing column is renamed to a hidden column and replaced by a numerical columns with values between [0, len(values)-1].

param lazy

When False, it will materialize the ordinal codes.

length(selection=False)[source]

Get the length of the DataFrames, for the selection of the whole DataFrame.

If selection is False, it returns len(df).

TODO: Implement this in DataFrameRemote, and move the method up in DataFrame.length()

Parameters

selection – When True, will return the number of selected rows

Returns

ordinal_encode(column, values=None, inplace=False, lazy=False)[source]

Deprecated: use ordinal_encode

Encode column as ordinal values and mark it as categorical.

The existing column is renamed to a hidden column and replaced by a numerical columns with values between [0, len(values)-1].

param lazy

When False, it will materialize the ordinal codes.

selected_length(selection='default')[source]

The local implementation of DataFrame.selected_length()

shallow_copy(virtual=True, variables=True)[source]

Creates a (shallow) copy of the DataFrame.

It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc.

property values

Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.

Note this returns the same result as:

>>> np.array(ds)

If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).

Date/time operations

class vaex.expression.DateTime(expression)[source]

Bases: object

DateTime operations

Usually accessed using e.g. df.birthday.dt.dayofweek

__init__(expression)[source]
__weakref__

list of weak references to the object (if defined)

property date

Return the date part of the datetime value

Returns

an expression containing the date portion of a datetime value

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.date
Expression = dt_date(date)
Length: 3 dtype: datetime64[D] (expression)
-------------------------------------------
0  2009-10-12
1  2016-02-11
2  2015-11-12
property day

Extracts the day from a datetime sample.

Returns

an expression containing the day extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day
Expression = dt_day(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  12
1  11
2  12
property day_name

Returns the day names of a datetime sample in English.

Returns

an expression containing the day names extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day_name
Expression = dt_day_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0    Monday
1  Thursday
2  Thursday
property dayofweek

Obtain the day of the week with Monday=0 and Sunday=6

Returns

an expression containing the day of week.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofweek
Expression = dt_dayofweek(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  0
1  3
2  3
property dayofyear

The ordinal day of the year.

Returns

an expression containing the ordinal day of the year.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofyear
Expression = dt_dayofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  285
1   42
2  316
floor(freq, *args)

Perform floor operation on an expression for a given frequency.

Parameters

freq – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second), or ‘H’ (hour), but not ‘ME’ (month end).

Returns

an expression containing the floored datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.floor("H")
Expression = dt_floor(date, 'H')
Length: 3 dtype: datetime64[ns] (expression)
--------------------------------------------
0  2009-10-12 03:00:00.000000000
1  2016-02-11 10:00:00.000000000
2  2015-11-12 11:00:00.000000000
property halfyear

Return the half-year of the date. Values can be 1 and 2, for the first and second half of the year respectively.

Returns

an expression containing the half-year extracted from the datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.halfyear
Expression = dt_halfyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  2
1  1
2  2
property hour

Extracts the hour out of a datetime samples.

Returns

an expression containing the hour extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.hour
Expression = dt_hour(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0   3
1  10
2  11
property is_leap_year

Check whether a year is a leap year.

Returns

an expression which evaluates to True if a year is a leap year, and to False otherwise.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.is_leap_year
Expression = dt_is_leap_year(date)
Length: 3 dtype: bool (expression)
----------------------------------
0  False
1   True
2  False
property minute

Extracts the minute out of a datetime samples.

Returns

an expression containing the minute extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.minute
Expression = dt_minute(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  31
1  17
2  34
property month

Extracts the month out of a datetime sample.

Returns

an expression containing the month extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.month
Expression = dt_month(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  10
1   2
2  11
property month_name

Returns the month names of a datetime sample in English.

Returns

an expression containing the month names extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.month_name
Expression = dt_month_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0   October
1  February
2  November
property quarter

Return the quarter of the date. Values range from 1-4.

Returns

an expression containing the quarter extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.quarter
Expression = dt_quarter(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  4
1  1
2  4
property second

Extracts the second out of a datetime samples.

Returns

an expression containing the second extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.second
Expression = dt_second(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0   0
1  34
2  22
strftime(date_format)

Returns a formatted string from a datetime sample.

Returns

an expression containing a formatted string extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.strftime("%Y-%m")
Expression = dt_strftime(date, '%Y-%m')
Length: 3 dtype: object (expression)
------------------------------------
0  2009-10
1  2016-02
2  2015-11
property weekofyear

Returns the week ordinal of the year.

Returns

an expression containing the week ordinal of the year, extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.weekofyear
Expression = dt_weekofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  42
1   6
2  46
property year

Extracts the year out of a datetime sample.

Returns

an expression containing the year extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.year
Expression = dt_year(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  2009
1  2016
2  2015

Expression class

class vaex.expression.Expression(ds, expression, ast=None, _selection=False)[source]

Bases: object

Expression class

__abs__()[source]

Returns the absolute value of the expression

__bool__()[source]

Cast expression to boolean. Only supports (<expr1> == <expr2> and <expr1> != <expr2>)

The main use case for this is to support assigning to traitlets. e.g.:

>>> bool(expr1 == expr2)

This will return True when expr1 and expr2 are exactly the same (in string representation). And similarly for:

>>> bool(expr != expr2)

All other cases will return True.

__eq__(b)

Return self==value.

__ge__(b)

Return self>=value.

__getitem__(slicer)[source]

Provides row and optional field access (struct arrays) via bracket notation.

Examples:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1, 2, 3], ["a", "b", "c"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array, integer=[5, 6, 7])
>>> df
#       array                       integer
0       {'col1': 1, 'col2': 'a'}        5
1       {'col1': 2, 'col2': 'b'}        6
2       {'col1': 3, 'col2': 'c'}        7
>>> df.integer[1:]
Expression = integer
Length: 2 dtype: int64 (column)
-------------------------------
0  6
1  7
>>> df.array[1:]
Expression = array
Length: 2 dtype: struct<col1: int64, col2: string> (column)
-----------------------------------------------------------
0  {'col1': 2, 'col2': 'b'}
1  {'col1': 3, 'col2': 'c'}
>>> df.array[:, "col1"]
Expression = struct_get(array, 'col1')
Length: 3 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
>>> df.array[1:, ["col1"]]
Expression = struct_project(array, ['col1'])
Length: 2 dtype: struct<col1: int64> (expression)
-------------------------------------------------
0  {'col1': 2}
1  {'col1': 3}
>>> df.array[1:, ["col2", "col1"]]
Expression = struct_project(array, ['col2', 'col1'])
Length: 2 dtype: struct<col2: string, col1: int64> (expression)
---------------------------------------------------------------
0  {'col2': 'b', 'col1': 2}
1  {'col2': 'c', 'col1': 3}
__gt__(b)

Return self>value.

__hash__ = None
__init__(ds, expression, ast=None, _selection=False)[source]
__le__(b)

Return self<=value.

__lt__(b)

Return self<value.

__ne__(b)

Return self!=value.

__repr__()[source]

Return repr(self).

__str__()[source]

Return str(self).

__weakref__

list of weak references to the object (if defined)

abs(**kwargs)

Lazy wrapper around numpy.abs

apply(f, vectorize=False, multiprocessing=True)[source]

Apply a function along all values of an Expression.

Shorthand for df.apply(f, arguments=[expression]), see DataFrame.apply()

Example:

>>> df = vaex.example()
>>> df.x
Expression = x
Length: 330,000 dtype: float64 (column)
---------------------------------------
     0  -0.777471
     1    3.77427
     2    1.37576
     3   -7.06738
     4   0.243441
>>> def func(x):
...     return x**2
>>> df.x.apply(func)
Expression = lambda_function(x)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
     0   0.604461
     1    14.2451
     2    1.89272
     3    49.9478
     4  0.0592637
Parameters
  • f – A function to be applied on the Expression values

  • vectorize – Call f with arrays instead of a scalars (for better performance).

  • multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).

Returns

A function that is lazily evaluated when called.

arccos(**kwargs)

Lazy wrapper around numpy.arccos

arccosh(**kwargs)

Lazy wrapper around numpy.arccosh

arcsin(**kwargs)

Lazy wrapper around numpy.arcsin

arcsinh(**kwargs)

Lazy wrapper around numpy.arcsinh

arctan(**kwargs)

Lazy wrapper around numpy.arctan

arctan2(**kwargs)

Lazy wrapper around numpy.arctan2

arctanh(**kwargs)

Lazy wrapper around numpy.arctanh

as_arrow()

Lazily convert to Apache Arrow array type

as_numpy(strict=False)

Lazily convert to NumPy ndarray type

property ast

Returns the abstract syntax tree (AST) of the expression

clip(**kwargs)

Lazy wrapper around numpy.clip

copy(df=None)[source]

Efficiently copies an expression.

Expression objects have both a string and AST representation. Creating the AST representation involves parsing the expression, which is expensive.

Using copy will deepcopy the AST when the expression was already parsed.

Parameters

df – DataFrame for which the expression will be evaluated (self.df if None)

cos(**kwargs)

Lazy wrapper around numpy.cos

cosh(**kwargs)

Lazy wrapper around numpy.cosh

count(binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]

Shortcut for ds.count(expression, …), see Dataset.count

countmissing()[source]

Returns the number of missing values in the expression.

countna()[source]

Returns the number of Not Availiable (N/A) values in the expression. This includes missing values and np.nan values.

countnan()[source]

Returns the number of NaN values in the expression.

deg2rad(**kwargs)

Lazy wrapper around numpy.deg2rad

dependencies()[source]

Get all dependencies of this expression, including ourselves

digitize(**kwargs)

Lazy wrapper around numpy.digitize

dot_product(b)

Compute the dot product between a and b.

Parameters
  • a – A list of Expressions or a list of values (e.g. a vector)

  • b – A list of Expressions or a list of values (e.g. a vector)

Returns

Vaex expression

property dt

Gives access to datetime operations via DateTime

exp(**kwargs)

Lazy wrapper around numpy.exp

expand(stop=[])[source]

Expand the expression such that no virtual columns occurs, only normal columns.

Example:

>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
>>> r.expand().expression
'sqrt(((x ** 2) + (y ** 2)))'
expm1(**kwargs)

Lazy wrapper around numpy.expm1

fillmissing(value)[source]

Returns an array where missing values are replaced by value.

See :ismissing for the definition of missing values.

fillna(value)

Returns an array where NA values are replaced by value. See :isna for the definition of missing values.

fillnan(value)

Returns an array where nan values are replaced by value. See :isnan for the definition of missing values.

format(format)

Uses http://www.cplusplus.com/reference/string/to_string/ for formatting

hashmap_apply(hashmap, check_missing=False)

Apply values to hashmap, if check_missing is True, missing values in the hashmap will translated to null/missing values

isfinite(**kwargs)

Lazy wrapper around numpy.isfinite

isin(values, use_hashmap=True)[source]

Lazily tests if each value in the expression is present in values.

Parameters
  • values – List/array of values to check

  • use_hashmap – use a hashmap or not (especially faster when values contains many elements)

Returns

Expression with the lazy expression.

isinf(**kwargs)

Lazy wrapper around numpy.isinf

ismissing()

Returns True where there are missing values (masked arrays), missing strings or None

isna()

Returns a boolean expression indicating if the values are Not Availiable (missing or NaN).

isnan()

Returns an array where there are NaN values

kurtosis(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for df.kurtosis(expression, …), see DataFrame.kurtosis

log(**kwargs)

Lazy wrapper around numpy.log

log10(**kwargs)

Lazy wrapper around numpy.log10

log1p(**kwargs)

Lazy wrapper around numpy.log1p

map(mapper, nan_value=None, missing_value=None, default_value=None, allow_missing=False, axis=None)[source]

Map values of an expression or in memory column according to an input dictionary or a custom callable function.

Example:

>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'red', 'blue', 'red', 'green'])
>>> mapper = {'red': 1, 'blue': 2, 'green': 3}
>>> df['color_mapped'] = df.color.map(mapper)
>>> df
#  color      color_mapped
0  red                   1
1  red                   1
2  blue                  2
3  red                   1
4  green                 3
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, np.nan])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user', np.nan: 'unknown'})
>>> df
#    type  role
0       0  admin
1       1  maintainer
2       2  user
3       2  user
4       2  user
5     nan  unknown
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, 4])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user'}, default_value='unknown')
>>> df
#    type  role
0       0  admin
1       1  maintainer
2       2  user
3       2  user
4       2  user
5       4  unknown
:param mapper: dict like object used to map the values from keys to values
:param nan_value: value to be used when a nan is present (and not in the mapper)
:param missing_value: value to use used when there is a missing value
:param default_value: value to be used when a value is not in the mapper (like dict.get(key, default))
:param allow_missing: used to signal that values in the mapper should map to a masked array with missing values,
    assumed True when default_value is not None.
:param bool axis: Axis over which to determine the unique elements (None will flatten arrays or lists)
:return: A vaex expression
:rtype: vaex.expression.Expression
property masked

Alias to df.is_masked(expression)

max(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.max(expression, …), see Dataset.max

maximum(**kwargs)

Lazy wrapper around numpy.maximum

mean(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.mean(expression, …), see Dataset.mean

min(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.min(expression, …), see Dataset.min

minimum(**kwargs)

Lazy wrapper around numpy.minimum

minmax(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.minmax(expression, …), see Dataset.minmax

nop()[source]

Evaluates expression, and drop the result, usefull for benchmarking, since vaex is usually lazy

notna()

Opposite of isna

nunique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, progress=None, delay=False)[source]

Counts number of unique values, i.e. len(df.x.unique()) == df.x.nunique().

Parameters
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • limit (int) – Limit the amount of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay (bool) – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

rad2deg(**kwargs)

Lazy wrapper around numpy.rad2deg

round(**kwargs)

Lazy wrapper around numpy.round

searchsorted(**kwargs)

Lazy wrapper around numpy.searchsorted

sin(**kwargs)

Lazy wrapper around numpy.sin

sinc(**kwargs)

Lazy wrapper around numpy.sinc

sinh(**kwargs)

Lazy wrapper around numpy.sinh

skew(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for df.skew(expression, …), see DataFrame.skew

sqrt(**kwargs)

Lazy wrapper around numpy.sqrt

std(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.std(expression, …), see Dataset.std

property str

Gives access to string operations via StringOperations

property str_pandas

Gives access to string operations via StringOperationsPandas (using Pandas Series)

property struct

Gives access to struct operations via StructOperations

sum(axis=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Sum elements over given axis.

If no axis is given, it will sum over all axes.

For non list elements, this is a shortcut for ds.sum(expression, …), see Dataset.sum.

>>> list_data = [1, 2, None], None, [], [1, 3, 4, 5]
>>> df = vaex.from_arrays(some_list=pa.array(list_data))
>>> df.some_list.sum().item()  # will sum over all axis
16
>>> df.some_list.sum(axis=1).tolist()  # sums the list elements
[3, None, 0, 13]
Parameters

axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)

tan(**kwargs)

Lazy wrapper around numpy.tan

tanh(**kwargs)

Lazy wrapper around numpy.tanh

property td

Gives access to timedelta operations via TimeDelta

to_arrow(convert_to_native=False)[source]

Convert to Apache Arrow array (will byteswap/copy if convert_to_native=True).

to_numpy(strict=True)[source]

Return a numpy representation of the data

to_pandas_series()[source]

Return a pandas.Series representation of the expression.

Note: Pandas is likely to make a memory copy of the data.

to_string()

Cast/convert to string, same as expression.astype(‘str’)

tolist(i1=None, i2=None)[source]

Short for expr.evaluate().tolist()

property transient

If this expression is not transient (e.g. on disk) optimizations can be made

unique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, array_type='list', progress=None, delay=False)[source]

Returns all unique values.

Parameters
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • limit (int) – Limit the amount of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.

  • array_type (bool) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay (bool) – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

value_counts(dropna=False, dropnan=False, dropmissing=False, ascending=False, progress=False, axis=None, delay=False)[source]

Computes counts of unique values.

WARNING:
  • If the expression/column is not categorical, it will be converted on the fly

  • dropna is False by default, it is True by default in pandas

Parameters
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • ascending – when False (default) it will report the most frequent occuring item first

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • delay (bool) – Do not return the result, but a proxy for asychronous calculations (currently only for internal use)

Returns

Pandas series containing the counts

var(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Shortcut for ds.std(expression, …), see Dataset.var

variables(ourself=False, expand_virtual=True, include_virtual=True)[source]

Return a set of variables this expression depends on.

Example:

>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
>>> r.variables()
{'x', 'y'}
where(x, y, dtype=None)

Return the values row-wise chosen from x or y depending on the condition.

This a useful function when you want to create based on some condition. If the condition is True, the value from x is taken, and othewise the value from y is taken. An easy way to think about the syntax is df.func.where(“if”, “then”, “else”). Please see the example below.

Note: if the function is used as a method of an expression, that expression is assumed to be the condition.

Parameters
  • condition – An boolean expression

  • x – A single value or an expression, the value passed on if the condition is satisfied.

  • y – A single value or an expression, the value passed on if the condition is not satisfied.

  • dtype – Optionally specify the dtype of the resulting expression

Return type

Expression

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[0, 1, 2, 3])
>>> df['y'] = df.func.where(df.x >=2, df.x, -1)
>>> df
#    x    y
0    0   -1
1    1   -1
2    2    2
3    3    3

Geo operations

class vaex.geo.DataFrameAccessorGeo(df)[source]

Bases: object

Geometry/geographic helper methods

Example:

>>> df_xyz = df.geo.spherical2cartesian(df.longitude, df.latitude, df.distance)
>>> df_xyz.x.mean()
__init__(df)[source]
__weakref__

list of weak references to the object (if defined)

bearing(lon1, lat1, lon2, lat2, bearing='bearing', inplace=False)[source]

Calculates a bearing, based on http://www.movable-type.co.uk/scripts/latlong.html

cartesian2spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position', inplace=False)[source]

Convert cartesian to spherical coordinates.

Parameters
  • x

  • y

  • z

  • alpha

  • delta – name for polar angle, ranges from -90 to 90 (or -pi to pi when radians is True).

  • distance

  • radians

  • center

  • center_name

Returns

cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False, inplace=False)[source]

Convert cartesian to polar coordinates

Parameters
  • x – expression for x

  • y – expression for y

  • radius_out – name for the virtual column for the radius

  • azimuth_out – name for the virtual column for the azimuth angle

  • propagate_uncertainties – {propagate_uncertainties}

  • radians – if True, azimuth is in radians, defaults to degrees

Returns

inside_polygon(y, px, py)

Test if points defined by x and y are inside the polygon px, py

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['inside'] = df.geo.inside_polygon(df.x, df.y, px, py)
>>> df
#    x    y  inside
0    1    2  False
1    2    3  True
2    3    4  False
Parameters
  • x – {expression_one}

  • y – {expression_one}

  • px – list of x coordinates for the polygon

  • px – list of y coordinates for the polygon

Returns

Expression, which is true if point is inside, else false.

inside_polygons(y, pxs, pys, any=True)

Test if points defined by x and y are inside all or any of the the polygons px, py

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['inside'] = df.geo.inside_polygons(df.x, df.y, [px, px + 1], [py, py + 1], any=True)
>>> df
#    x    y  inside
0    1    2  False
1    2    3  True
2    3    4  True
Parameters
  • x – {expression_one}

  • y – {expression_one}

  • pxs – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • pxs – list of N ndarrays with y coordinates for the polygon

  • any – return true if in any polygon, or all polygons

Returns

Expression , which is true if point is inside, else false.

inside_which_polygon(y, pxs, pys)

Find in which polygon (0 based index) a point resides

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['polygon_index'] = df.geo.inside_which_polygon(df.x, df.y, [px, px + 1], [py, py + 1])
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  0
2    3    4  1
Parameters
  • x – {expression_one}

  • y – {expression_one}

  • px – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • px – list of N ndarrays with y coordinates for the polygon

Returns

Expression, 0 based index to which polygon the point belongs (or missing/masked value)

inside_which_polygons(x, y, pxss, pyss=None, any=True)[source]

Find in which set of polygons (0 based index) a point resides.

If any=True, it will be the first matching polygon set index, if any=False, it will be the first index that matches all polygons in the set.

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> polygonA = [px, py]
>>> polygonB = [px + 1, py + 1]
>>> pxs = [[polygonA, polygonB], [polygonA]]
>>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=True)
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  0
2    3    4  0
>>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=False)
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  1
2    3    4  --
Parameters
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • px – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • px – list of N ndarrays with y coordinates for the polygon, if None, the shape of the ndarrays of the last dimention of the x arrays should be 2 (i.e. have the x and y coordinates)

  • any – test if point it in any polygon (logically or), or all polygons (logically and)

Returns

Expression, 0 based index to which polygon the point belongs (or missing/masked value)

project_aitoff(alpha, delta, x, y, radians=True, inplace=False)[source]

Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection

Parameters
  • alpha – azimuth angle

  • delta – polar angle

  • x – output name for x coordinate

  • y – output name for y coordinate

  • radians – input and output in radians (True), or degrees (False)

Returns

project_gnomic(alpha, delta, alpha0=0, delta0=0, x='x', y='y', radians=False, postfix='', inplace=False)[source]

Adds a gnomic projection to the DataFrame

rotation_2d(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False, inplace=False)[source]

Rotation in 2d.

Parameters
  • x (str) – Name/expression of x column

  • y (str) – idem for y

  • xnew (str) – name of transformed x column

  • ynew (str) –

  • angle_degrees (float) – rotation in degrees, anti clockwise

Returns

spherical2cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], radians=False, inplace=False)[source]

Convert spherical to cartesian coordinates.

Parameters
  • alpha

  • delta – polar angle, ranging from the -90 (south pole) to 90 (north pole)

  • distance – radial distance, determines the units of x, y and z

  • xname

  • yname

  • zname

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

  • center

  • radians

Returns

New dataframe (in inplace is False) with new x,y,z columns

velocity_cartesian2polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False, inplace=False)[source]

Convert cartesian to polar velocities.

Parameters
  • x

  • y

  • vx

  • radius_polar – Optional expression for the radius, may lead to a better performance when given.

  • vy

  • vr_out

  • vazimuth_out

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

Returns

velocity_cartesian2spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None, inplace=False)[source]

Convert velocities from a cartesian to a spherical coordinate system

TODO: uncertainty propagation

Parameters
  • x – name of x column (input)

  • y – y

  • z – z

  • vx – vx

  • vy – vy

  • vz – vz

  • vr – name of the column for the radial velocity in the r direction (output)

  • vlong – name of the column for the velocity component in the longitude direction (output)

  • vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)

  • distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance

Returns

velocity_polar2cartesian(x='x', y='y', azimuth=None, vr='vr_polar', vazimuth='vphi_polar', vx_out='vx', vy_out='vy', propagate_uncertainties=False, inplace=False)[source]

Convert cylindrical polar velocities to Cartesian.

Parameters
  • x

  • y

  • azimuth – Optional expression for the azimuth in degrees , may lead to a better performance when given.

  • vr

  • vazimuth

  • vx_out

  • vy_out

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

Logging

Sets up logging for vaex.

See configuration of logging how to configure logging.

vaex.logging.remove_handler()[source]

Disabled logging, remove default hander and add null handler

vaex.logging.reset()[source]

Reset configuration of logging (i.e. remove the default handler)

vaex.logging.set_log_level(loggers=['vaex'], level=10)[source]

set log level to debug

vaex.logging.set_log_level_debug(loggers=['vaex'])[source]

set log level to debug

vaex.logging.set_log_level_error(loggers=['vaex'])[source]

set log level to exception/error

vaex.logging.set_log_level_info(loggers=['vaex'])[source]

set log level to info

vaex.logging.set_log_level_warning(loggers=['vaex'])[source]

set log level to warning

vaex.logging.setup()[source]

Setup logging based on the configuration in vaex.settings

This function is automatically called when importing vaex. If settings are changed, call reset() and this function again to re-apply the settings.

String operations

class vaex.expression.StringOperations(expression)[source]

Bases: object

String operations.

Usually accessed using e.g. df.name.str.lower()

__init__(expression)[source]
__weakref__

list of weak references to the object (if defined)

byte_length()

Returns the number of bytes in a string sample.

Returns

an expression contains the number of bytes in each sample of a string column.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.byte_length()
Expression = str_byte_length(text)
Length: 5 dtype: int64 (expression)
-----------------------------------
0   9
1  11
2   9
3   3
4   4
capitalize()

Capitalize the first letter of a string sample.

Returns

an expression containing the capitalized strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.capitalize()
Expression = str_capitalize(text)
Length: 5 dtype: str (expression)
---------------------------------
0    Something
1  Very pretty
2    Is coming
3          Our
4         Way.
cat(other)

Concatenate two string columns on a row-by-row basis.

Parameters

other (expression) – The expression of the other column to be concatenated.

Returns

an expression containing the concatenated columns.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.cat(df.text)
Expression = str_cat(text, text)
Length: 5 dtype: str (expression)
---------------------------------
0      SomethingSomething
1  very prettyvery pretty
2      is comingis coming
3                  ourour
4                way.way.
center(width, fillchar=' ')

Fills the left and right side of the strings with additional characters, such that the sample has a total of width characters.

Parameters
  • width (int) – The total number of characters of the resulting string sample.

  • fillchar (str) – The character used for filling.

Returns

an expression containing the filled strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.center(width=11, fillchar='!')
Expression = str_center(text, width=11, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0  !Something!
1  very pretty
2  !is coming!
3  !!!!our!!!!
4  !!!!way.!!!
contains(pattern, regex=True)

Check if a string pattern or regex is contained within a sample of a string column.

Parameters
  • pattern (str) – A string or regex pattern

  • regex (bool) – If True,

Returns

an expression which is evaluated to True if the pattern is found in a given sample, and it is False otherwise.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.contains('very')
Expression = str_contains(text, 'very')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1   True
2  False
3  False
4  False
count(pat, regex=False)

Count the occurences of a pattern in sample of a string column.

Parameters
  • pat (str) – A string or regex pattern

  • regex (bool) – If True,

Returns

an expression containing the number of times a pattern is found in each sample.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.count(pat="et", regex=False)
Expression = str_count(text, pat='et', regex=False)
Length: 5 dtype: int64 (expression)
-----------------------------------
0  1
1  1
2  0
3  0
4  0
endswith(pat)

Check if the end of each string sample matches the specified pattern.

Parameters

pat (str) – A string pattern or a regex

Returns

an expression evaluated to True if the pattern is found at the end of a given sample, False otherwise.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.endswith(pat="ing")
Expression = str_endswith(text, pat='ing')
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1  False
2   True
3  False
4  False
equals(y)

Tests if strings x and y are the same

Returns

a boolean expression

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.equals(df.text)
Expression = str_equals(text, text)
Length: 5 dtype: bool (expression)
----------------------------------
0  True
1  True
2  True
3  True
4  True
>>> df.text.str.equals('our')
Expression = str_equals(text, 'our')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3   True
4  False
extract_regex(pattern)

Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).

Parameters

pattern (str) – A regular expression which needs to contain named capture groups, e.g. ‘letter’ and ‘digit’ for the regular expression ‘(?P<letter>[ab])(?P<digit>d)’.

Returns

an expression containing a struct with field names corresponding to capture group identifiers.

Example:

>>> import vaex
>>> email = ["foo@bar.org", "bar@foo.org", "open@source.org", "invalid@address.com"]
>>> df = vaex.from_arrays(email=email)
>>> df
#  email
0  foo@bar.org
1  bar@foo.org
2  open@source.org
3  invalid@address.com
>>> pattern = "(?P<name>.*)@(?P<address>.*)\.org"
>>> df.email.str.extract_regex(pattern=pattern)
Expression = str_extract_regex(email, pattern='(?P<name>.*)@(?P<addres...
Length: 4 dtype: struct<name: string, address: string> (expression)
-------------------------------------------------------------------
0      {'name': 'foo', 'address': 'bar'}
1      {'name': 'bar', 'address': 'foo'}
2  {'name': 'open', 'address': 'source'}
3                                     --
find(sub, start=0, end=None)

Returns the lowest indices in each string in a column, where the provided substring is fully contained between within a sample. If the substring is not found, -1 is returned.

Parameters
  • sub (str) – A substring to be found in the samples

  • start (int) –

  • end (int) –

Returns

an expression containing the lowest indices specifying the start of the substring.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.find(sub="et")
Expression = str_find(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
get(i)

Extract a character from each sample at the specified position from a string column. Note that if the specified position is out of bound of the string sample, this method returns ‘’, while pandas retunrs nan.

Parameters

i (int) – The index location, at which to extract the character.

Returns

an expression containing the extracted characters.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.get(5)
Expression = str_get(text, 5)
Length: 5 dtype: str (expression)
---------------------------------
0    h
1    p
2    m
3
4
index(sub, start=0, end=None)

Returns the lowest indices in each string in a column, where the provided substring is fully contained between within a sample. If the substring is not found, -1 is returned. It is the same as str.find.

Parameters
  • sub (str) – A substring to be found in the samples

  • start (int) –

  • end (int) –

Returns

an expression containing the lowest indices specifying the start of the substring.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.index(sub="et")
Expression = str_find(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
isalnum(ascii=False)

Check if all characters in a string sample are alphanumeric.

Parameters

ascii (bool) – Transform only ascii characters (usually faster).

Returns

an expression evaluated to True if a sample contains only alphanumeric characters, otherwise False.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isalnum()
Expression = str_isalnum(text)
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1  False
2  False
3   True
4  False
isalpha()

Check if all characters in a string sample are alphabetic.

Returns

an expression evaluated to True if a sample contains only alphabetic characters, otherwise False.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isalpha()
Expression = str_isalpha(text)
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1  False
2  False
3   True
4  False
isdigit()

Check if all characters in a string sample are digits.

Returns

an expression evaluated to True if a sample contains only digits, otherwise False.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', '6']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  6
>>> df.text.str.isdigit()
Expression = str_isdigit(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  False
4   True
islower()

Check if all characters in a string sample are lowercase characters.

Returns

an expression evaluated to True if a sample contains only lowercase characters, otherwise False.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.islower()
Expression = str_islower(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1   True
2   True
3   True
4   True
isspace()

Check if all characters in a string sample are whitespaces.

Returns

an expression evaluated to True if a sample contains only whitespaces, otherwise False.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', '      ', ' ']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3
  4
>>> df.text.str.isspace()
Expression = str_isspace(text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3   True
4   True
istitle(ascii=False)

TODO

isupper()

Check if all characters in a string sample are lowercase characters.

Returns

an expression evaluated to True if a sample contains only lowercase characters, otherwise False.

Example:

>>> import vaex
>>> text = ['SOMETHING', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  SOMETHING
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.isupper()
Expression = str_isupper(text)
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1  False
2  False
3  False
4  False
join(sep)

Same as find (difference with pandas is that it does not raise a ValueError)

len()

Returns the length of a string sample.

Returns

an expression contains the length of each sample of a string column.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.len()
Expression = str_len(text)
Length: 5 dtype: int64 (expression)
-----------------------------------
0   9
1  11
2   9
3   3
4   4
ljust(width, fillchar=' ')

Fills the right side of string samples with a specified character such that the strings are right-hand justified.

Parameters
  • width (int) – The minimal width of the strings.

  • fillchar (str) – The character used for filling.

Returns

an expression containing the filled strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.ljust(width=10, fillchar='!')
Expression = str_ljust(text, width=10, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0   Something!
1  very pretty
2   is coming!
3   our!!!!!!!
4   way.!!!!!!
lower()

Converts string samples to lower case.

Returns

an expression containing the converted strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.lower()
Expression = str_lower(text)
Length: 5 dtype: str (expression)
---------------------------------
0    something
1  very pretty
2    is coming
3          our
4         way.
lstrip(to_strip=None)

Remove leading characters from a string sample.

Parameters

to_strip (str) – The string to be removed

Returns

an expression containing the modified string column.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.lstrip(to_strip='very ')
Expression = str_lstrip(text, to_strip='very ')
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1     pretty
2  is coming
3        our
4       way.
match(pattern)

Check if a string sample matches a given regular expression.

Parameters

pattern (str) – a string or regex to match to a string sample.

Returns

an expression which is evaluated to True if a match is found, False otherwise.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.match(pattern='our')
Expression = str_match(text, pattern='our')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3   True
4  False
notequals(y)

Tests if strings x and y are the not same

Returns

a boolean expression

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.notequals(df.text)
Expression = str_notequals(text, text)
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2  False
3  False
4  False
>>> df.text.str.notequals('our')
Expression = str_notequals(text, 'our')
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1   True
2   True
3  False
4   True
pad(width, side='left', fillchar=' ')

Pad strings in a given column.

Parameters
  • width (int) – The total width of the string

  • side (str) – If ‘left’ than pad on the left, if ‘right’ than pad on the right side the string.

  • fillchar (str) – The character used for padding.

Returns

an expression containing the padded strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.pad(width=10, side='left', fillchar='!')
Expression = str_pad(text, width=10, side='left', fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0   !Something
1  very pretty
2   !is coming
3   !!!!!!!our
4   !!!!!!way.
repeat(repeats)

Duplicate each string in a column.

Parameters

repeats (int) – number of times each string sample is to be duplicated.

Returns

an expression containing the duplicated strings

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.repeat(3)
Expression = str_repeat(text, 3)
Length: 5 dtype: str (expression)
---------------------------------
0        SomethingSomethingSomething
1  very prettyvery prettyvery pretty
2        is comingis comingis coming
3                          ourourour
4                       way.way.way.
replace(pat, repl, n=- 1, flags=0, regex=False)

Replace occurences of a pattern/regex in a column with some other string.

Parameters
  • pattern (str) – string or a regex pattern

  • replace (str) – a replacement string

  • n (int) – number of replacements to be made from the start. If -1 make all replacements.

  • flags (int) –

    ??

  • regex (bool) – If True, …?

Returns

an expression containing the string replacements.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.replace(pat='et', repl='__')
Expression = str_replace(text, pat='et', repl='__')
Length: 5 dtype: str (expression)
---------------------------------
0    Som__hing
1  very pr__ty
2    is coming
3          our
4         way.
rfind(sub, start=0, end=None)

Returns the highest indices in each string in a column, where the provided substring is fully contained between within a sample. If the substring is not found, -1 is returned.

Parameters
  • sub (str) – A substring to be found in the samples

  • start (int) –

  • end (int) –

Returns

an expression containing the highest indices specifying the start of the substring.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rfind(sub="et")
Expression = str_rfind(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
rindex(sub, start=0, end=None)

Returns the highest indices in each string in a column, where the provided substring is fully contained between within a sample. If the substring is not found, -1 is returned. Same as str.rfind.

Parameters
  • sub (str) – A substring to be found in the samples

  • start (int) –

  • end (int) –

Returns

an expression containing the highest indices specifying the start of the substring.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rindex(sub="et")
Expression = str_rindex(text, sub='et')
Length: 5 dtype: int64 (expression)
-----------------------------------
0   3
1   7
2  -1
3  -1
4  -1
rjust(width, fillchar=' ')

Fills the left side of string samples with a specified character such that the strings are left-hand justified.

Parameters
  • width (int) – The minimal width of the strings.

  • fillchar (str) – The character used for filling.

Returns

an expression containing the filled strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rjust(width=10, fillchar='!')
Expression = str_rjust(text, width=10, fillchar='!')
Length: 5 dtype: str (expression)
---------------------------------
0   !Something
1  very pretty
2   !is coming
3   !!!!!!!our
4   !!!!!!way.
rstrip(to_strip=None)

Remove trailing characters from a string sample.

Parameters

to_strip (str) – The string to be removed

Returns

an expression containing the modified string column.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.rstrip(to_strip='ing')
Expression = str_rstrip(text, to_strip='ing')
Length: 5 dtype: str (expression)
---------------------------------
0       Someth
1  very pretty
2       is com
3          our
4         way.
slice(start=0, stop=None)

Slice substrings from each string element in a column.

Parameters
  • start (int) – The start position for the slice operation.

  • end (int) – The stop position for the slice operation.

Returns

an expression containing the sliced substrings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.slice(start=2, stop=5)
Expression = str_pandas_slice(text, start=2, stop=5)
Length: 5 dtype: str (expression)
---------------------------------
0  met
1   ry
2   co
3    r
4   y.
startswith(pat)

Check if a start of a string matches a pattern.

Parameters

pat (str) – A string pattern. Regular expressions are not supported.

Returns

an expression which is evaluated to True if the pattern is found at the start of a string sample, False otherwise.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.startswith(pat='is')
Expression = str_startswith(text, pat='is')
Length: 5 dtype: bool (expression)
----------------------------------
0  False
1  False
2   True
3  False
4  False
strip(to_strip=None)

Removes leading and trailing characters.

Strips whitespaces (including new lines), or a set of specified characters from each string saple in a column, both from the left right sides.

Parameters
  • to_strip (str) – The characters to be removed. All combinations of the characters will be removed. If None, it removes whitespaces.

  • returns – an expression containing the modified string samples.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.strip(to_strip='very')
Expression = str_strip(text, to_strip='very')
Length: 5 dtype: str (expression)
---------------------------------
0  Something
1      prett
2  is coming
3         ou
4       way.
title()

Converts all string samples to titlecase.

Returns

an expression containing the converted strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.title()
Expression = str_title(text)
Length: 5 dtype: str (expression)
---------------------------------
0    Something
1  Very Pretty
2    Is Coming
3          Our
4         Way.
upper(ascii=False)

Converts all strings in a column to uppercase.

Parameters

ascii (bool) – Transform only ascii characters (usually faster).

Returns

an expression containing the converted strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.upper()
Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0    SOMETHING
1  VERY PRETTY
2    IS COMING
3          OUR
4         WAY.
zfill(width)

Pad strings in a column by prepanding “0” characters.

Parameters

width (int) – The minimum length of the resulting string. Strings shorter less than width will be prepended with zeros.

Returns

an expression containing the modified strings.

Example:

>>> import vaex
>>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
>>> df = vaex.from_arrays(text=text)
>>> df
  #  text
  0  Something
  1  very pretty
  2  is coming
  3  our
  4  way.
>>> df.text.str.zfill(width=12)
Expression = str_zfill(text, width=12)
Length: 5 dtype: str (expression)
---------------------------------
0  000Something
1  0very pretty
2  000is coming
3  000000000our
4  00000000way.

String (pandas) operations

class vaex.expression.StringOperationsPandas(expression)[source]

Bases: object

String operations using Pandas Series (much slower)

__init__(expression)[source]
__weakref__

list of weak references to the object (if defined)

byte_length(**kwargs)

Wrapper around pandas.Series.byte_length

capitalize(**kwargs)

Wrapper around pandas.Series.capitalize

cat(**kwargs)

Wrapper around pandas.Series.cat

center(**kwargs)

Wrapper around pandas.Series.center

contains(**kwargs)

Wrapper around pandas.Series.contains

count(**kwargs)

Wrapper around pandas.Series.count

endswith(**kwargs)

Wrapper around pandas.Series.endswith

equals(**kwargs)

Wrapper around pandas.Series.equals

extract_regex(**kwargs)

Wrapper around pandas.Series.extract_regex

find(**kwargs)

Wrapper around pandas.Series.find

get(**kwargs)

Wrapper around pandas.Series.get

index(**kwargs)

Wrapper around pandas.Series.index

isalnum(**kwargs)

Wrapper around pandas.Series.isalnum

isalpha(**kwargs)

Wrapper around pandas.Series.isalpha

isdigit(**kwargs)

Wrapper around pandas.Series.isdigit

islower(**kwargs)

Wrapper around pandas.Series.islower

isspace(**kwargs)

Wrapper around pandas.Series.isspace

istitle(**kwargs)

Wrapper around pandas.Series.istitle

isupper(**kwargs)

Wrapper around pandas.Series.isupper

join(**kwargs)

Wrapper around pandas.Series.join

len(**kwargs)

Wrapper around pandas.Series.len

ljust(**kwargs)

Wrapper around pandas.Series.ljust

lower(**kwargs)

Wrapper around pandas.Series.lower

lstrip(**kwargs)

Wrapper around pandas.Series.lstrip

match(**kwargs)

Wrapper around pandas.Series.match

notequals(**kwargs)

Wrapper around pandas.Series.notequals

pad(**kwargs)

Wrapper around pandas.Series.pad

repeat(**kwargs)

Wrapper around pandas.Series.repeat

replace(**kwargs)

Wrapper around pandas.Series.replace

rfind(**kwargs)

Wrapper around pandas.Series.rfind

rindex(**kwargs)

Wrapper around pandas.Series.rindex

rjust(**kwargs)

Wrapper around pandas.Series.rjust

rsplit(**kwargs)

Wrapper around pandas.Series.rsplit

rstrip(**kwargs)

Wrapper around pandas.Series.rstrip

slice(**kwargs)

Wrapper around pandas.Series.slice

split(**kwargs)

Wrapper around pandas.Series.split

startswith(**kwargs)

Wrapper around pandas.Series.startswith

strip(**kwargs)

Wrapper around pandas.Series.strip

title(**kwargs)

Wrapper around pandas.Series.title

upper(**kwargs)

Wrapper around pandas.Series.upper

zfill(**kwargs)

Wrapper around pandas.Series.zfill

Struct (arrow) operations

class vaex.expression.StructOperations(expression)[source]

Bases: collections.abc.Mapping

Struct Array operations.

Usually accessed using e.g. df.name.struct.get(‘field1’)

__getitem__(key)[source]

Return struct field by either field name (string) or index position (index).

In case of ambiguous field names, a LookupError is raised.

__init__(expression)[source]
__len__()[source]

Return the number of struct fields contained in struct array.

__weakref__

list of weak references to the object (if defined)

property dtypes

Return all field names along with corresponding types.

Returns

a pandas series with keys as index and types as values.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array)
>>> df
#       array
0       {'col1': 1, 'col2': 'a'}
1       {'col1': 2, 'col2': 'b'}
>>> df.array.struct.dtypes
col1     int64
col2    string
dtype: object
get(field)

Return a single field from a struct array. You may also use the shorthand notation df.name[:, ‘field’].

Please note, in case of duplicated field labels, a field can’t be uniquely identified. Please use index position based access instead. To get corresponding field indices, please use {{idx: key for idx, key in enumerate(df.array.struct)}}.

Parameters

field ({str, int}) – A string (label) or integer (index position) identifying a struct field.

Returns

an expression containing a struct field.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array)
>>> df
  #  array
  0  {'col1': 1, 'col2': 'a'}
  1  {'col1': 2, 'col2': 'b'}
>>> df.array.struct.get("col1")
Expression = struct_get(array, 'col1')
Length: 2 dtype: int64 (expression)
-----------------------------------
0  1
1  2
>>> df.array.struct.get(0)
Expression = struct_get(array, 0)
Length: 2 dtype: int64 (expression)
-----------------------------------
0  1
1  2
>>> df.array[:, 'col1']
Expression = struct_get(array, 'col1')
Length: 2 dtype: int64 (expression)
-----------------------------------
0  1
1  2
items()[source]

Return all fields with names along with corresponding vaex expressions.

Returns

list of tuples with field names and fields as vaex expressions.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array)
>>> df
#       array
0       {'col1': 1, 'col2': 'a'}
1       {'col1': 2, 'col2': 'b'}
>>> df.array.struct.items()
[('col1',
  Expression = struct_get(array, 0)
  Length: 2 dtype: int64 (expression)
  -----------------------------------
  0  1
  1  2),
 ('col2',
  Expression = struct_get(array, 1)
  Length: 2 dtype: string (expression)
  ------------------------------------
  0  a
  1  b)]
keys()[source]

Return all field names contained in struct array.

Returns

list of field names.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array)
>>> df
#       array
0       {'col1': 1, 'col2': 'a'}
1       {'col1': 2, 'col2': 'b'}
>>> df.array.struct.keys()
["col1", "col2"]
project(fields)

Project one or more fields of a struct array to a new struct array. You may also use the shorthand notation df.name[:, [‘field1’, ‘field2’]].

Parameters

field (list) – A list of strings (label) or integers (index position) identifying one or more fields.

Returns

an expression containing a struct array.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"], [3, 4]], names=["col1", "col2", "col3"])
>>> df = vaex.from_arrays(array=array)
>>> df
  #  array
  0  {'col1': 1, 'col2': 'a', 'col3': 3}
  1  {'col1': 2, 'col2': 'b', 'col3': 4}
>>> df.array.struct.project(["col3", "col1"])
Expression = struct_project(array, ['col3', 'col1'])
Length: 2 dtype: struct<col3: int64, col1: int64> (expression)
--------------------------------------------------------------
0  {'col3': 3, 'col1': 1}
1  {'col3': 4, 'col1': 2}
>>> df.array.struct.project([2, 0])
Expression = struct_project(array, [2, 0])
Length: 2 dtype: struct<col3: int64, col1: int64> (expression)
--------------------------------------------------------------
0  {'col3': 3, 'col1': 1}
1  {'col3': 4, 'col1': 2}
>>> df.array[:, ["col3", "col1"]]
Expression = struct_project(array, ['col3', 'col1'])
Length: 2 dtype: struct<col3: int64, col1: int64> (expression)
--------------------------------------------------------------
0  {'col3': 3, 'col1': 1}
1  {'col3': 4, 'col1': 2}
values()[source]

Return all fields as vaex expressions.

Returns

list of vaex expressions corresponding to each field in struct.

Example:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array)
>>> df
#       array
0       {'col1': 1, 'col2': 'a'}
1       {'col1': 2, 'col2': 'b'}
>>> df.array.struct.values()
[Expression = struct_get(array, 0)
 Length: 2 dtype: int64 (expression)
 -----------------------------------
 0  1
 1  2,
 Expression = struct_get(array, 1)
 Length: 2 dtype: string (expression)
 ------------------------------------
 0  a
 1  b]

Timedelta operations

class vaex.expression.TimeDelta(expression)[source]

Bases: object

TimeDelta operations

Usually accessed using e.g. df.delay.td.days

__init__(expression)[source]
__weakref__

list of weak references to the object (if defined)

property days

Number of days in each timedelta sample.

Returns

an expression containing the number of days in a timedelta sample.

Example:

>>> import vaex
>>> import numpy as np
>>> delta = np.array([17658720110,   11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]')
>>> df = vaex.from_arrays(delta=delta)
>>> df
  #  delta
  0  204 days +9:12:00
  1  1 days +6:41:10
  2  471 days +5:03:56
  3  -22 days +23:31:15
>>> df.delta.td.days
Expression = td_days(delta)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  204
1    1
2  471
3  -22
property microseconds

Number of microseconds (>= 0 and less than 1 second) in each timedelta sample.

Returns

an expression containing the number of microseconds in a timedelta sample.

Example:

>>> import vaex
>>> import numpy as np
>>> delta = np.array([17658720110,   11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]')
>>> df = vaex.from_arrays(delta=delta)
>>> df
  #  delta
  0  204 days +9:12:00
  1  1 days +6:41:10
  2  471 days +5:03:56
  3  -22 days +23:31:15
>>> df.delta.td.microseconds
Expression = td_microseconds(delta)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  290448
1  978582
2   19583
3  709551
property nanoseconds

Number of nanoseconds (>= 0 and less than 1 microsecond) in each timedelta sample.

Returns

an expression containing the number of nanoseconds in a timedelta sample.

Example:

>>> import vaex
>>> import numpy as np
>>> delta = np.array([17658720110,   11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]')
>>> df = vaex.from_arrays(delta=delta)
>>> df
  #  delta
  0  204 days +9:12:00
  1  1 days +6:41:10
  2  471 days +5:03:56
  3  -22 days +23:31:15
>>> df.delta.td.nanoseconds
Expression = td_nanoseconds(delta)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  384
1   16
2  488
3  616
property seconds

Number of seconds (>= 0 and less than 1 day) in each timedelta sample.

Returns

an expression containing the number of seconds in a timedelta sample.

Example:

>>> import vaex
>>> import numpy as np
>>> delta = np.array([17658720110,   11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]')
>>> df = vaex.from_arrays(delta=delta)
>>> df
  #  delta
  0  204 days +9:12:00
  1  1 days +6:41:10
  2  471 days +5:03:56
  3  -22 days +23:31:15
>>> df.delta.td.seconds
Expression = td_seconds(delta)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  30436
1  39086
2  28681
3  23519
total_seconds()

Total duration of each timedelta sample expressed in seconds.

Returns

an expression containing the total number of seconds in a timedelta sample.

Example:

>>> import vaex
>>> import numpy as np
>>> delta = np.array([17658720110,   11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]')
>>> df = vaex.from_arrays(delta=delta)
>>> df
#  delta
0  204 days +9:12:00
1  1 days +6:41:10
2  471 days +5:03:56
3  -22 days +23:31:15
>>> df.delta.td.total_seconds()
Expression = td_total_seconds(delta)
Length: 4 dtype: float64 (expression)
-------------------------------------
0  -7.88024e+08
1  -2.55032e+09
2   6.72134e+08
3   2.85489e+08

vaex-graphql

class vaex.graphql.DataFrameAccessorGraphQL(df)[source]

Bases: object

Exposes a GraphQL layer to a DataFrame

See the GraphQL example for more usage.

The easiest way to learn to use the GraphQL language/vaex interface is to launch a server, and play with the GraphiQL graphical interface, its autocomplete, and the schema explorer.

We try to stay close to the Hasura API: https://docs.hasura.io/1.0/graphql/manual/api-reference/graphql-api/query.html

__init__(df)[source]
__weakref__

list of weak references to the object (if defined)

execute(*args, **kwargs)[source]

Creates a schema, and execute the query (first argument)

query(name='df')[source]

Creates a graphene query object exposing this DataFrame named name

schema(name='df', auto_camelcase=False, **kwargs)[source]

Creates a graphene schema for this DataFrame

serve(port=9001, address='', name='df', verbose=True)[source]

Serve the DataFrame via a http server

vaex-jupyter

class vaex.jupyter.DataFrameAccessorWidget(df)[source]

Bases: object

__init__(df)[source]
__weakref__

list of weak references to the object (if defined)

data_array(axes=[], selection=None, shared=False, display_function=<function display>, **kwargs)[source]

Create a vaex.jupyter.model.DataArray() model and vaex.jupyter.view.DataArray() widget and links them.

This is a convenience method to create the model and view, and hook them up.

execute_debounced()[source]

Schedules an execution of dataframe tasks in the near future (debounced).

expression(value=None, label='Custom expression')[source]

Create a widget to edit a vaex expression.

If value is an vaex.jupyter.model.Axis object, its expression will be (bi-directionally) linked to the widget.

Parameters

value – Valid expression (string or Expression object), or Axis

vaex.jupyter.debounced(delay_seconds=0.5, skip_gather=False, on_error=None, reentrant=True)[source]

A decorator to debounce many method/function call into 1 call.

Note: this only works in an async environment, such as a Jupyter notebook context. Outside of this context, calling flush() will execute pending calls.

Parameters
  • delay_seconds (float) – The amount of seconds that should pass without any call, before the (final) call will be executed.

  • method (bool) – The decorator should know if the callable is a a method or not, otherwise the debounced is on a per-class basis.

  • skip_gather (bool) – The decorated function will be be waited for when calling vaex.jupyter.gather()

  • on_error – callback function that takes an exception as argument.

  • reentrant (bool) – reentrant function or not

vaex.jupyter.flush(recursive_counts=- 1, ignore_exceptions=False, all=False)[source]

Run all non-executed debounced functions.

If execution of debounced calls lead to scheduling of new calls, they will be recursively executed, with a limit or recursive_counts calls. recursive_counts=-1 means infinite.

vaex.jupyter.interactive_selection(df)[source]

vaex.jupyter.model

class vaex.jupyter.model.Axis(*, bin_centers=None, df, exception=None, expression=None, max=None, min=None, shape=None, shape_default=64, slice=None, status=Status.NO_LIMITS, **kwargs)[source]

Bases: vaex.jupyter.model._HasState

class Status(value)[source]

Bases: enum.Enum

State transitions NO_LIMITS -> STAGED_CALCULATING_LIMITS -> CALCULATING_LIMITS -> CALCULATED_LIMITS -> READY

when expression changes:
STAGED_CALCULATING_LIMITS:

calculation.cancel() ->NO_LIMITS

CALCULATING_LIMITS:

calculation.cancel() ->NO_LIMITS

when min/max changes:
STAGED_CALCULATING_LIMITS:

calculation.cancel() ->NO_LIMITS

CALCULATING_LIMITS:

calculation.cancel() ->NO_LIMITS

ABORTED = 7
CALCULATED_LIMITS = 4
CALCULATING_LIMITS = 3
EXCEPTION = 6
NO_LIMITS = 1
READY = 5
STAGED_CALCULATING_LIMITS = 2
bin_centers

A trait which allows any value.

computation()[source]
df

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

exception

A trait which allows any value.

expression
property has_missing_limit
max

A casting version of the float trait.

min

A casting version of the float trait.

on_change_expression(change)[source]
on_change_limits
on_change_shape(change)[source]
on_change_shape_default(change)[source]
shape

A casting version of the int trait.

shape_default

A casting version of the int trait.

slice

A casting version of the int trait.

status

Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.

# -- SINCE: Python 3.4 (or install backport: pip install enum34)
import enum
from traitlets import HasTraits, UseEnum

class Color(enum.Enum):
    red = 1         # -- IMPLICIT: default_value
    blue = 2
    green = 3

class MyEntity(HasTraits):
    color = UseEnum(Color, default_value=Color.blue)

entity = MyEntity(color=Color.red)
entity.color = Color.green    # USE: Enum-value (preferred)
entity.color = "green"        # USE: name (as string)
entity.color = "Color.green"  # USE: scoped-name (as string)
entity.color = 3              # USE: number (as int)
assert entity.color is Color.green
class vaex.jupyter.model.DataArray(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]

Bases: vaex.jupyter.model._HasState

class Status(value)[source]

Bases: enum.Enum

An enumeration.

CALCULATED_GRID = 9
CALCULATED_LIMITS = 5
CALCULATING_GRID = 8
CALCULATING_LIMITS = 4
EXCEPTION = 11
MISSING_LIMITS = 1
NEEDS_CALCULATING_GRID = 6
READY = 10
STAGED_CALCULATING_GRID = 7
STAGED_CALCULATING_LIMITS = 3
axes

An instance of a Python list.

df

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

exception

A trait which allows any value.

grid

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

grid_sliced

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

property has_missing_limits
on_progress_grid(f)[source]
selection

A trait which allows any value.

shape

A casting version of the int trait.

status

Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.

# -- SINCE: Python 3.4 (or install backport: pip install enum34)
import enum
from traitlets import HasTraits, UseEnum

class Color(enum.Enum):
    red = 1         # -- IMPLICIT: default_value
    blue = 2
    green = 3

class MyEntity(HasTraits):
    color = UseEnum(Color, default_value=Color.blue)

entity = MyEntity(color=Color.red)
entity.color = Color.green    # USE: Enum-value (preferred)
entity.color = "green"        # USE: name (as string)
entity.color = "Color.green"  # USE: scoped-name (as string)
entity.color = 3              # USE: number (as int)
assert entity.color is Color.green
status_text

A trait for unicode strings.

class vaex.jupyter.model.GridCalculator(**kwargs: Any)[source]

Bases: vaex.jupyter.model._HasState

A grid is responsible for scheduling the grid calculations and possible slicing

class Status(value)[source]

Bases: enum.Enum

An enumeration.

CALCULATING = 4
READY = 9
STAGED_CALCULATION = 3
VOID = 1
computation()[source]
df

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

model_add(model)[source]
models

An instance of a Python list.

on_regrid(ignore=None)[source]
progress(f)[source]
reslice(source_model=None)[source]
status

Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.

# -- SINCE: Python 3.4 (or install backport: pip install enum34)
import enum
from traitlets import HasTraits, UseEnum

class Color(enum.Enum):
    red = 1         # -- IMPLICIT: default_value
    blue = 2
    green = 3

class MyEntity(HasTraits):
    color = UseEnum(Color, default_value=Color.blue)

entity = MyEntity(color=Color.red)
entity.color = Color.green    # USE: Enum-value (preferred)
entity.color = "green"        # USE: name (as string)
entity.color = "Color.green"  # USE: scoped-name (as string)
entity.color = 3              # USE: number (as int)
assert entity.color is Color.green
class vaex.jupyter.model.Heatmap(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]

Bases: vaex.jupyter.model.DataArray

x

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

y

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

class vaex.jupyter.model.Histogram(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]

Bases: vaex.jupyter.model.DataArray

x

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

vaex.jupyter.view

class vaex.jupyter.view.DataArray(**kwargs: Any)[source]

Bases: vaex.jupyter.view.ViewBase

Will display a DataArray interactively, with an optional custom display_function.

By default, it will simply display(…) the DataArray, using xarray’s default display mechanism.

Public constructor

clear_output

Clear output each time the data changes

display_function

A trait which allows any value.

matplotlib_autoshow

Will call plt.show() inside output context if open figure handles exist

model

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

numpy_errstate

Default numpy errstate during display to avoid showing error messsages, see numpy.errstate

update_output(change=None)[source]
class vaex.jupyter.view.Heatmap(**kwargs: Any)[source]

Bases: vaex.jupyter.view.ViewBase

Public constructor

TOOLS_SUPPORTED = ['pan-zoom', 'select-rect', 'select-x']
blend

A trait for unicode strings.

colormap

A trait for unicode strings.

dimension_alternative

A trait for unicode strings.

dimension_facets

A trait for unicode strings.

dimension_fade

A trait for unicode strings.

model

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

normalize

A boolean (True, False) trait.

supports_normalize = False
supports_transforms = True
tool

A trait for unicode strings.

transform

A trait for unicode strings.

update_heatmap(change=None)[source]
class vaex.jupyter.view.Histogram(**kwargs: Any)[source]

Bases: vaex.jupyter.view.ViewBase

Public constructor

TOOLS_SUPPORTED = ['pan-zoom', 'select-x']
create_plot()[source]
dimension_facets

A trait for unicode strings.

dimension_groups

A trait for unicode strings.

dimension_overplot

A trait for unicode strings.

model

A trait whose value must be an instance of a specified class.

The value can also be an instance of a subclass of the specified class.

Subclasses can declare default classes by overriding the klass attribute

normalize

A boolean (True, False) trait.

supports_normalize = True
supports_transforms = False
transform

A trait for unicode strings.

update_data(change=None)[source]
class vaex.jupyter.view.PieChart(**kwargs: Any)[source]

Bases: vaex.jupyter.view.Histogram

Public constructor

create_plot()[source]
radius_split_fraction = 0.8
class vaex.jupyter.view.ViewBase(**kwargs: Any)[source]

Bases: ipyvuetify.generated.Container.Container

Public constructor

hide_progress()[source]
on_grid_progress(fraction)[source]
select_nothing()[source]
select_rectangle(x1, x2, y1, y2)[source]
select_x_range(x1, x2)[source]
selection_interact

A trait for unicode strings.

selection_mode

A trait for unicode strings.

tool

A trait for unicode strings.

vaex.jupyter.widgets

class vaex.jupyter.widgets.ColumnExpressionAdder(**kwargs: Any)[source]

Bases: vaex.jupyter.widgets.ColumnPicker

Public constructor

component

A trait which allows any value.

target

A trait for unicode strings.

vue_menu_click(data)[source]
class vaex.jupyter.widgets.ColumnList(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate, vaex.jupyter.traitlets.ColumnsMixin

Public constructor

column_filter

A trait for unicode strings.

dialog_open

A boolean (True, False) trait.

editor

A trait which allows any value.

editor_open

A boolean (True, False) trait.

template

A trait for unicode strings.

tooltip

A trait for unicode strings.

valid_expression

A boolean (True, False) trait.

vue_add_virtual_column(data)[source]
vue_column_click(data)[source]
vue_save_column(data)[source]
class vaex.jupyter.widgets.ColumnPicker(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate, vaex.jupyter.traitlets.ColumnsMixin

Public constructor

label

A trait for unicode strings.

template

A trait for unicode strings.

value
class vaex.jupyter.widgets.ColumnSelectionAdder(**kwargs: Any)[source]

Bases: vaex.jupyter.widgets.ColumnPicker

Public constructor

component

A trait which allows any value.

target

A trait for unicode strings.

vue_menu_click(data)[source]
class vaex.jupyter.widgets.ContainerCard(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

card_props

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

controls

An instance of a Python list.

main

A trait which allows any value.

main_props

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

show_controls

A boolean (True, False) trait.

subtitle

A trait for unicode strings.

text

A trait for unicode strings.

title

A trait for unicode strings.

class vaex.jupyter.widgets.Counter(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

characters

An instance of a Python list.

format

A trait for unicode strings.

postfix

A trait for unicode strings.

prefix

A trait for unicode strings.

template

A trait for unicode strings.

value

An int trait.

class vaex.jupyter.widgets.Expression(**kwargs: Any)[source]

Bases: ipyvuetify.generated.TextField.TextField

Public constructor

check_expression()[source]
df

A trait which allows any value.

valid

A boolean (True, False) trait.

value
class vaex.jupyter.widgets.ExpressionSelectionTextArea(**kwargs: Any)[source]

Bases: vaex.jupyter.widgets.Expression

Public constructor

selection_name

A trait which allows any value.

update_custom_selection
update_selection()[source]
vaex.jupyter.widgets.ExpressionTextArea

alias of vaex.jupyter.widgets.Expression

class vaex.jupyter.widgets.Html(**kwargs: Any)[source]

Bases: ipyvuetify.Html.Html

Public constructor

Bases: vaex.jupyter.widgets.VuetifyTemplate

Public constructor

items

An instance of a Python list.

class vaex.jupyter.widgets.PlotTemplate(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

button_text

A trait for unicode strings.

clipped

A boolean (True, False) trait.

components

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

dark

A boolean (True, False) trait.

drawer

A boolean (True, False) trait.

drawers

A trait which allows any value.

floating

A boolean (True, False) trait.

items

An instance of a Python list.

mini

A boolean (True, False) trait.

model

A trait which allows any value.

new_output

A boolean (True, False) trait.

show_output

A boolean (True, False) trait.

template

A trait for unicode strings.

title

A trait for unicode strings.

type

A trait for unicode strings.

class vaex.jupyter.widgets.ProgressCircularNoAnimation(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

v-progress-circular that avoids animations

Public constructor

color

A trait for unicode strings.

hidden

A boolean (True, False) trait.

parts

An instance of a Python list.

size

An int trait.

template

A trait for unicode strings.

text

A trait for unicode strings.

value

A float trait.

width

An int trait.

class vaex.jupyter.widgets.Selection(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

df

A trait which allows any value.

name

A trait for unicode strings.

value

A trait for unicode strings.

class vaex.jupyter.widgets.SelectionEditor(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

adder

A trait which allows any value.

components

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

df

A trait which allows any value.

input

A trait which allows any value.

on_close

A trait which allows any value.

template

A trait for unicode strings.

class vaex.jupyter.widgets.SelectionToggleList(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

df

A trait which allows any value.

selection_names

An instance of a Python list.

title

A trait for unicode strings.

value

An instance of a Python list.

class vaex.jupyter.widgets.SettingsEditor(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

schema

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

template_file = '/home/docs/checkouts/readthedocs.org/user_builds/vaex/envs/latest/lib/python3.7/site-packages/vaex/jupyter/vue/vjsf.vue'
valid

A boolean (True, False) trait.

values

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

vjsf_loaded

A boolean (True, False) trait.

class vaex.jupyter.widgets.Status(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

template

A trait for unicode strings.

value

A trait for unicode strings.

class vaex.jupyter.widgets.ToolsSpeedDial(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

children

An instance of a Python list.

expand

A boolean (True, False) trait.

items

A trait which allows any value.

template

A trait for unicode strings.

value

A trait for unicode strings.

vue_action(data)[source]
class vaex.jupyter.widgets.ToolsToolbar(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

interact_items

A trait which allows any value.

interact_value

A trait for unicode strings.

normalize

A boolean (True, False) trait.

selection_mode

A trait for unicode strings.

selection_mode_items

A trait which allows any value.

supports_normalize

A boolean (True, False) trait.

supports_transforms

A boolean (True, False) trait.

transform_items

An instance of a Python list.

transform_value

A trait for unicode strings.

z_normalize

A boolean (True, False) trait.

class vaex.jupyter.widgets.UsesVaexComponents(**kwargs: Any)[source]

Bases: traitlets.traitlets.HasTraits

class vaex.jupyter.widgets.VirtualColumnEditor(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

adder

A trait which allows any value.

column_name

A trait for unicode strings.

components

An instance of a Python dict.

One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.

Changed in version 5.0: Added key_trait for validating dict keys.

Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.

df

A trait which allows any value.

editor

A trait which allows any value.

on_close

A trait which allows any value.

save_column()[source]
template

A trait for unicode strings.

class vaex.jupyter.widgets.VuetifyTemplate(**kwargs: Any)[source]

Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate

Public constructor

vaex.jupyter.widgets.component(name)[source]
vaex.jupyter.widgets.load_template(filename)[source]
vaex.jupyter.widgets.watch()[source]

vaex-ml

See the ML tutorial an introduction, and the ML examples for more advanced usage.

Transformers & Encoders

vaex.ml.transformations.FrequencyEncoder(...)

Encode categorical columns by the frequency of their respective samples.

vaex.ml.transformations.LabelEncoder(**kwargs)

Encode categorical columns with integer values between 0 and num_classes-1.

vaex.ml.transformations.MaxAbsScaler(**kwargs)

Scale features by their maximum absolute value.

vaex.ml.transformations.MinMaxScaler(**kwargs)

Will scale a set of features to a given range.

vaex.ml.transformations.OneHotEncoder(**kwargs)

Encode categorical columns according ot the One-Hot scheme.

vaex.ml.transformations.MultiHotEncoder(**kwargs)

Encode categorical columns according to a binary multi-hot scheme.

vaex.ml.transformations.PCA(**kwargs)

Transform a set of features using a Principal Component Analysis.

vaex.ml.transformations.RobustScaler(**kwargs)

The RobustScaler removes the median and scales the data according to a given percentile range.

vaex.ml.transformations.StandardScaler(**kwargs)

Standardize features by removing thir mean and scaling them to unit variance.

vaex.ml.transformations.CycleTransformer(...)

A strategy for transforming cyclical features (e.g.

vaex.ml.transformations.BayesianTargetEncoder(...)

Encode categorical variables with a Bayesian Target Encoder.

vaex.ml.transformations.WeightOfEvidenceEncoder(...)

Encode categorical variables with a Weight of Evidence Encoder.

vaex.ml.transformations.KBinsDiscretizer(...)

Bin continous features into discrete bins.

vaex.ml.transformations.GroupByTransformer(...)

The GroupByTransformer creates aggregations via the groupby operation, which are joined to a DataFrame.

class vaex.ml.transformations.FrequencyEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical columns by the frequency of their respective samples.

Example:

>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red', 'green'])
>>> df
 #  color
 0  red
 1  green
 2  green
 3  blue
 4  red
>>> encoder = vaex.ml.FrequencyEncoder(features=['color'])
>>> encoder.fit_transform(df)
 #  color      frequency_encoded_color
 0  red                       0.333333
 1  green                     0.5
 2  green                     0.5
 3  blue                      0.166667
 4  red                       0.333333
 5  green                     0.5
Parameters
  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

  • unseen – Strategy to deal with unseen values.

fit(df)[source]

Fit FrequencyEncoder to the DataFrame.

Parameters

df – A vaex DataFrame.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted FrequencyEncoder.

Parameters

df – A vaex DataFrame.

Returns

A shallow copy of the DataFrame that includes the encodings.

Return type

DataFrame

unseen

Strategy to deal with unseen values.

class vaex.ml.transformations.LabelEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical columns with integer values between 0 and num_classes-1.

Example:

>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
 #  color
 0  red
 1  green
 2  green
 3  blue
 4  red
>>> encoder = vaex.ml.LabelEncoder(features=['color'])
>>> encoder.fit_transform(df)
 #  color      label_encoded_color
 0  red                          2
 1  green                        1
 2  green                        1
 3  blue                         0
 4  red                          2
Parameters
  • allow_unseen – If True, unseen values will be encoded with -1, otherwise an error is raised

  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

allow_unseen

If True, unseen values will be encoded with -1, otherwise an error is raised

fit(df)[source]

Fit LabelEncoder to the DataFrame.

Parameters

df – A vaex DataFrame.

labels_

The encoded labels of each feature.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted LabelEncoder.

Parameters

df – A vaex DataFrame.

Returns: :return copy: A shallow copy of the DataFrame that includes the encodings. :rtype: DataFrame

class vaex.ml.transformations.MaxAbsScaler(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Scale features by their maximum absolute value.

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
 #    x    y
 0    2   -2
 1    5    3
 2    7    0
 3    2    0
 4   15   10
>>> scaler = vaex.ml.MaxAbsScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
 #    x    y    absmax_scaled_x    absmax_scaled_y
 0    2   -2           0.133333               -0.2
 1    5    3           0.333333                0.3
 2    7    0           0.466667                0
 3    2    0           0.133333                0
 4   15   10           1                       1
Parameters
  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

absmax_

Tha maximum absolute value of a feature.

fit(df)[source]

Fit MinMaxScaler to the DataFrame.

Parameters

df – A vaex DataFrame.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted MaxAbsScaler.

Parameters

df – A vaex DataFrame.

Return copy

a shallow copy of the DataFrame that includes the scaled features.

Return type

DataFrame

class vaex.ml.transformations.MinMaxScaler(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Will scale a set of features to a given range.

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
 #    x    y
 0    2   -2
 1    5    3
 2    7    0
 3    2    0
 4   15   10
>>> scaler = vaex.ml.MinMaxScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
 #    x    y    minmax_scaled_x    minmax_scaled_y
 0    2   -2           0                  0
 1    5    3           0.230769           0.416667
 2    7    0           0.384615           0.166667
 3    2    0           0                  0.166667
 4   15   10           1                  1
Parameters
  • feature_range – The range the features are scaled to.

  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

feature_range

The range the features are scaled to.

fit(df)[source]

Fit MinMaxScaler to the DataFrame.

Parameters

df – A vaex DataFrame.

fmax_

The minimum value of a feature.

fmin_

The maximum value of a feature.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted MinMaxScaler.

Parameters

df – A vaex DataFrame.

Return copy

a shallow copy of the DataFrame that includes the scaled features.

Return type

DataFrame

class vaex.ml.transformations.OneHotEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical columns according ot the One-Hot scheme.

Example:

>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
 #  color®
 0  red
 1  green
 2  green
 3  blue
 4  red
>>> encoder = vaex.ml.OneHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
 #  color      color_blue    color_green    color_red
 0  red                 0              0            1
 1  green               0              1            0
 2  green               0              1            0
 3  blue                1              0            0
 4  red                 0              0            1
Parameters
  • features – List of features to transform.

  • one – Value to encode when a category is present.

  • prefix – Prefix for the names of the transformed features.

  • zero – Value to encode when category is absent.

fit(df)[source]

Fit OneHotEncoder to the DataFrame.

Parameters

df – A vaex DataFrame.

one

Value to encode when a category is present.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted OneHotEncoder.

Parameters

df – A vaex DataFrame.

Returns

A shallow copy of the DataFrame that includes the encodings.

Return type

DataFrame

uniques_

The unique elements found in each feature.

zero

Value to encode when category is absent.

class vaex.ml.transformations.MultiHotEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical columns according to a binary multi-hot scheme.

With Multi-Hot Encoder (sometimes called Binary Encoder), the categorical variables are first ordinal encoded, and those encodings are converted to a binary number. Each digit of that binary number is a separate column, containing either a “0” or a “1”. This is can be considered as an improvement over the One-Hot encoder as it guards against generating too many new columns when the cardinality of the categorical column is high, while effecively removing the ordinality that an Ordinal Encoder would introduce.

Example:

>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
#  color
0  red
1  green
2  green
3  blue
4  red
>>> encoder = vaex.ml.MultiHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
#  color      color_0    color_1    color_2
0  red              0          1          1
1  green            0          1          0
2  green            0          1          0
3  blue             0          0          1
4  red              0          1          1
Parameters
  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

fit(df)[source]

Fit MultiHotEncoder to the DataFrame.

Parameters

df – A vaex DataFrame.

labels_

The ordinal-encoded labels of each feature.

prefix

Prefix for the names of the transformed features.

transform(df)[source]

Transform a DataFrame with a fitted MultiHotEncoder.

Parameters

df – A vaex DataFrame.

Returns

A shallow copy of the DataFrame that includes the encodings.

Return type

DataFrame

class vaex.ml.transformations.PCA(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Transform a set of features using a Principal Component Analysis.

Example:

>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
 #   x   y
 0   2   -2
 1   5   3
 2   7   0
 3   2   0
 4   15  10
>>> pca = vaex.ml.PCA(n_components=2, features=['x', 'y'])
>>> pca.fit_transform(df)
 #    x    y       PCA_0      PCA_1
 0    2   -2    5.92532    0.413011
 1    5    3    0.380494  -1.39112
 2    7    0    0.840049   2.18502
 3    2    0    4.61287   -1.09612
 4   15   10  -11.7587    -0.110794
Parameters
  • features – List of features to transform.

  • n_components – Number of components to retain. If None, all the components will be retained.

  • prefix – Prefix for the names of the transformed features.

  • whiten – If True perform whitening, i.e. remove the relative variance schale of the transformed components.

eigen_values_

The eigen values that correspond to each feature.

eigen_vectors_

The eigen vectors corresponding to each feature

explained_variance_

Variance explained by each of the components. Same as the eigen values.

explained_variance_ratio_

Percentage of variance explained by each of the selected components.

fit(df, progress=None)[source]

Fit the PCA model to the DataFrame.

Parameters
  • df – A vaex DataFrame.

  • progress – If True or ‘widget’, display a progressbar of the fitting process.

means_

The mean of each feature

n_components

Number of components to retain. If None, all the components will be retained.

prefix

Prefix for the names of the transformed features.

transform(df, n_components=None)[source]

Apply the PCA transformation to the DataFrame.

Parameters
  • df – A vaex DataFrame.

  • n_components – The number of PCA components to retain.

Return copy

A shallow copy of the DataFrame that includes the PCA components.

Return type

DataFrame

whiten

If True perform whitening, i.e. remove the relative variance schale of the transformed components.

class vaex.ml.transformations.RobustScaler(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

The RobustScaler removes the median and scales the data according to a given percentile range. By default, the scaling is done between the 25th and the 75th percentile. Centering and scaling happens independently for each feature (column).

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
 #    x    y
 0    2   -2
 1    5    3
 2    7    0
 3    2    0
 4   15   10
>>> scaler = vaex.ml.MaxAbsScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
 #    x    y    robust_scaled_x    robust_scaled_y
 0    2   -2       -0.333686             -0.266302
 1    5    3       -0.000596934           0.399453
 2    7    0        0.221462              0
 3    2    0       -0.333686              0
 4   15   10        1.1097                1.33151
Parameters
  • features – List of features to transform.

  • percentile_range – The percentile range to which to scale each feature to.

  • prefix – Prefix for the names of the transformed features.

  • with_centering – If True, remove the median.

  • with_scaling – If True, scale each feature between the specified percentile range.

center_

The median of each feature.

fit(df)[source]

Fit RobustScaler to the DataFrame.

Parameters

df – A vaex DataFrame.

percentile_range

The percentile range to which to scale each feature to.

prefix

Prefix for the names of the transformed features.

scale_

The percentile range for each feature.

transform(df)[source]

Transform a DataFrame with a fitted RobustScaler.

Parameters

df – A vaex DataFrame.

Returns copy

a shallow copy of the DataFrame that includes the scaled features.

Return type

DataFrame

with_centering

If True, remove the median.

with_scaling

If True, scale each feature between the specified percentile range.

class vaex.ml.transformations.StandardScaler(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Standardize features by removing thir mean and scaling them to unit variance.

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
 #    x    y
 0    2   -2
 1    5    3
 2    7    0
 3    2    0
 4   15   10
>>> scaler = vaex.ml.StandardScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
 #    x    y    standard_scaled_x    standard_scaled_y
 0    2   -2            -0.876523            -0.996616
 1    5    3            -0.250435             0.189832
 2    7    0             0.166957            -0.522037
 3    2    0            -0.876523            -0.522037
 4   15   10             1.83652              1.85086
Parameters
  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

  • with_mean – If True, remove the mean from each feature.

  • with_std – If True, scale each feature to unit variance.

fit(df)[source]

Fit StandardScaler to the DataFrame.

Parameters

df – A vaex DataFrame.

mean_

The mean of each feature

prefix

Prefix for the names of the transformed features.

std_

The standard deviation of each feature.

transform(df)[source]

Transform a DataFrame with a fitted StandardScaler.

Parameters

df – A vaex DataFrame.

Returns copy

a shallow copy of the DataFrame that includes the scaled features.

Return type

DataFrame

with_mean

If True, remove the mean from each feature.

with_std

If True, scale each feature to unit variance.

class vaex.ml.transformations.CycleTransformer(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

A strategy for transforming cyclical features (e.g. angles, time).

Think of each feature as an angle of a unit circle in polar coordinates, and then and then obtaining the x and y coordinate projections, or the cos and sin components respectively.

Suitable for a variaty of machine learning tasks. It preserves the cyclical continuity of the feature. Inspired by: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html

>>> df = vaex.from_arrays(days=[0, 1, 2, 3, 4, 5, 6])
>>> cyctrans = vaex.ml.CycleTransformer(n=7, features=['days'])
>>> cyctrans.fit_transform(df)
  #    days     days_x     days_y
  0       0   1          0
  1       1   0.62349    0.781831
  2       2  -0.222521   0.974928
  3       3  -0.900969   0.433884
  4       4  -0.900969  -0.433884
  5       5  -0.222521  -0.974928
  6       6   0.62349   -0.781831
Parameters
  • features – List of features to transform.

  • n – The number of elements in one cycle.

  • prefix_x – Prefix for the x-component of the transformed features.

  • prefix_y – Prefix for the y-component of the transformed features.

  • suffix_x – Suffix for the x-component of the transformed features.

  • suffix_y – Suffix for the y-component of the transformed features.

fit(df)[source]

Fit a CycleTransformer to the DataFrame.

This is a dummy method, as it is not needed for the transformation to be applied.

Parameters

df – A vaex DataFrame.

n

The number of elements in one cycle.

prefix_x

Prefix for the x-component of the transformed features.

prefix_y

Prefix for the y-component of the transformed features.

suffix_x

Suffix for the x-component of the transformed features.

suffix_y

Suffix for the y-component of the transformed features.

transform(df)[source]

Transform a DataFrame with a CycleTransformer.

Parameters

df – A vaex DataFrame.

class vaex.ml.transformations.BayesianTargetEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical variables with a Bayesian Target Encoder.

The categories are encoded by the mean of their target value, which is adjusted by the global mean value of the target variable using a Bayesian schema. For a larger weight value, the target encodings are smoothed toward the global mean, while for a weight of 0, the encodings are just the mean target value per class.

Reference: https://www.wikiwand.com/en/Bayes_estimator#/Practical_example_of_Bayes_estimators

Example:

>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
...                       y=[1, 1, 1, 0, 0, 0, 0, 1])
>>> target_encoder = vaex.ml.BayesianTargetEncoder(features=['x'], weight=4)
>>> target_encoder.fit_transform(df, 'y')
  #  x      y    mean_encoded_x
  0  a      1             0.625
  1  a      1             0.625
  2  a      1             0.625
  3  a      0             0.625
  4  b      0             0.375
  5  b      0             0.375
  6  b      0             0.375
  7  b      1             0.375
Parameters
  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

  • target – The name of the column containing the target variable.

  • unseen – Strategy to deal with unseen values.

  • weight – Weight to be applied to the mean encodings (smoothing parameter).

fit(df)[source]

Fit a BayesianTargetEncoder to the DataFrame.

Parameters

df – A vaex DataFrame

prefix

Prefix for the names of the transformed features.

target

The name of the column containing the target variable.

transform(df)[source]

Transform a DataFrame with a fitted BayesianTargetEncoder.

Parameters

df – A vaex DataFrame.

Returns

A shallow copy of the DataFrame that includes the encodings.

Return type

DataFrame

unseen

Strategy to deal with unseen values.

weight

Weight to be applied to the mean encodings (smoothing parameter).

class vaex.ml.transformations.WeightOfEvidenceEncoder(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Encode categorical variables with a Weight of Evidence Encoder.

Weight of Evidence measures how well a particular feature supports the given hypothesis (i.e. the target variable). With this encoder, each category in a categorical feature is encoded by its “strength” i.e. Weight of Evidence value. The target feature can be a boolean or numerical column, where True/1 is seen as ‘Good’, and False/0 is seen as ‘Bad’

Reference: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Example:

>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'b', 'b', 'b', 'c', 'c'],
...                       y=[1, 1, 0, 0, 1, 1, 0])
>>> woe_encoder = vaex.ml.WeightOfEvidenceEncoder(target='y', features=['x'])
>>> woe_encoder.fit_transform(df)
  #  x      y    mean_encoded_x
  0  a      1         13.8155
  1  a      1         13.8155
  2  b      0         -0.693147
  3  b      0         -0.693147
  4  b      1         -0.693147
  5  c      1          0
  6  c      0          0
Parameters
  • epsilon – Small value taken as minimum fot the negatives, to avoid a division by zero

  • features – List of features to transform.

  • prefix – Prefix for the names of the transformed features.

  • target – The name of the column containing the target variable.

  • unseen – Strategy to deal with unseen values.

epsilon

Small value taken as minimum fot the negatives, to avoid a division by zero

fit(df)[source]

Fit a WeightOfEvidenceEncoder to the DataFrame.

Parameters

df – A vaex DataFrame

prefix

Prefix for the names of the transformed features.

target

The name of the column containing the target variable.

transform(df)[source]

Transform a DataFrame with a fitted WeightOfEvidenceEncoder.

Parameters

df – A vaex DataFrame.

Returns

A shallow copy of the DataFrame that includes the encodings.

Return type

DataFrame

unseen

Strategy to deal with unseen values.

class vaex.ml.transformations.KBinsDiscretizer(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

Bin continous features into discrete bins.

A stretegy to encode continuous features into discrete bins. The transformed columns contain the bin label each sample falls into. In a way this transformer Label/Ordinal encodes continous features.

Example:

>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=[0, 2.5, 5, 7.5, 10, 12.5, 15])
>>> bin_trans = vaex.ml.KBinsDiscretizer(features=['x'], n_bins=3, strategy='uniform')
>>> bin_trans.fit_transform(df)
  #     x    binned_x
  0   0             0
  1   2.5           0
  2   5             1
  3   7.5           1
  4  10             2
  5  12.5           2
  6  15             2
Parameters
  • epsilon – Tiny value added to the bin edges ensuring samples close to the bin edges are binned correcly.

  • features – List of features to transform.

  • n_bins – Number of bins. Must be greater than 1.

  • prefix – Prefix for the names of the transformed features.

  • strategy – Strategy used to define the widths of the bins. Can be either “uniform”, “quantile” or “kmeans”.

bin_edges_

The bin edges for each binned feature

epsilon

Tiny value added to the bin edges ensuring samples close to the bin edges are binned correcly.

fit(df)[source]

Fit KBinsDiscretizer to the DataFrame.

Parameters

df – A vaex DataFrame.

n_bins

Number of bins. Must be greater than 1.

n_bins_

Number of bins per feature.

prefix

Prefix for the names of the transformed features.

strategy

Strategy used to define the widths of the bins. Can be either “uniform”, “quantile” or “kmeans”.

transform(df)[source]

Transform a DataFrame with a fitted KBinsDiscretizer.

Parameters

df – A vaex DataFrame.

Returns copy

a shallow copy of the DataFrame that includes the binned features.

Return type

DataFrame

class vaex.ml.transformations.GroupByTransformer(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

The GroupByTransformer creates aggregations via the groupby operation, which are joined to a DataFrame. This is useful for creating aggregate features.

Example:

>>> import vaex
>>> import vaex.ml
>>> df_train = vaex.from_arrays(x=['dog', 'dog', 'dog', 'cat', 'cat'], y=[2, 3, 4, 10, 20])
>>> df_test = vaex.from_arrays(x=['dog', 'cat', 'dog', 'mouse'], y=[5, 5, 5, 5])
>>> group_trans = vaex.ml.GroupByTransformer(by='x', agg={'mean_y': vaex.agg.mean('y')}, rsuffix='_agg')
>>> group_trans.fit_transform(df_train)
  #  x      y  x_agg      mean_y
  0  dog    2  dog             3
  1  dog    3  dog             3
  2  dog    4  dog             3
  3  cat   10  cat            15
  4  cat   20  cat            15
>>> group_trans.transform(df_test)
  #  x        y  x_agg    mean_y
  0  dog      5  dog      3.0
  1  cat      5  cat      15.0
  2  dog      5  dog      3.0
  3  mouse    5  --       --
Parameters
  • agg – Dict where the keys are feature names and the values are vaex.agg objects.

  • by – The feature on which to do the grouping.

  • features – List of features to transform.

  • rprefix – Prefix for the names of the aggregate features in case of a collision.

  • rsuffix – Suffix for the names of the aggregate features in case of a collision.

agg

Dict where the keys are feature names and the values are vaex.agg objects.

by

The feature on which to do the grouping.

fit(df)[source]

Fit GroupByTransformer to the DataFrame.

Parameters

df – A vaex DataFrame.

rprefix

Prefix for the names of the aggregate features in case of a collision.

rsuffix

Suffix for the names of the aggregate features in case of a collision.

transform(df)[source]

Transform a DataFrame with a fitted GroupByTransformer.

Parameters

df – A vaex DataFrame.

Returns copy

a shallow copy of the DataFrame that includes the aggregated features.

Return type

DataFrame

Clustering

vaex.ml.cluster.KMeans(**kwargs)

The KMeans clustering algorithm.

class vaex.ml.cluster.KMeans(**kwargs: Any)[source]

Bases: vaex.ml.transformations.Transformer

The KMeans clustering algorithm.

Example:

>>> import vaex.ml
>>> import vaex.ml.cluster
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> cls = vaex.ml.cluster.KMeans(n_clusters=3, features=features, init='random', max_iter=10)
>>> cls.fit(df)
>>> df = cls.transform(df)
>>> df.head(5)
 #    sepal_width    petal_length    sepal_length    petal_width    class_    prediction_kmeans
 0            3               4.2             5.9            1.5         1                    2
 1            3               4.6             6.1            1.4         1                    2
 2            2.9             4.6             6.6            1.3         1                    2
 3            3.3             5.7             6.7            2.1         2                    0
 4            4.2             1.4             5.5            0.2         0                    1
Parameters
  • cluster_centers – Coordinates of cluster centers.

  • features – List of features to cluster.

  • inertia – Sum of squared distances of samples to their closest cluster center.

  • init – Method for initializing the centroids.

  • max_iter – Maximum number of iterations of the KMeans algorithm for a single run.

  • n_clusters – Number of clusters to form.

  • n_init – Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.

  • prediction_label – The name of the virtual column that houses the cluster labels for each point.

  • random_state – Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.

  • verbose – If True, enable verbosity mode.

cluster_centers

Coordinates of cluster centers.

features

List of features to cluster.

fit(dataframe)[source]

Fit the KMeans model to the dataframe.

Parameters

dataframe – A vaex DataFrame.

inertia

Sum of squared distances of samples to their closest cluster center.

init

Method for initializing the centroids.

max_iter

Maximum number of iterations of the KMeans algorithm for a single run.

n_clusters

Number of clusters to form.

n_init

Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.

prediction_label

The name of the virtual column that houses the cluster labels for each point.

random_state

Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.

transform(dataframe)[source]

Label a DataFrame with a fitted KMeans model.

Parameters

dataframe – A vaex DataFrame.

Returns copy

A shallow copy of the DataFrame that includes the cluster labels.

Return type

DataFrame

verbose

If True, enable verbosity mode.

Metrics

class vaex.ml.metrics.DataFrameAccessorMetrics(ml)[source]

Bases: object

Common metrics for evaluating machine learning tasks.

This DataFrame Accessor contains a number of common machine learning evaluation metrics. The idea is that the metrics can be evaluated out-of-core, and without the need to materialize the target and predicted columns.

See https://vaex.io/docs/api.html#metrics for a list of supported evaluation metrics.

accuracy_score(y_true, y_pred, selection=None, array_type='python')[source]

Calculates the accuracy classification score.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The accuracy score.

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0], y_pred=[1, 0, 0, 1, 1])
>>> df.ml.metrics.accuracy_score(df.y_true, df.y_pred)
  0.6
classification_report(y_true, y_pred, average='binary', decimals=3)[source]

Returns a text report showing the main classification metrics

The accuracy, precision, recall, and F1-score are shown.

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> report = df.ml.metrics.classification_report(df.y_true, df.y_pred)
>>> print(report)
>>> print(report)
    Classification report:

Accuracy: 0.667 Precision: 0.75 Recall: 0.75 F1: 0.75

confusion_matrix(y_true, y_pred, selection=None, array_type=None)[source]

Docstrings :param y_true: expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y :param y_pred: expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y :param selection: Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections :param array_type: Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list :returns: The confusion matrix

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.confusion_matrix(df.y_true, df.y_pred)
  array([[1, 1],
         [1, 3]]
f1_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]

Calculates the F1 score.

This is the harmonic average between the precision and the recall.

For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.

For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.

Parameters
  • y_true – {expression_one}

  • y_pred – {expression_one}

  • average – Should be either ‘binary’ or ‘macro’.

  • selection – {selection}

  • array_type – {array_type}

Returns

The recall score

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.recall_score(df.y_true, df.y_pred)
  0.75
matthews_correlation_coefficient(y_true, y_pred, selection=None, array_type=None)[source]

Calculates the Matthews correlation coefficient.

This metric can be used for both binary and multiclass classification problems.

Parameters
  • y_true – {expression_one}

  • y_pred – {expression_one}

  • selection – {selection}

Returns

The Matthews correlation coefficient.

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.matthews_correlation_coefficient(df.y_true, df.y_pred)
  0.25
mean_absolute_error(y_true, y_pred, selection=None, array_type='python')[source]

Calculate the mean absolute error.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The mean absolute error

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.mean_absolute_error(df.sepal_length, df.petal_length)
  2.0846666666666667
mean_squared_error(y_true, y_pred, selection=None, array_type='python')[source]

Calculates the mean squared error.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The mean squared error

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.mean_squared_error(df.sepal_length, df.petal_length)
  5.589000000000001
precision_recall_fscore(y_true, y_pred, average='binary', selection=None, array_type=None)[source]

Calculates the precision, recall and f1 score for a classification problem.

These metrics are defined as follows: - precision = tp / (tp + fp) - recall = tp / (tp + fn) - f1 = tp / (tp + 0.5 * (fp + fn)) where “tp” are true positives, “fp” are false positives, and “fn” are false negatives.

For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.

For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.

Y_true

expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

Y_pred

expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

Average

Should be either ‘binary’ or ‘macro’.

Selection

Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

Array_type

Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The precision, recall and f1 score

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.precision_score(df.y_true, df.y_pred)
  (0.75, 0.75, 0.75)
precision_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]

Calculates the precision classification score.

For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.

For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • average – Should be either ‘binary’ or ‘macro’.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The precision score

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.precision_score(df.y_true, df.y_pred)
  0.75
r2_score(y_true, y_pred)[source]

Calculates the R**2 (coefficient of determination) regression score function.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The R**2 score

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.r2_score(df.sepal_length, df.petal_length)
  -7.205575765485069
recall_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]

Calculates the recall classification score.

For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.

For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.

Parameters
  • y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • average – Should be either ‘binary’ or ‘macro’.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns

The recall score

Example:

>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.recall_score(df.y_true, df.y_pred)
  0.75

Scikit-learn

vaex.ml.sklearn.IncrementalPredictor(**kwargs)

This class wraps any scikit-learn estimator (a.k.a predictions) that has a .partial_fit method, and makes it a vaex pipeline object.

vaex.ml.sklearn.Predictor(**kwargs)

This class wraps any scikit-learn estimator (a.k.a predictor) making it a vaex pipeline object.

class vaex.ml.sklearn.IncrementalPredictor(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

This class wraps any scikit-learn estimator (a.k.a predictions) that has a .partial_fit method, and makes it a vaex pipeline object.

By wrapping “on-line” scikit-learn estimators with this class, they become a vaex pipeline object. Thus, they can take full advantage of the serialization and pipeline system of vaex. While the underlying estimator need to call the .partial_fit method, this class contains the standard .fit method, and the rest happens behind the scenes. One can also iterate over the data multiple times (epochs), and optionally shuffle each batch before it is sent to the estimator. The predict method returns a numpy array, while the transform method adds the prediction as a virtual column to a vaex DataFrame.

Note: the .fit method will use as much memory as needed to copy one batch of data, while the .predict method will require as much memory as needed to output the predictions as a numpy array. The transform method is evaluated lazily, and no memory copies are made.

Note: we are using normal sklearn without modifications here.

Example:

>>> import vaex
>>> import vaex.ml
>>> from vaex.ml.sklearn import IncrementalPredictor
>>> from sklearn.linear_model import SGDRegressor
>>>
>>> df = vaex.example()
>>>
>>> features = df.column_names[:6]
>>> target = 'FeH'
>>>
>>> standard_scaler = vaex.ml.StandardScaler(features=features)
>>> df = standard_scaler.fit_transform(df)
>>>
>>> features = df.get_column_names(regex='^standard')
>>> model = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=42)
>>>
>>> incremental = IncrementalPredictor(model=model,
...                                    features=features,
...                                    target=target,
...                                    batch_size=10_000,
...                                    num_epochs=3,
...                                    shuffle=True,
...                                    prediction_name='pred_FeH')
>>> incremental.fit(df=df)
>>> df = incremental.transform(df)
>>> df.head(5)[['FeH', 'pred_FeH']]
  #        FeH    pred_FeH
  0  -2.30923     -1.66226
  1  -1.78874     -1.68218
  2  -0.761811    -1.59562
  3  -1.52088     -1.62225
  4  -2.65534     -1.61991
Parameters
  • batch_size – Number of samples to be sent to the model in each batch.

  • features – List of features to use.

  • model – A scikit-learn estimator with a .fit_predict method.

  • num_epochs – Number of times each batch is sent to the model.

  • partial_fit_kwargs – A dictionary of key word arguments to be passed on to the fit_predict method of the model.

  • prediction_name – The name of the virtual column housing the predictions.

  • prediction_type – Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.

  • shuffle – If True, shuffle the samples before sending them to the model.

  • target – The name of the target column.

batch_size

Number of samples to be sent to the model in each batch.

features

List of features to use.

fit(df, progress=None)[source]

Fit the IncrementalPredictor to the DataFrame.

Parameters
  • df – A vaex DataFrame containing the features and target on which to train the model.

  • progress – If True, display a progressbar which tracks the training progress.

model

A scikit-learn estimator with a .fit_predict method.

num_epochs

Number of times each batch is sent to the model.

partial_fit_kwargs

A dictionary of key word arguments to be passed on to the fit_predict method of the model.

predict(df)[source]

Get an in-memory numpy array with the predictions of the Predictor

Parameters

df – A vaex DataFrame, containing the input features.

Returns

A in-memory numpy array containing the Predictor predictions.

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

prediction_type

Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.

shuffle

If True, shuffle the samples before sending them to the model.

snake_name = 'sklearn_incremental_predictor'
target

The name of the target column.

transform(df)[source]

Transform a DataFrame such that it contains the predictions of the IncrementalPredictor. in form of a virtual column.

Parameters

df – A vaex DataFrame.

Return copy

A shallow copy of the DataFrame that includes the IncrementalPredictor prediction as a virtual column.

Return type

DataFrame

class vaex.ml.sklearn.Predictor(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

This class wraps any scikit-learn estimator (a.k.a predictor) making it a vaex pipeline object.

By wrapping any scikit-learn estimators with this class, it becomes a vaex pipeline object. Thus, it can take full advantage of the serialization and pipeline system of vaex. One can use the predict method to get a numpy array as an output of a fitted estimator, or the transform method do add such a prediction to a vaex DataFrame as a virtual column.

Note that a full memory copy of the data used is created when the fit and predict are called. The transform method is evaluated lazily.

The scikit-learn estimators themselves are not modified at all, they are taken from your local installation of scikit-learn.

Example:

>>> import vaex.ml
>>> from vaex.ml.sklearn import Predictor
>>> from sklearn.linear_model import LinearRegression
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length']
>>> df_train, df_test = df.ml.train_test_split()
>>> model = Predictor(model=LinearRegression(), features=features, target='petal_width', prediction_name='pred')
>>> model.fit(df_train)
>>> df_train = model.transform(df_train)
>>> df_train.head(3)
 #    sepal_length    sepal_width    petal_length    petal_width    class_      pred
 0             5.4            3               4.5            1.5         1  1.64701
 1             4.8            3.4             1.6            0.2         0  0.352236
 2             6.9            3.1             4.9            1.5         1  1.59336
>>> df_test = model.transform(df_test)
>>> df_test.head(3)
 #    sepal_length    sepal_width    petal_length    petal_width    class_     pred
 0             5.9            3               4.2            1.5         1  1.39437
 1             6.1            3               4.6            1.4         1  1.56469
 2             6.6            2.9             4.6            1.3         1  1.44276
Parameters
  • features – List of features to use.

  • model – A scikit-learn estimator.

  • prediction_name – The name of the virtual column housing the predictions.

  • prediction_type – Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.

  • target – The name of the target column.

features

List of features to use.

fit(df, **kwargs)[source]

Fit the Predictor to the DataFrame.

Parameters

df – A vaex DataFrame containing the features and target on which to train the model.

model

A scikit-learn estimator.

predict(df)[source]

Get an in-memory numpy array with the predictions of the Predictor.

Parameters

df – A vaex DataFrame, containing the input features.

Returns

A in-memory numpy array containing the Predictor predictions.

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

prediction_type

Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.

snake_name = 'sklearn_predictor'
target

The name of the target column.

transform(df)[source]

Transform a DataFrame such that it contains the predictions of the Predictor. in form of a virtual column.

Parameters

df – A vaex DataFrame.

Return copy

A shallow copy of the DataFrame that includes the Predictor prediction as a virtual column.

Return type

DataFrame

Boosted trees

vaex.ml.lightgbm.LightGBMModel(**kwargs)

The LightGBM algorithm.

vaex.ml.xgboost.XGBoostModel(**kwargs)

The XGBoost algorithm.

vaex.ml.catboost.CatBoostModel(**kwargs)

The CatBoost algorithm.

class vaex.ml.lightgbm.LightGBMModel(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

The LightGBM algorithm.

This class provides an interface to the LightGBM algorithm, with some optimizations for better memory efficiency when training large datasets. The algorithm itself is not modified at all.

LightGBM is a fast gradient boosting algorithm based on decision trees and is mainly used for classification, regression and ranking tasks. It is under the umbrella of the Distributed Machine Learning Toolkit (DMTK) project of Microsoft. For more information, please visit https://github.com/Microsoft/LightGBM/.

Example:

>>> import vaex.ml
>>> import vaex.ml.lightgbm
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
    'boosting': 'gbdt',
    'max_depth': 5,
    'learning_rate': 0.1,
    'application': 'multiclass',
    'num_class': 3,
    'subsample': 0.80,
    'colsample_bytree': 0.80}
>>> booster = vaex.ml.lightgbm.LightGBMModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
 #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
 0            3               4.5             5.4            1.5         1    [0.00165619 0.98097899 0.01736482]
 1            3.4             1.6             4.8            0.2         0    [9.99803930e-01 1.17346471e-04 7.87235133e-05]
 2            3.1             4.9             6.9            1.5         1    [0.00107541 0.9848717  0.01405289]
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
 #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
 0            3               4.2             5.9            1.5         1    [0.00208904 0.9821348  0.01577616]
 1            3               4.6             6.1            1.4         1    [0.00182039 0.98491357 0.01326604]
 2            2.9             4.6             6.6            1.3         1    [2.50915444e-04 9.98431777e-01 1.31730785e-03]
Parameters
  • features – List of features to use when fitting the LightGBMModel.

  • num_boost_round – Number of boosting iterations.

  • params – parameters to be passed on the to the LightGBM model.

  • prediction_name – The name of the virtual column housing the predictions.

  • target – The name of the target column.

features

List of features to use when fitting the LightGBMModel.

fit(df, valid_sets=None, valid_names=None, early_stopping_rounds=None, evals_result=None, verbose_eval=None, **kwargs)[source]

Fit the LightGBMModel to the DataFrame.

The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds rounds to continue training. Requires at least one validation DataFrame, metric specified. If there’s more than one, will check all of them, but the training data is ignored anyway. If early stopping occurs, the model will add best_iteration field to the booster object.

Parameters
  • df – A vaex DataFrame containing the features and target on which to train the model.

  • valid_sets (list) – A list of DataFrames to be used for validation.

  • valid_names (list) – A list of strings to label the validation sets.

  • int (early_stopping_rounds) – Activates early stopping.

  • evals_result (dict) – A dictionary storing the evaluation results of all valid_sets.

  • verbose_eval (bool) – Requires at least one item in valid_sets. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.

num_boost_round

Number of boosting iterations.

params

parameters to be passed on the to the LightGBM model.

predict(df, **kwargs)[source]

Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex DataFrame. This method accepts the key word arguments of the predict method from LightGBM.

Parameters

df – A vaex DataFrame.

Returns

A in-memory numpy array containing the LightGBMModel predictions.

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

target

The name of the target column.

transform(df)[source]

Transform a DataFrame such that it contains the predictions of the LightGBMModel in form of a virtual column.

Parameters

df – A vaex DataFrame.

Return copy

A shallow copy of the DataFrame that includes the LightGBMModel prediction as a virtual column.

Return type

DataFrame

class vaex.ml.xgboost.XGBoostModel(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

The XGBoost algorithm.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. (https://github.com/dmlc/xgboost)

Example:

>>> import vaex
>>> import vaex.ml.xgboost
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
    'max_depth': 5,
    'learning_rate': 0.1,
    'objective': 'multi:softmax',
    'num_class': 3,
    'subsample': 0.80,
    'colsample_bytree': 0.80,
    'silent': 1}
>>> booster = vaex.ml.xgboost.XGBoostModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
#    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
0             5.4            3               4.5            1.5         1                     1
1             4.8            3.4             1.6            0.2         0                     0
2             6.9            3.1             4.9            1.5         1                     1
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
#    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
0             5.9            3               4.2            1.5         1                     1
1             6.1            3               4.6            1.4         1                     1
2             6.6            2.9             4.6            1.3         1                     1
Parameters
  • features – List of features to use when fitting the XGBoostModel.

  • num_boost_round – Number of boosting iterations.

  • params – A dictionary of parameters to be passed on to the XGBoost model.

  • prediction_name – The name of the virtual column housing the predictions.

  • target – The name of the target column.

features

List of features to use when fitting the XGBoostModel.

fit(df, evals=(), early_stopping_rounds=None, evals_result=None, verbose_eval=False, **kwargs)[source]

Fit the XGBoost model given a DataFrame.

This method accepts all key word arguments for the xgboost.train method.

Parameters
  • df – A vaex DataFrame containing the features and target on which to train the model.

  • evals – A list of pairs (DataFrame, string). List of items to be evaluated during training, this allows user to watch performance on the validation set.

  • early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one).

  • evals_result (dict) – A dictionary storing the evaluation results of all the items in evals.

  • verbose_eval (bool) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.

num_boost_round

Number of boosting iterations.

params

A dictionary of parameters to be passed on to the XGBoost model.

predict(df, **kwargs)[source]

Provided a vaex DataFrame, get an in-memory numpy array with the predictions from the XGBoost model. This method accepts the key word arguments of the predict method from XGBoost.

Returns

A in-memory numpy array containing the XGBoostModel predictions.

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

target

The name of the target column.

transform(df)[source]

Transform a DataFrame such that it contains the predictions of the XGBoostModel in form of a virtual column.

Parameters

df – A vaex DataFrame. It should have the same columns as the DataFrame used to train the model.

Return copy

A shallow copy of the DataFrame that includes the XGBoostModel prediction as a virtual column.

Return type

DataFrame

class vaex.ml.catboost.CatBoostModel(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

The CatBoost algorithm.

This class provides an interface to the CatBoost aloritham. CatBoost is a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. For more information please visit https://github.com/catboost/catboost

Example:

>>> import vaex
>>> import vaex.ml.catboost
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
    'leaf_estimation_method': 'Gradient',
    'learning_rate': 0.1,
    'max_depth': 3,
    'bootstrap_type': 'Bernoulli',
    'objective': 'MultiClass',
    'eval_metric': 'MultiClass',
    'subsample': 0.8,
    'random_state': 42,
    'verbose': 0}
>>> booster = vaex.ml.catboost.CatBoostModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
#    sepal_length    sepal_width    petal_length    petal_width    class_  catboost_prediction
0             5.4            3               4.5            1.5         1  [0.00615039 0.98024259 0.01360702]
1             4.8            3.4             1.6            0.2         0  [0.99034267 0.00526382 0.0043935 ]
2             6.9            3.1             4.9            1.5         1  [0.00688241 0.95190908 0.04120851]
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
#    sepal_length    sepal_width    petal_length    petal_width    class_  catboost_prediction
0             5.9            3               4.2            1.5         1  [0.00464228 0.98883351 0.00652421]
1             6.1            3               4.6            1.4         1  [0.00350424 0.9882139  0.00828186]
2             6.6            2.9             4.6            1.3         1  [0.00325705 0.98891631 0.00782664]
Parameters
  • batch_size – If provided, will train in batches of this size.

  • batch_weights – Weights to sum models at the end of training in batches.

  • ctr_merge_policy – Strategy for summing up models. Only used when training in batches. See the CatBoost documentation for more info.

  • evals_result – Evaluation results

  • features – List of features to use when fitting the CatBoostModel.

  • num_boost_round – Number of boosting iterations.

  • params – A dictionary of parameters to be passed on to the CatBoostModel model.

  • pool_params – A dictionary of parameters to be passed to the Pool data object construction

  • prediction_name – The name of the virtual column housing the predictions.

  • prediction_type – The form of the predictions. Can be “RawFormulaVal”, “Probability” or “Class”.

  • target – The name of the target column.

batch_size

If provided, will train in batches of this size.

batch_weights

Weights to sum models at the end of training in batches.

ctr_merge_policy

Strategy for summing up models. Only used when training in batches. See the CatBoost documentation for more info.

evals_result_

Evaluation results

features

List of features to use when fitting the CatBoostModel.

fit(df, evals=None, early_stopping_rounds=None, verbose_eval=None, plot=False, progress=None, **kwargs)[source]

Fit the CatBoostModel model given a DataFrame. This method accepts all key word arguments for the catboost.train method.

Parameters
  • df – A vaex DataFrame containing the features and target on which to train the model.

  • evals – A list of DataFrames to be evaluated during training. This allows user to watch performance on the validation sets.

  • early_stopping_rounds (int) – Activates early stopping.

  • verbose_eval (bool) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.

  • plot (bool) – if True, display an interactive widget in the Jupyter notebook of how the train and validation sets score on each boosting iteration.

  • progress – If True display a progressbar when the training is done in batches.

num_boost_round

Number of boosting iterations.

params

A dictionary of parameters to be passed on to the CatBoostModel model.

pool_params

A dictionary of parameters to be passed to the Pool data object construction

predict(df, **kwargs)[source]

Provided a vaex DataFrame, get an in-memory numpy array with the predictions from the CatBoostModel model. This method accepts the key word arguments of the predict method from catboost.

Parameters

df – a vaex DataFrame

Returns

A in-memory numpy array containing the CatBoostModel predictions.

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

prediction_type

The form of the predictions. Can be “RawFormulaVal”, “Probability” or “Class”.

target

The name of the target column.

transform(df)[source]

Transform a DataFrame such that it contains the predictions of the CatBoostModel in form of a virtual column.

Parameters

df – A vaex DataFrame. It should have the same columns as the DataFrame used to train the model.

Return copy

A shallow copy of the DataFrame that includes the CatBoostModel prediction as a virtual column.

Return type

DataFrame

Tensorflow

Incubator/experimental

These models are in the incubator phase and may disappear in the future

class vaex.ml.incubator.annoy.ANNOYModel(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

Parameters
  • features – List of features to use.

  • metric – Metric to use for distance calculations

  • n_neighbours – Now many neighbours

  • n_trees – Number of trees to build.

  • predcition_name – Output column name for the neighbours when transforming a DataFrame

  • prediction_name – Output column name for the neighbours when transforming a DataFrame

  • search_k – Jovan?

features

List of features to use.

metric

Metric to use for distance calculations

n_neighbours

Now many neighbours

n_trees

Number of trees to build.

predcition_name

Output column name for the neighbours when transforming a DataFrame

prediction_name

Output column name for the neighbours when transforming a DataFrame

search_k

Jovan?

class vaex.ml.incubator.river.RiverModel(**kwargs: Any)[source]

Bases: vaex.ml.state.HasState

This class wraps River (github.com/online-ml/river) estimators, making them vaex pipeline objects.

This class conveniently wraps River models making them vaex pipeline objects. Thus they take full advantage of the serialization and pipeline system of vaex. Only the River models that implement the learn_many are compatible. One can also wrap an entire River pipeline, as long as each pipeline step implements the learn_many method. With the wrapper one can iterate over the data multiple times (epochs), and optinally shuffle each batch before it is sent to the estimator. The predict method wil require as much memory as needed to output the predictions as a numpy array, while the transform method is evaluated lazily, and no memory copies are made.

Example:

>>> import vaex
>>> import vaex.ml
>>> from vaex.ml.incubator.river import RiverModel
>>> from river.linear_model import LinearRegression
>>> from river import optim
>>>
>>> df = vaex.example()
>>>
>>> features = df.column_names[:6]
>>> target = 'FeH'
>>>
>>> df = df.ml.standard_scaler(features=features, prefix='scaled_')
>>>
>>> features = df.get_column_names(regex='^scaled_')
>>> model = LinearRegression(optimizer=optim.SGD(lr=0.1), intercept_lr=0.1)
>>>
>>> river_model = RiverModel(model=model,
                        features=features,
                        target=target,
                        batch_size=10_000,
                        num_epochs=3,
                        shuffle=True,
                        prediction_name='pred_FeH')
>>>
>>> river_model.fit(df=df)
>>> df = river_model.transform(df)
>>> df.head(5)[['FeH', 'pred_FeH']]
  #       FeH    pred_FeH
  0  -1.00539    -1.6332
  1  -1.70867    -1.56632
  2  -1.83361    -1.55338
  3  -1.47869    -1.60646
  4  -1.85705    -1.5996
Parameters
  • batch_size – Number of samples to be sent to the model in each batch.

  • features – List of features to use.

  • model – A River model which implements the learn_many method.

  • num_epochs – Number of times each batch is sent to the model.

  • prediction_name – The name of the virtual column housing the predictions.

  • prediction_type – Which method to use to get the predictions. Can be “predict” or “predict_proba” which correspond to “predict_many” and “predict_proba_many in a River model respectively.

  • shuffle – If True, shuffle the samples before sending them to the model.

  • target – The name of the target column.

batch_size

Number of samples to be sent to the model in each batch.

features

List of features to use.

fit(df, progress=None)[source]

Fit the RiverModel to the DataFrame.

Parameters
  • df – A vaex DataFrame containig the features and target on which to train the model

  • progress – If True, display a progressbar which tracks the training progress.

model

A River model which implements the learn_many method.

num_epochs

Number of times each batch is sent to the model.

predict(df)[source]

Get an in memory numpy array with the predictions of the Model

Parameters

df – A vaex DataFrame containing the input features

Returns

A in-memory numpy array containing the Model predictions

Return type

numpy.array

prediction_name

The name of the virtual column housing the predictions.

prediction_type

Which method to use to get the predictions. Can be “predict” or “predict_proba” which correspond to “predict_many” and “predict_proba_many in a River model respectively.

shuffle

If True, shuffle the samples before sending them to the model.

target

The name of the target column.

transform(df)[source]

Transform A DataFrame such that it contains the predictions of the Model in a form of a virtual column.

Parameters

df – A vaex DataFrame

Return df

A vaex DataFrame

Return type

DataFrame

vaex-viz

class vaex.viz.DataFrameAccessorViz(df)[source]

Bases: object

__init__(df)[source]
__weakref__

list of weak references to the object (if defined)

healpix_heatmap(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)

Viz data in 2d using a healpix column.

Parameters
  • healpix_expression – {healpix_max_level}

  • healpix_max_level – {healpix_max_level}

  • healpix_level – {healpix_level}

  • what – {what}

  • selection – {selection}

  • grid – {grid}

  • healpix_input – Specificy if the healpix index is in “equatorial”, “galactic” or “ecliptic”.

  • healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.

  • f – function to apply to the data

  • colormap – matplotlib colormap

  • grid_limits – Optional sequence [minvalue, maxvalue] that determine the min and max value that map to the colormap (values below and above these are clipped to the the min/max). (default is [min(f(grid)), max(f(grid)))

  • image_size – size for the image that healpy uses for rendering

  • nest – If the healpix data is in nested (True) or ring (False)

  • figsize – If given, modify the matplotlib figure size. Example (14,9)

  • interactive – (Experimental, uses healpy.mollzoom is True)

  • title – Title of figure

  • smooth – apply gaussian smoothing, in degrees

  • show – Call matplotlib’s show (True) or not (False, defaut)

  • rotation – Rotatate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are degrees.

Returns

heatmap(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, colorbar_label=None, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'column': 'what', 'fade': 'selection', 'layer': 'z', 'row': 'subspace', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)

Viz data in a 2d histogram/heatmap.

Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers.

Instead of passing x and y, pass a list as x argument for multiple panels. Give what a list of options to have multiple panels. When both are present then will be origanized in a column/row order.

This methods creates a 6 dimensional ‘grid’, where each dimension can map the a visual dimension. The grid dimensions are:

  • x: shape determined by shape, content by x argument or the first dimension of each space

  • y: ,,

  • z: related to the z argument

  • selection: shape equals length of selection argument

  • what: shape equals length of what argument

  • space: shape equals length of x argument if multiple values are given

By default, this its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)

The visual dimensions are

  • x: x coordinate on a plot / image (default maps to grid’s x)

  • y: y ,, (default maps to grid’s y)

  • layer: each image in this dimension is blended togeher to one image (default maps to z)

  • fade: each image is shown faded after the next image (default mapt to selection)

  • row: rows of subplots (default maps to space)

  • columns: columns of subplot (default maps to what)

All these mappings can be changes by the visual argument, some examples:

>>> df.viz.heatmap('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])

Will plot each ‘what’ as a column.

>>> df.viz.heatmap('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))

Will plot each selection as a column, instead of a faded on top of each other.

Parameters
  • x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [[‘x’, ‘y’], [‘x’, ‘z’]], if multiple pairs are given, this dimension maps to rows by default

  • y – y (by default maps to y)

  • z – Expression to bin in the z direction, followed by a :start,end,shape signature, like ‘FeH:-3,1:5’ will produce 5 layers between -10 and 10 (by default maps to layer)

  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)

  • reduce

  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value

  • normalize – normalization function, currently only ‘normalize’ is supported

  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.

  • vmin – instead of automatic normalization, (using normalize and normalization_axis) scale the data between vmin and vmax to [0, 1]

  • vmax – see vmin

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • grid – If grid is given, instead if compuation a statistic given by what, use this Nd-numpy array instead, this is often useful when a custom computation/statistic is calculated, but you still want to use the plotting machinery.

  • colormap – matplotlib colormap to use

  • figsize – (x, y) tuple passed to plt.figure for setting the figure size

  • xlabel

  • ylabel

  • aspect

  • tight_layout – call plt.tight_layout or not

  • colorbar – plot a colorbar or not

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more

  • return_extra

Returns

histogram(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, progress=None, **kwargs)

Plot a histogram.

Example:

>>> df.histogram(df.x)
>>> df.histogram(df.x, limits=[0, 100], shape=100)
>>> df.histogram(df.x, what='mean(y)', limits=[0, 100], shape=100)

If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:

>>> counts = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.histogram(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
Parameters
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum

  • grid – If grid is given, instead if compuation a statistic given by what, use this Nd-numpy array instead, this is often useful when a custom computation/statistic is calculated, but you still want to use the plotting machinery.

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • figsize – (x, y) tuple passed to plt.figure for setting the figure size

  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value

  • n – normalization function, currently only ‘normalize’ is supported, or None for no normalization

  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • xlabel – String for label on x axis (may contain latex)

  • ylabel – Same for y axis

  • kwargs – extra argument passed to plt.plot

Param

tight_layout: call plt.tight_layout or not

Returns

scatter(x, y, xerr=None, yerr=None, cov=None, corr=None, s_expr=None, c_expr=None, labels=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, ellipse_kwargs={}, **kwargs)

Viz (small amounts) of data in 2d using a scatter plot

Convenience wrapper around plt.scatter when for working with small DataFrames or selections

Parameters
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • s_expr – When given, use if for the s (size) argument of plt.scatter

  • c_expr – When given, use if for the c (color) argument of plt.scatter

  • labels – Annotate the points with these text values

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • length_limit – maximum number of rows it will plot

  • length_check – should we do the maximum row check or not?

  • label – label for the legend

  • xlabel – label for x axis, if None .label(x) is used

  • ylabel – label for y axis, if None .label(y) is used

  • errorbar_kwargs – extra dict with arguments passed to plt.errorbar

  • kwargs – extra arguments passed to plt.scatter

Returns

class vaex.viz.ExpressionAccessorViz(expression)[source]

Bases: object

__init__(expression)[source]
__weakref__

list of weak references to the object (if defined)

histogram(what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, progress=None, **kwargs)[source]

Plot a histogram of the expression. This is a convenience method for df.histogram(…)

Example:

>>> df.x.histogram()
>>> df.x.histogram(limits=[0, 100], shape=100)
>>> df.x.histogram(what='mean(y)', limits=[0, 100], shape=100)

If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:

>>> counts = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.plot1d(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
Parameters
  • x – Expression to bin in the x direction

  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum

  • grid – If the binning is done before by yourself, you can pass it

  • facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)

  • limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’

  • figsize – (x, y) tuple passed to plt.figure for setting the figure size

  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value

  • n – normalization function, currently only ‘normalize’ is supported, or None for no normalization

  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.

  • normalize_axis

  • xlabel – String for label on x axis (may contain latex)

  • ylabel – Same for y axis

  • kwargs – extra argument passed to plt.plot

Param

tight_layout: call plt.tight_layout or not

Returns

Datasets to download

Here we list a few datasets that might be interesting to explore with vaex.

New York taxi dataset

The very well known dataset containing trip infromation from the iconic Yellow Taxi company in NYC. The raw data is curated by the Taxi & Limousine Commission (TLC).

See for instance Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance for some ideas.

One can also stream the data directly from S3. Only the data that is necessary will be streamed, and it will cached locally:

import vaex
df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
[1]:
import vaex
import warnings; warnings.filterwarnings("ignore")

df = vaex.open('/data/yellow_taxi_2009_2015_f32.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90

df.plot(df.pickup_longitude, df.pickup_latitude, f="log1p", limits=[[-74.05, -73.75], [40.58, 40.90]], show=True);
number of rows: 1,173,057,927
number of columns: 18
_images/datasets_2_1.png

Gaia - European Space Agency

Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy.

See the Gaia Science Homepage for details, and you may want to try the Gaia Archive for ADQL (SQL like) queries.

[2]:
df = vaex.open('/data/gaia-dr2-sort-by-source_id.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.plot("ra", "dec", f="log", limits=[[360, 0], [-90, 90]], show=True);
number of rows: 1,692,919,135
number of columns: 94
_images/datasets_4_1.png

U.S. Airline Dataset

This dataset contains information on flights within the United States between 1988 and 2018. The original data can be downloaded from United States Department of Transportation.

One can also stream it from S3:

import vaex
df = vaex.open('s3://vaex/airline/us_airline_data_1988_2018.hdf5?anon=true')
[3]:
df = vaex.open('/data/airline/us_airline_data_1988_2018.hd5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.head(5)
number of rows: 183,821,926
number of columns: 29
[3]:
# Year Month DayOfMonth DayOfWeekUniqueCarrier TailNum FlightNumOrigin Dest CRSDepTimeDepTime DepDelay TaxiOut TaxiIn CRSArrTimeArrTime ArrDelay CancelledCancellationCode Diverted CRSElapsedTimeActualElapsedTime AirTime DistanceCarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
0 1988 1 8 5PI None 930BGM ITH 15251532 7 -- -- 15451555 10 0None 0 2023 -- 32-- -- -- -- --
1 1988 1 9 6PI None 930BGM ITH 15251522 -3 -- -- 15451535 -10 0None 0 2013 -- 32-- -- -- -- --
2 1988 1 10 7PI None 930BGM ITH 15251522 -3 -- -- 15451534 -11 0None 0 2012 -- 32-- -- -- -- --
3 1988 1 11 1PI None 930BGM ITH 1525-- -- -- -- 1545-- -- 1None 0 20-- -- 32-- -- -- -- --
4 1988 1 12 2PI None 930BGM ITH 15251524 -1 -- -- 15451540 -5 0None 0 2016 -- 32-- -- -- -- --

Sloan Digital Sky Survey (SDSS)

The data is public and can be queried from the SDSS archive. The original query at SDSS archive was (although split in small parts):

SELECT ra, dec, g, r from PhotoObjAll WHERE type = 6 and  clean = 1 and r>=10.0 and r<23.5;
[4]:
df = vaex.open('/data/sdss/sdss-clean-stars-dered.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.healpix_plot(df.healpix9, show=True, f="log1p", healpix_max_level=9, healpix_level=9,
                healpix_input='galactic', healpix_output='galactic', rotation=(0,45)
               )
number of rows: 132,447,497
number of columns: 21
_images/datasets_8_1.png

Helmi & de Zeeuw 2000

Result of an N-body simulation of the accretion of 33 satellite galaxies into a Milky Way dark matter halo. * 3 million rows - 252MB

[5]:
df = vaex.datasets.helmi_de_zeeuw.fetch() # this will download it on the fly

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.plot([["x", "y"], ["Lz", "E"]], f="log", figsize=(12,5), show=True, limits='99.99%');
number of rows: 3,300,000
number of columns: 11
_images/datasets_10_1.png

Frequently Asked Questions

I have a massive CSV file which I can not fit all into memory at one time. How do I convert it to HDF5?

New in 4.14:

Backed by Apache Arrow, Vaex supports lazy reading of CSV files simply with:

df = vaex.open('./my_data/my_big_file.csv')

In this way you can work with artibrarily large CSV files, with the same API just as if you were working with HDF5, Apache Arrow or Apache Parquet files.

For performance reasons, we do recommend converting large CSV files either HDF5 or Apache Arrow format. This is simply done via:

df = vaex.open('./my_data/my_big_file.csv', convert='./my_data/my_big_file.hdf5')

One can also choose to convert a large CSV file to Apache Parquet in order to save disk space in the very same way.

Prior to 4.14:

df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)

When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk. All temporary will files are then concatenated into a single HDF5, and the temporary files deleted. The size of the individual chunks to be read can be specified via the chunk_size argument.

Note: this is still possible in the newer version of Vaex, but it is not the most performant approach.

For more information on importing and exporting data with Vaex, please refer to please refer to the I/O example.

Why can’t I open a HDF5 file that was exported from a pandas DataFrame using .to_hdf?

When one uses the pandas .to_hdf method, the output HDF5 file has a row based format. Vaex on the other hand expects column based HDF5 files. This allows for more efficient reading of data columns, which is much more commonly required for data science applications.

One can easily export a pandas DataFrame to a vaex friendly HDF5 file:

vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
vaex_df.export_hdf5('my_data.hdf5')

What is the optimal file format to use with vaex?

What is “optimal” may dependent on what one is trying to achieve. A quick summary would be:

vaex shines when the data is in a memory-mappable file format, namely HDF5, Apache Arrow, or FITS. We say a file can be memory mapped if it has the same structure in memory, as it has on disk. Although any file can be memory mapped, if it requires deserialisation there is no advantage to memory mapping.

In principle, HDF5 and Arrow should give the same performance. For files that would fit into memory the performance between the two is the same. For single files that are larger than available RAM, our tests show that HDF5 gives faster performance. What “faster” means will likely depend on your system, quantity and type of data. This performance difference may be caused by converting bit masks to byte masks, or by flattening chunked Arrow arrays. We expect that this performance difference will disappear in the future.

If your data is spread amongst multiple files that are concatenated on the fly, the performance between HDF5 and Arrow is expected to be the same. Our test show better performance when all the data is contained in a single file, compared to multiple file.

The Arrow file format allows seamless interoperability with other ecosystems. If your use-case requires sharing data with other ecosystems, e.g. Java, the Arrow file format is the way to.

vaex also supports Parquet. Parquet is compressed, therefore memory mapping brings no advantage. There is always a performance penalty when using Parquet, since the data needs to be decompressed before it is used. Parquet however allows lazy reading of the data, which can be decompressed on the fly. Thus vaex can easily work with Parquet files that are larger than RAM. We recommend using Parquet when one wants to save disk space. It can be also convenient when reading from slow i/o sources, like spinning hard-drives or Cloud storage for example. Note that by using df.materialize one can get the same performance as HDF5 or Arrow files at the cost of memory or disk space.

Technically vaex can use data from CSV and JSON sources, but then the data is put in memory and the usage is not optimal. We warmly recommend that these and any other data source be converted to either HDF5, Arrow or Parquet file format, depending on your use-case or preference.

Why can’t I add a new column after filtering a vaex DataFrame?

Unlike other libraries, vaex does not copy or modify the data. After a filtering operations for example:

df2 = df[df.x > 5]

df2 still contains all of the data present in df however. The difference is that the columns of df2 are lazily indexed, and only the rows for which the filtering condition is satisfied are displayed or used. This means that in principle one can turn filters on/off as needed.

To be able to manually add a new column to the filtered df2 DataFrame, one needs to use the df2.extract() method first. This will drop the lazy indexing, making the length of df2 equal to its filtered length.

Here is a short example:

[1]:
import vaex
import numpy as np

df = vaex.from_dict({'id': np.array([1, 2, 3, 4]),
                     'name': np.array(['Sally', 'Tom', 'Maria', 'John'])
                    })

df2 = df[df.id > 2]
df2 = df2.extract()

df2['age'] = np.array([27, 29])
df2
[1]:
# idname age
0 3Maria 27
1 4John 29

What is Vaex?

Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (\(10^9\)) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).

Why vaex

  • Performance: works with huge tabular data, processes \(\gt 10^9\) rows/second

  • Lazy / Virtual columns: compute on the fly, without wasting ram

  • Memory efficient no memory copies when doing filtering/selections/subsets.

  • Visualization: directly supported, a one-liner is often enough.

  • User friendly API: you will only need to deal with the DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.

  • Lean: separated into multiple packages

    • vaex-core: DataFrame and core algorithms, takes numpy arrays as input columns.

    • vaex-hdf5: Provides memory mapped numpy arrays to a DataFrame.

    • vaex-arrow: Arrow support for cross language data sharing.

    • vaex-viz: Visualization based on matplotlib.

    • vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.

    • vaex-astro: Astronomy related transformations and FITS file support.

    • vaex-server: Provides a server to access a DataFrame remotely.

    • vaex-distributed: (Deprecated) Now part of vaex-enterprise.

    • vaex-qt: Program written using Qt GUI.

    • vaex: Meta package that installs all of the above.

    • vaex-ml: Machine learning

  • Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab.

Installation

Using conda:

  • conda install -c conda-forge vaex

Using pip:

  • pip install --upgrade vaex

Or read the detailed instructions

Getting started

We assume that you have installed vaex, and are running a Jupyter notebook server. We start by importing vaex and asking it to give us an example dataset.

[1]:
import vaex
df = vaex.example()  # open the example dataset provided with vaex

Instead, you can download some larger datasets, or read in your csv file.

[2]:
df  # will pretty print the DataFrame
[2]:
# x y z vx vy vz E L Lz FeH
0 -0.7774707672.10626292 1.93743467 53.276722 288.386047 -95.2649078-121238.171875 831.0799560546875 -336.426513671875 -2.309227609164518
1 3.77427316 2.23387194 3.76209331 252.810791 -69.9498444-56.3121033-100819.91406251435.1839599609375-828.7567749023438 -1.788735491591229
2 1.3757627 -6.3283844 2.63250017 96.276474 226.440201 -34.7527161-100559.96093751039.2989501953125920.802490234375 -0.7618109022478798
3 -7.06737804 1.31737781 -6.10543537 204.968842 -205.679016-58.9777031-70174.8515625 2441.724853515625 1183.5899658203125 -1.5208778422936413
4 0.243441463 -0.822781682-0.206593871-311.742371-238.41217 186.824127 -144138.75 374.8164367675781 -314.5353088378906 -2.655341358427361
... ... ... ... ... ... ... ... ... ... ...
329,9953.76883793 4.66251659 -4.42904139 107.432999 -2.1377129617.5130272 -119687.3203125746.8833618164062 -508.96484375 -1.6499842518381402
329,9969.17409325 -8.87091351 -8.61707687 32.0 108.089264 179.060638 -68933.8046875 2395.633056640625 1275.490234375 -1.4336036247720836
329,997-1.14041007 -8.4957695 2.25749826 8.46711349 -38.2765236-127.541473-112580.359375 1182.436279296875 115.58557891845703 -1.9306227597361942
329,998-14.2985935 -5.51750422 -8.65472317 110.221558 -31.392559186.2726822 -74862.90625 1324.59265136718751057.017333984375 -1.225019818838568
329,99910.5450506 -8.86106777 -4.65835428 -2.10541415-27.61088563.80799961 -95361.765625 351.0955505371094 -309.81439208984375-2.5689636894079477

Using square brackets[], we can easily filter or get different views on the DataFrame.

[3]:
df_negative = df[df.x < 0]  # easily filter your DataFrame, without making a copy
df_negative[:5][['x', 'y']]  # take the first five rows, and only the 'x' and 'y' column (no memory copy!)
[3]:
# x y
0 -0.777471 2.10626
1 -7.06738 1.31738
2 -5.17174 7.82915
3-15.9539 5.77126
4-12.3995 13.9182

When dealing with huge datasets, say a billion rows (\(10^9\)), computations with the data can waste memory, up to 8 GB for a new column. Instead, vaex uses lazy computation, storing only a representation of the computation, and computations are done on the fly when needed. You can just use many of the numpy functions, as if it was a normal array.

[4]:
import numpy as np
# creates an expression (nothing is computed)
some_expression = df.x + df.z
some_expression  # for convenience, we print out some values
[4]:
<vaex.expression.Expression(expressions='(x + z)')> instance at 0x118f71550 values=[1.159963903, 7.53636647, 4.00826287, -13.17281341, 0.036847591999999985 ... (total 330000 values) ... -0.66020346, 0.5570163800000003, 1.1170881900000003, -22.95331667, 5.8866963199999995]

These expressions can be added to a DataFrame, creating what we call a virtual column. These virtual columns are similar to normal columns, except they do not waste memory.

[5]:
df['r'] = some_expression  # add a (virtual) column that will be computed on the fly
df.mean(df.x), df.mean(df.r)  # calculate statistics on normal and virtual columns
[5]:
(-0.06713149126400597, -0.0501732470530304)

One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the binby argument (analogous to SQL’s grouby), and the shape and limits.

[6]:
df.mean(df.r, binby=df.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)
[6]:
array([-9.67777315, -8.99466731, -8.17042477, -7.57122871, -6.98273954,
       -6.28362848, -5.70005784, -5.14022306, -4.52820368, -3.96953423,
       -3.3362477 , -2.7801045 , -2.20162243, -1.57910621, -0.92856689,
       -0.35964342,  0.30367721,  0.85684123,  1.53564551,  2.1274488 ,
        2.69235585,  3.37746363,  4.04648274,  4.59580105,  5.20540601,
        5.73475069,  6.28384101,  6.67880226,  7.46059303,  8.13480148,
        8.90738265,  9.6117928 ])
[7]:
df.mean(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d
df.count(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram
[7]:
array([[22., 33., 37., ..., 58., 38., 45.],
       [37., 36., 47., ..., 52., 36., 53.],
       [34., 42., 47., ..., 59., 44., 56.],
       ...,
       [73., 73., 84., ..., 41., 40., 37.],
       [53., 58., 63., ..., 34., 35., 28.],
       [51., 32., 46., ..., 47., 33., 36.]])

These one and two dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience we can use heatmap, or see the other visualization commands

[8]:
df.viz.heatmap(df.x, df.y, show=True);  # make a plot quickly
_images/index_19_0.png

Continue

Continue the tutorial here or check the guides.