Handling missing or invalid data

Data in the real world is seldom clean and never perfect. It often happens that we end up with “missing” or “invalid” data. There are countless reasons for why data can be missing: an instrument failed to make a recording in the real world, there was a temporary or no connection between the instrument and the computer storing the readings, maybe our scraper failed to gather all of the data, or our tracking tool did not manage to record all events.. this list can go on and on.

In addition to this, during our analysis we can sometimes make a wrong turn and “corrupt” our data by dividing by zero, or taking the logarithm of a negative number. In addition, a sensor or a human may record invalid values that we want to highlight in a special way.

In Vaex we have 3 ways of representing these special values:

  • “missing” or “masked” values;

  • “not a number” or nan values;

  • “not available” or na values.

If you have used Vaex, you may have noticed some DataFrame methods, Expression methods, or method arguments referencing “missing”, “nan”, “na”. Here are some examples:

“missing”

“nan”

“na”

df.dropmissing

df.dropnan

df.dropna

df.x.countmissing

df.x.countnan

df.x.countna

df.x.ismissing

df.x.isnan

df.x.isna

df.x.fillmissing

df.x.fillnan

df.x.fillna

In what follows we will explain the difference between these 3 types of values, when they should be used, and why does Vaex makes the distinction between them.

“nan” vs “missing” vs “na”

Summary (TLDR;)

The following table summarizes the differences between missing values, nan values and na:

missing or masked values

Not a number (nan)

Not available (na)

dtype

Any dtype

Float

Any dtype but only truly relevant for float

Meaning

Total absence of data

Data is present, but is corrupted or can not be represented in numeric form (e.g. log(-5))

Union of missing and nan values

Use case

Sensor did not make a measurement

Sensor made a measurement but the data is corrupted, or mathematical transformation leads to an invalid / non-numerical values

It is up to the user to decide

Not a number or nan

Many data practitioners, perhaps erroneously, interchangeably use the term nan and the term missing values. In fact nan values are commonly used as sentinel values to generally indicate invalid data. This is inaccurate because nan values are in fact special float values. nan is a shorthand for “not a number”, which is meant to indicate a value that is not a number in a sequence of floats, and thus in itself is not missing. It is used to represent values that are undefined mathematically, such as 0/0 or log(-5), or for data that does exist but is corrupted or can not be represented in numerical form. Note that there is no such corresponding value for integers for example, or for non-numeric types such as string.

In Python one can use nan values via the math standard library (e.g.: math.nan) or via the numpy library (e.g.: numpy.nan).

So why are nan values synonymous with missing values? It is hard to tell. One guess is that data practitioners found using numpy.nan a convenient shortcut to representing an “missing” or invalid value in arrays. Numpy does have a proper way of indicating a missing values via masked arrays (more on that in the next section), but for many that API can be less convenient and requires an addition knowledge of how to handle those array types. This effect might have been more enhanced by Pandas, in which for a long time nan values were the only way to indicate both invalid/corrupted and truly missing data.

Missing or masked values

Perhaps a better way to mark the absence of data is via missing or masked values. Python itself has a special object to indicate missing or no data, and that is the None object, which has its own NoneType type. The None object in Python is equivalent to the NULL value in SQL.

Modern data analysis libraries also implement their own ways of indicating missing values. For arrays that have missing data, Numpy implements so-called “masked arrays”. When constructing the arrays, in addition to data one is also required to provide a boolean mask. A True value in the mask array, indicates that the corresponding element in the data array is missing. In the example below, the last element of the masked array is missing:

import numpy as np

data = [23, 31, 0]
mask = [False, False, True]

my_masked_array = np.ma.masked_array(data, mask)

Pyarrow also implements a null type to indicate missing values in their data structures. Unlike Numpy that uses bytemasks, Pyarrow uses bitmasks to indicate missing data which make it more memory efficient. Note that in Pyarrow, if the mask has a value of 1 it means that the data is present, while 0 indicates missing data. Similarly to Vaex, Pyarrow also makes the distinction between nan and null values.

In more recent versions, Pandas also implements a pd.NA value to indicate missing values, which can be used in arrays or Series various types and not just float.

In Vaex, missing data are null values if the underlying array is backed by Pyarrow, and masked values if the underlying array is a Numpy masked array.

When are missing values used in practice? They are used to indicate data that was not collected, i.e. a sensor was scheduled to make a reading but it did not, or a doctor was supposed to make scan of a patient but they did not.

To contrast with nan values: missing or masked values indicate a complete absence of data, while nan values indicate the presence of data that can not be interpreted numerically. This can be a subtle but sometimes an important distinction to make

Not available or na

Vaex also implements methods referring to na which stands for Not available”, and is a union of both nan and missing values. This only really matters when dealing with Expressions of float type, since that is the only type that can have both missing and nan values. Of course if you do not make the distinction between nan and missing values in your code, use can use methods that refer to na to encompass both cases and simplify development.

Examples

Let us consider the following DataFrame:

[1]:
import vaex
import numpy as np
import pyarrow as pa

x = np.ma.array(data=[1, 0, 3, 4, np.nan], mask=[False, True, False, False, False])
y = pa.array([10, 20, None, 40, 50])
z = pa.array(['Reggie Miller', 'Michael Jordan', None, None, 'Kobe Bryant'])
w = pa.array([
        {'city': 'Indianapolis', 'team': 'Pacers'},
        None,
        {'city': 'Dallas', 'team': 'Mavericks'},
        None,
        {'city': 'Los Angeles', 'team': 'Lakers'}
    ])

df = vaex.from_arrays(x=x, y=y, z=z, w=w)
df
[1]:
# x y z w
01.010 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}
1-- 20 Michael Jordan--
23.0-- -- {'city': 'Dallas', 'team': 'Mavericks'}
34.040 -- --
4nan50 Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}

The df contains a float column x which in turn contains both a missing (masked) value and a nan value. The columns y, z, and w which are of dtype int, string, and struct respectively can only contain masked values in addition to their nominal type.

For example, if we want to drop all rows with missing values from the entire DataFrame, we can use the dropmissing method:

[2]:
df.dropmissing()
[2]:
# x yz w
0 1 10Reggie Miller{'city': 'Indianapolis', 'team': 'Pacers'}
1nan 50Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}

We see that all missing (masked) values are dropped, but the nan value in column x is still present since it is not technically “missing”.

If we want drop all nan values from the DataFrame we can do so via the corresponding dropnan method:

[3]:
df.dropnan()
[3]:
# x y z w
01.010 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}
1-- 20 Michael Jordan--
23.0-- -- {'city': 'Dallas', 'team': 'Mavericks'}
34.040 -- --

Now we see that the nan value from the column x is no longer in the DataFrame, but all the other missing values are still there.

If we simply want to get rid of all values that are not available for us to use directly, we can use the dropna method:

[4]:
df.dropna()
[4]:
# x yz w
0 1 10Reggie Miller{'city': 'Indianapolis', 'team': 'Pacers'}

Now we see that only rows containing valid data entries remain.