Outlier detection

Accounting for “outliers” can be an integral part in any kind of data analysis, yet, it remains basically unclear what such an outlier actually is. Citing Hawkins “an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”.

Below, we give algorithms to detect outliers based on the Generalized Extreme Studentized Deviate (ESD) test

and polynomial fits

The generalized ESD test

PyAstronomy.pyasl.generalizedESD(x, maxOLs, alpha=0.05, fullOutput=False, ubvar=False)

Carry out a Generalized ESD Test for Outliers.

The Generalized Extreme Studentized Deviate (ESD) test for outliers can be used to search for outliers in a univariate data set, which approximately follows a normal distribution. A description of the algorithm is, e.g., given at Nist or [Rosner1983].

Parameters
maxOLsint

Maximum number of outliers in the data set.

alphafloat, optional

Significance (default is 0.05).

fullOutputboolean, optional

Determines whether additional return values are provided. Default is False.

ubvarboolean, optional

If True, an unbiased estimate for the variance will be used in the calculation; this provides compatibility with the R implementation of NIST. If False (default), the maximum-likelihood variance estimate will be used.

Returns
Number of outliersint

The number of data points characterized as outliers by the test.

Indiceslist of ints

The indices of the data points found to be outliers.

Rlist of floats, optional

The values of the “R statistics”. Only provided if fullOutput is set to True.

Llist of floats, optional

The lambda values needed to test whether a point should be regarded an outlier. Only provided if fullOutput is set to True.

Example

from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl

# Convert data given at:
# http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
# to array.
x = np.array([float(x) for x in "-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 \
          1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 \
          1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 \
          2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93 3.21 3.26 3.30 3.59 \
          3.68 4.30 4.64 5.34 5.42 6.01".split()])

# Apply the generalized ESD
r = pyasl.generalizedESD(x, 10, 0.05, fullOutput=True)

print("Number of outliers: ", r[0])
print("Indices of outliers: ", r[1])
print("        R      Lambda")
for i in range(len(r[2])):
    print("%2d  %8.5f  %8.5f" % ((i+1), r[2][i], r[3][i]))

# Plot the "data"
plt.plot(x, 'b.')
# and mark the outliers.
for i in range(r[0]):
    plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()

Distance-based outlier detection (e.g., for spectra)

The generalized ESD test requires approximate normal distribution for the data points, which—for example in the case of a spectrum—can be a harsh limitation.

This function applies the generalized ESD test to the distances between adjacent data points, which are then requires to be distributed approximately normally. It will characterize a data point as an outlier, only if the distances to its right and left neighbor are abnormal as judged by the generalized ESD.

PyAstronomy.pyasl.pointDistGESD(x, maxOLs, alpha=0.05, ubvar=False)

Search for outliers by comparing distance of adjacent points.

This method computes the “distance” between adjacent points, e.g., d = x[i+1] - x[i]. It then uses PyAstronomy.pyasl.generalizedESD() to search for outliers in the list of distances. A point will be characterized as being an outlier, if (and only if) the distance to its left and right neighbor are abnormal.

Parameters
xarray

The data set to be searched for outliers.

maxOLsint

The number of outliers. Note that the number passed to generalizedESD is actually 2*`maxOLs`.

alphafloat, optional

The significance level to be used in applying generalizedESD. The default is 0.05.

ubvarboolean, optional

If True, an unbiased estimate for the variance will be used in the calculation; this provides compatibility with the R implementation of NIST. If False (default), the maximum-likelihood variance estimate will be used.

Returns
nint

The number of outliers detected.

Indiceslist of ints

The indices of the points found to be outliers.

Example

from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl

# Get some data
x = np.random.normal(0., 0.1, 50)

# Introduce outliers
x[27] = 1.0
x[43] = -0.66

# Run distance based outlier detection
r = pyasl.pointDistGESD(x, 5)

print("Number of outliers detected: ", r[0])
print("Indices of these outliers: ", r[1])

plt.plot(x, 'b.')
for i in range(len(r[1])):
    plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()

Outlier detection based on polynomial fit

The algorithm implemented here is based on a polynomial fit to the data. After the fit is subtracted, the residuals are calculated. Based on their standard deviation, points with residuals deviating by more than the specified number of standard deviations from the fit are identified. Implementations with a single polynomial fit and a sliding window are available.

PyAstronomy.pyasl.polyResOutlier(x, y, deg=0, stdlim=3.0, controlPlot=False, fullOutput=False, mode='both')

Simple outlier detection based on residuals.

This algorithm fits a polynomial of the specified degree to the data, subtracts it to find the residuals, determines the standard deviations of the residuals, and, finally, identifies all points with residuals further than the specified number of standard deviations from the fit.

Parameters
x, yarrays

The abscissa and ordinate of the data.

degint, optional

The degree of the polynomial to be fitted. The default is 0, i.e., a constant.

stdlimfloat, optional

The number of standard deviations acceptable for points not categorized as outliers.

modestring, {both, above, below}

If ‘both’ (default), outliers may be located on both sides of the polynomial. If ‘above/below’, outliers are only expected above/below it.

controlPlotboolean, optional

If True, a control plot will be generated showing the location of outliers (default is False).

fullOutputboolean, optional

If True, the fitted polynomial and the resulting model will be returned.

Returns
indiinarray

The indices of the points not being categorized as outliers.

indioutarray

Indices of the oulier points.

parray, optional

Coefficients of the fitted polynomial (only returned if fullOutput is True).

modelarray, optional

The polynomial model (only returned if fullOutput is True).

PyAstronomy.pyasl.slidingPolyResOutlier(x, y, points, count=1, deg=0, stdlim=3.0, controlPlot=False, dx=1, mode='both')

Outlier detection based on polynomial fit in sliding box.

This algorithm fits a polynomial of the specified degree to a sliding chunk of the data, subtracts it to find the residuals, determines the standard deviations of the residuals, and, finally, identifies all points with residuals further than the specified number of standard deviations from the fit.

The length of the chunk is determined by points. In each step, the chunk is advanced by dx data points (default is one). To be finally maked as an outlier, a point must be detected as an outlier in at least count instances, when the chunk slides over it. By default, a single such detection is sufficient to establish its outlier status.

Parameters
x, yarrays

The abscissa and ordinate of the data.

pointsint

Number of points for the sliding box

countint, optional

Number of “slides” in which the point shall deviate from the fit by the stdlim

degint, optional

The degree of the polynomial to be fitted. The default is 0, i.e., a constant.

stdlimfloat, optional

The number of standard deviations acceptable for points not categorized as outliers.

modestring, {both, above, below}

If ‘both’ (default), outliers may be located on both sides of the polynomial. If ‘above/below’, outliers are only expected above/below it.

controlPlotboolean, optional

If True, a control plot will be generated showing the location of outliers (default is False).

dxint, optional

The number of data points by which the chunk is advanced in each step.

Returns
indiinarray

The indices of the points not categorized as outliers.

indioutarray

Indices of the oulier points.

Example: polyResOutlier

The polyResOutlier method relies on a single polynomial fit.

from __future__ import print_function, division
from PyAstronomy import pyasl
import numpy as np
import matplotlib.pylab as plt

# Generate some "data"
x = np.arange(100)
y = np.random.normal(x*0.067, 1.0, len(x))

# Introduce an outliers
y[14] = -5.0
y[67] = +9.8

# Find outliers based on a linear (deg = 1) fit.
# Assign outlier status to all points deviating by
# more than 3.0 standard deviations from the fit,
# and show a control plot.
iin, iout = pyasl.polyResOutlier(x, y, deg=1, stdlim=3.0, controlPlot=True)

# What about the outliers
print("Number of outliers: ", len(iout))
print("Indices of outliers: ", iout)

# Remove outliers
xnew, ynew = x[iin], y[iin]

# Plot result (outlier in red)
plt.plot(x, y, 'r.')
plt.plot(xnew, ynew, 'bp')
plt.show()

Example: slidingPolyResOutlier

The slidingPolyResOutlier method usus polynomial fits in a sliding window to detect outliers.

from __future__ import print_function, division
from PyAstronomy import pyasl
import numpy as np
import matplotlib.pylab as plt

# Generate some "data"
x = np.arange(100)
y = np.random.normal(x*0.067 + 0.01*x**2, 1.0, len(x))

# Introduce an outliers
y[14] = -5.0
y[67] = +9.8

# Find outliers based on a linear (deg = 1) fit.
# Assign outlier status to all points deviating by
# more than 3.0 standard deviations from the fit,
# and show a control plot.
iin, iout = pyasl.slidingPolyResOutlier(x, y, 20, deg=1, stdlim=3.0, controlPlot=True)

# What about the outliers
print("Number of outliers: ", len(iout))
print("Indices of outliers: ", iout)

# Remove outliers
xnew, ynew = x[iin], y[iin]

# Plot result (outlier in red)
plt.plot(x, y, 'r.')
plt.plot(xnew, ynew, 'bp')
plt.show()

References

Rosner83

Rosner, Bernard (May 1983), Percentage Points for a Generalized ESD Many-Outlier Procedure,Technometrics, 25(2), pp. 165-172