Outlier detection¶
Accounting for “outliers” can be an integral part in any kind of data analysis, yet, it remains basically unclear what such an outlier actually is. Citing Hawkins “an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”.
Below, we give algorithms to detect outliers based on the Generalized Extreme Studentized Deviate (ESD) test
and polynomial fits
The generalized ESD test¶
- PyAstronomy.pyasl.generalizedESD(x, maxOLs, alpha=0.05, fullOutput=False, ubvar=False)¶
Carry out a Generalized ESD Test for Outliers.
The Generalized Extreme Studentized Deviate (ESD) test for outliers can be used to search for outliers in a univariate data set, which approximately follows a normal distribution. A description of the algorithm is, e.g., given at Nist or [Rosner1983].
- Parameters
- maxOLsint
Maximum number of outliers in the data set.
- alphafloat, optional
Significance (default is 0.05).
- fullOutputboolean, optional
Determines whether additional return values are provided. Default is False.
- ubvarboolean, optional
If True, an unbiased estimate for the variance will be used in the calculation; this provides compatibility with the R implementation of NIST. If False (default), the maximum-likelihood variance estimate will be used.
- Returns
- Number of outliersint
The number of data points characterized as outliers by the test.
- Indiceslist of ints
The indices of the data points found to be outliers.
- Rlist of floats, optional
The values of the “R statistics”. Only provided if fullOutput is set to True.
- Llist of floats, optional
The lambda values needed to test whether a point should be regarded an outlier. Only provided if fullOutput is set to True.
Example¶
from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl
# Convert data given at:
# http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
# to array.
x = np.array([float(x) for x in "-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 \
1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 \
1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 \
2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93 3.21 3.26 3.30 3.59 \
3.68 4.30 4.64 5.34 5.42 6.01".split()])
# Apply the generalized ESD
r = pyasl.generalizedESD(x, 10, 0.05, fullOutput=True)
print("Number of outliers: ", r[0])
print("Indices of outliers: ", r[1])
print(" R Lambda")
for i in range(len(r[2])):
print("%2d %8.5f %8.5f" % ((i+1), r[2][i], r[3][i]))
# Plot the "data"
plt.plot(x, 'b.')
# and mark the outliers.
for i in range(r[0]):
plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()
Distance-based outlier detection (e.g., for spectra)¶
The generalized ESD test requires approximate normal distribution for the data points, which—for example in the case of a spectrum—can be a harsh limitation.
This function applies the generalized ESD test to the distances between adjacent data points, which are then requires to be distributed approximately normally. It will characterize a data point as an outlier, only if the distances to its right and left neighbor are abnormal as judged by the generalized ESD.
- PyAstronomy.pyasl.pointDistGESD(x, maxOLs, alpha=0.05, ubvar=False)¶
Search for outliers by comparing distance of adjacent points.
This method computes the “distance” between adjacent points, e.g., d = x[i+1] - x[i]. It then uses
PyAstronomy.pyasl.generalizedESD()
to search for outliers in the list of distances. A point will be characterized as being an outlier, if (and only if) the distance to its left and right neighbor are abnormal.- Parameters
- xarray
The data set to be searched for outliers.
- maxOLsint
The number of outliers. Note that the number passed to generalizedESD is actually 2*`maxOLs`.
- alphafloat, optional
The significance level to be used in applying generalizedESD. The default is 0.05.
- ubvarboolean, optional
If True, an unbiased estimate for the variance will be used in the calculation; this provides compatibility with the R implementation of NIST. If False (default), the maximum-likelihood variance estimate will be used.
- Returns
- nint
The number of outliers detected.
- Indiceslist of ints
The indices of the points found to be outliers.
Example¶
from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl
# Get some data
x = np.random.normal(0., 0.1, 50)
# Introduce outliers
x[27] = 1.0
x[43] = -0.66
# Run distance based outlier detection
r = pyasl.pointDistGESD(x, 5)
print("Number of outliers detected: ", r[0])
print("Indices of these outliers: ", r[1])
plt.plot(x, 'b.')
for i in range(len(r[1])):
plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()
Outlier detection based on polynomial fit¶
The algorithm implemented here is based on a polynomial fit to the data. After the fit is subtracted, the residuals are calculated. Based on their standard deviation, points with residuals deviating by more than the specified number of standard deviations from the fit are identified. Implementations with a single polynomial fit and a sliding window are available.
- PyAstronomy.pyasl.polyResOutlier(x, y, deg=0, stdlim=3.0, controlPlot=False, fullOutput=False, mode='both')¶
Simple outlier detection based on residuals.
This algorithm fits a polynomial of the specified degree to the data, subtracts it to find the residuals, determines the standard deviations of the residuals, and, finally, identifies all points with residuals further than the specified number of standard deviations from the fit.
- Parameters
- x, yarrays
The abscissa and ordinate of the data.
- degint, optional
The degree of the polynomial to be fitted. The default is 0, i.e., a constant.
- stdlimfloat, optional
The number of standard deviations acceptable for points not categorized as outliers.
- modestring, {both, above, below}
If ‘both’ (default), outliers may be located on both sides of the polynomial. If ‘above/below’, outliers are only expected above/below it.
- controlPlotboolean, optional
If True, a control plot will be generated showing the location of outliers (default is False).
- fullOutputboolean, optional
If True, the fitted polynomial and the resulting model will be returned.
- Returns
- indiinarray
The indices of the points not being categorized as outliers.
- indioutarray
Indices of the oulier points.
- parray, optional
Coefficients of the fitted polynomial (only returned if fullOutput is True).
- modelarray, optional
The polynomial model (only returned if fullOutput is True).
- PyAstronomy.pyasl.slidingPolyResOutlier(x, y, points, count=1, deg=0, stdlim=3.0, controlPlot=False, dx=1, mode='both')¶
Outlier detection based on polynomial fit in sliding box.
This algorithm fits a polynomial of the specified degree to a sliding chunk of the data, subtracts it to find the residuals, determines the standard deviations of the residuals, and, finally, identifies all points with residuals further than the specified number of standard deviations from the fit.
The length of the chunk is determined by points. In each step, the chunk is advanced by dx data points (default is one). To be finally maked as an outlier, a point must be detected as an outlier in at least count instances, when the chunk slides over it. By default, a single such detection is sufficient to establish its outlier status.
- Parameters
- x, yarrays
The abscissa and ordinate of the data.
- pointsint
Number of points for the sliding box
- countint, optional
Number of “slides” in which the point shall deviate from the fit by the stdlim
- degint, optional
The degree of the polynomial to be fitted. The default is 0, i.e., a constant.
- stdlimfloat, optional
The number of standard deviations acceptable for points not categorized as outliers.
- modestring, {both, above, below}
If ‘both’ (default), outliers may be located on both sides of the polynomial. If ‘above/below’, outliers are only expected above/below it.
- controlPlotboolean, optional
If True, a control plot will be generated showing the location of outliers (default is False).
- dxint, optional
The number of data points by which the chunk is advanced in each step.
- Returns
- indiinarray
The indices of the points not categorized as outliers.
- indioutarray
Indices of the oulier points.
Example: polyResOutlier¶
The polyResOutlier method relies on a single polynomial fit.
from __future__ import print_function, division
from PyAstronomy import pyasl
import numpy as np
import matplotlib.pylab as plt
# Generate some "data"
x = np.arange(100)
y = np.random.normal(x*0.067, 1.0, len(x))
# Introduce an outliers
y[14] = -5.0
y[67] = +9.8
# Find outliers based on a linear (deg = 1) fit.
# Assign outlier status to all points deviating by
# more than 3.0 standard deviations from the fit,
# and show a control plot.
iin, iout = pyasl.polyResOutlier(x, y, deg=1, stdlim=3.0, controlPlot=True)
# What about the outliers
print("Number of outliers: ", len(iout))
print("Indices of outliers: ", iout)
# Remove outliers
xnew, ynew = x[iin], y[iin]
# Plot result (outlier in red)
plt.plot(x, y, 'r.')
plt.plot(xnew, ynew, 'bp')
plt.show()
Example: slidingPolyResOutlier¶
The slidingPolyResOutlier method usus polynomial fits in a sliding window to detect outliers.
from __future__ import print_function, division
from PyAstronomy import pyasl
import numpy as np
import matplotlib.pylab as plt
# Generate some "data"
x = np.arange(100)
y = np.random.normal(x*0.067 + 0.01*x**2, 1.0, len(x))
# Introduce an outliers
y[14] = -5.0
y[67] = +9.8
# Find outliers based on a linear (deg = 1) fit.
# Assign outlier status to all points deviating by
# more than 3.0 standard deviations from the fit,
# and show a control plot.
iin, iout = pyasl.slidingPolyResOutlier(x, y, 20, deg=1, stdlim=3.0, controlPlot=True)
# What about the outliers
print("Number of outliers: ", len(iout))
print("Indices of outliers: ", iout)
# Remove outliers
xnew, ynew = x[iin], y[iin]
# Plot result (outlier in red)
plt.plot(x, y, 'r.')
plt.plot(xnew, ynew, 'bp')
plt.show()
References¶
- Rosner83
Rosner, Bernard (May 1983), Percentage Points for a Generalized ESD Many-Outlier Procedure,Technometrics, 25(2), pp. 165-172