NEAR, Inc.: Data Mining, Identifying Outliers

Analytical Mechanics Associates
Nielsen Engineering
& Research Division

(408) 727-9465

info@ama-inc.com
Software Inquiries:
softwaresales@ama-inc.com

Example 3: Data Mining, Identifying Outliers

A third application, illustrated below, is feature extraction and data mining: specifically, the identification of outliers in N-dimensional data.

Input file description:

DEPENDENT = "y"
UNCERTAINTY = "0.01"
INDEPENDENT = "x"
RESOLUTION = "0.00"

VARIABLES = "x", "y"
ZONE T="Computations (method #1)"
0.000000e+00 5.462130e-01
1.000000e-01 6.110911e-01
2.000000e-01 6.755913e-01
3.000000e-01 7.395816e-01
4.000000e-01 8.029658e-01
5.000000e-01 8.656751e-01
6.000000e-01 9.276608e-01
7.000000e-01 9.888899e-01
8.000000e-01 0.900000e+00
9.000000e-01 1.109002e+00
1.000000e+00 1.167867e+00
1.100000e+00 1.225937e+00
1.200000e+00 1.283215e+00
1.300000e+00 1.339709e+00
1.400000e+00 1.395429e+00
1.500000e+00 1.450386e+00
1.600000e+00 1.504593e+00
1.700000e+00 1.558063e+00
1.800000e+00 1.610811e+00
1.900000e+00 1.662852e+00
2.000000e+00 1.714200e+00
2.100000e+00 1.764871e+00
2.200000e+00 1.814880e+00
2.300000e+00 1.864241e+00
2.400000e+00 1.912971e+00
2.500000e+00 1.961082e+00
2.600000e+00 2.008591e+00
2.700000e+00 2.055510e+00
2.800000e+00 2.101853e+00
2.900000e+00 2.147635e+00
3.000000e+00 2.192868e+00
3.100000e+00 2.237565e+00
3.200000e+00 2.281738e+00
3.300000e+00 2.325399e+00
3.400000e+00 2.368561e+00
3.500000e+00 2.440000e+00
3.600000e+00 2.490000e+00
3.700000e+00 2.520000e+00
3.800000e+00 2.536432e+00
3.900000e+00 2.577259e+00
4.000000e+00 2.617650e+00
4.100000e+00 2.657614e+00
4.200000e+00 2.697160e+00
4.300000e+00 2.736298e+00
4.400000e+00 2.775036e+00
4.500000e+00 2.813383e+00
4.600000e+00 2.851347e+00
4.700000e+00 2.888936e+00
4.800000e+00 2.926157e+00
4.900000e+00 2.963018e+00
5.000000e+00 2.999526e+00
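
The listing above has a simple layout: `KEY = "value"` settings, a `VARIABLES` line, a `ZONE` title, and whitespace-separated numeric rows. As a minimal sketch of how such a file could be read, the Python below splits it into those parts. The function name `parse_input` and the returned structure are hypothetical, and the meaning of settings such as `UNCERTAINTY` and `RESOLUTION` is not documented on this page, so the sketch stores them as plain strings without interpreting them.

```python
import re

def parse_input(text):
    """Split the listing into settings, variable names, zone title, and data rows."""
    settings, variables, zone, rows = {}, [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("VARIABLES"):
            variables = re.findall(r'"([^"]*)"', line)
        elif line.startswith("ZONE"):
            match = re.search(r'T\s*=\s*"([^"]*)"', line)
            zone = match.group(1) if match else None
        elif "=" in line:
            # Generic KEY = "value" setting; kept as an uninterpreted string.
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip().strip('"')
        else:
            rows.append(tuple(float(token) for token in line.split()))
    return settings, variables, zone, rows

# Header plus three rows copied from the listing above.
SAMPLE = '''DEPENDENT = "y"
UNCERTAINTY = "0.01"
INDEPENDENT = "x"
RESOLUTION = "0.00"

VARIABLES = "x", "y"
ZONE T="Computations (method #1)"
7.000000e-01 9.888899e-01
8.000000e-01 0.900000e+00
9.000000e-01 1.109002e+00
'''

settings, variables, zone, rows = parse_input(SAMPLE)
print(settings, variables, zone, rows)
```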

The example shown here is a simple two-dimensional illustration. The global response surface is the solid line running through the data points. Here, a simple visual inspection suggests that the data point at x=0.8 (y=0.9 in the input file above) is suspicious and deserves further investigation. In a high-dimensional space, however, such visual identification is typically not possible. There, the controllable stiffness of the response surface can be used to automate the identification of suspicious areas of the data.
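
To make the role of stiffness concrete, here is a rough sketch in Python. It does not reproduce NEAR's response surface method, which this page does not specify. It substitutes a Gaussian-weighted local linear fit, with the bandwidth `h` standing in as a hypothetical stiffness control (larger `h`, stiffer curve). The data are the first 17 rows of the input file above.

```python
import math

# Rows x = 0.0 .. 1.6 of the input file above (x step 0.1); index 8 is x = 0.8.
Y = [0.546213, 0.6110911, 0.6755913, 0.7395816, 0.8029658, 0.8656751,
     0.9276608, 0.9888899, 0.9, 1.109002, 1.167867, 1.225937,
     1.283215, 1.339709, 1.395429, 1.450386, 1.504593]
X = [i / 10 for i in range(len(Y))]

def local_linear(xs, ys, x0, h):
    """Gaussian-weighted straight-line fit evaluated at x0 (stand-in surface)."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def residual_at(index, h):
    """Data value minus the smoothed curve at one row, for bandwidth h."""
    return Y[index] - local_linear(X, Y, X[index], h)

for h in (0.1, 0.5):  # hypothetical settings: soft, then stiff
    print(f"h={h}: residual at x={X[8]} is {residual_at(8, h):+.4f}")
```

With this stand-in, the stiffer setting follows the suspicious value less closely, so the gap between the data and the curve at x = 0.8 is larger than with the softer setting.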

[Figure: data points and global response surface line]

Regardless of the dimensionality of the parameter space (i.e., the number of independent variables), a histogram of the differences between the data points and the response surface prediction can be used to find outliers. The following plot shows that the response surface is within +/- 0.2 of the actual data for a majority of the data points. The outlier at x=0.8 is clearly visible, with a difference of -0.12 between the data point and the response surface.
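
The residual histogram can be sketched in a few lines of Python. The block below uses a Gaussian-weighted local linear fit as a stand-in for the response surface (an assumption, not NEAR's method), bins the differences between data and fit, and flags points with a median absolute deviation (MAD) rule. The bandwidth `H`, the bin width, and the cutoff `k` are hypothetical choices. Only the residuals enter `histogram` and `flag_outliers`, so those two functions would apply unchanged to any number of independent variables.

```python
import math
from statistics import median

# y column of the input file above; x runs from 0.0 to 5.0 in steps of 0.1.
Y = [0.546213, 0.6110911, 0.6755913, 0.7395816, 0.8029658, 0.8656751,
     0.9276608, 0.9888899, 0.9, 1.109002, 1.167867, 1.225937,
     1.283215, 1.339709, 1.395429, 1.450386, 1.504593, 1.558063,
     1.610811, 1.662852, 1.7142, 1.764871, 1.81488, 1.864241,
     1.912971, 1.961082, 2.008591, 2.05551, 2.101853, 2.147635,
     2.192868, 2.237565, 2.281738, 2.325399, 2.368561, 2.44,
     2.49, 2.52, 2.536432, 2.577259, 2.61765, 2.657614,
     2.69716, 2.736298, 2.775036, 2.813383, 2.851347, 2.888936,
     2.926157, 2.963018, 2.999526]
X = [i / 10 for i in range(len(Y))]

def local_linear(xs, ys, x0, h):
    """Gaussian-weighted straight-line fit evaluated at x0 (stand-in surface)."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx if sxx > 0 else 0.0) * (x0 - mx)

def histogram(values, width):
    """Count values per bin of the given width, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        edge = round(math.floor(v / width) * width, 10)
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))

def flag_outliers(values, k=3.5):
    """Indices whose distance from the median exceeds k scaled MADs."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # puts the MAD on the scale of a normal standard deviation
    return [i for i, v in enumerate(values) if abs(v - med) > k * scale]

H = 0.3  # hypothetical stiffness setting
residuals = [y - local_linear(X, Y, x, H) for x, y in zip(X, Y)]
for edge, count in histogram(residuals, 0.02).items():
    print(f"{edge:+.2f} {'#' * count}")
flagged = flag_outliers(residuals)
print("flagged x values:", [X[i] for i in flagged])
```

Which points are flagged depends on `H` and `k`, so the list is a starting point for investigation, not a verdict.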

[Figure: histogram of differences between data points and response surface]

The following plot shows the difference between the y variable and the response surface fit. Note that the point at x=0.9 and the points around x=3.6 seem to warrant further investigation: even though their differences from the response surface are not as large as those of other points, they do not follow the trend of the data points around them. The results will change as a function of the stiffness used, but they illustrate the potential of response surface (RS) techniques for identifying N-dimensional outliers.

[Figure: difference between y and the response surface fit]
