Example 3: Data Mining, Identifying Outliers
A third application, illustrated below, is feature extraction and data mining, specifically the identification of outliers in N-dimensional data.
Input file description:
DEPENDENT   = "y"
UNCERTAINTY = "0.01"
INDEPENDENT = "x"
RESOLUTION  = "0.00"
VARIABLES   = "x", "y"
ZONE T="Computations (method #1)"
0.000000e+00 5.462130e-01
1.000000e-01 6.110911e-01
2.000000e-01 6.755913e-01
3.000000e-01 7.395816e-01
4.000000e-01 8.029658e-01
5.000000e-01 8.656751e-01
6.000000e-01 9.276608e-01
7.000000e-01 9.888899e-01
8.000000e-01 0.900000e+00
9.000000e-01 1.109002e+00
1.000000e+00 1.167867e+00
1.100000e+00 1.225937e+00
1.200000e+00 1.283215e+00
1.300000e+00 1.339709e+00
1.400000e+00 1.395429e+00
1.500000e+00 1.450386e+00
1.600000e+00 1.504593e+00
1.700000e+00 1.558063e+00
1.800000e+00 1.610811e+00
1.900000e+00 1.662852e+00
2.000000e+00 1.714200e+00
2.100000e+00 1.764871e+00
2.200000e+00 1.814880e+00
2.300000e+00 1.864241e+00
2.400000e+00 1.912971e+00
2.500000e+00 1.961082e+00
2.600000e+00 2.008591e+00
2.700000e+00 2.055510e+00
2.800000e+00 2.101853e+00
2.900000e+00 2.147635e+00
3.000000e+00 2.192868e+00
3.100000e+00 2.237565e+00
3.200000e+00 2.281738e+00
3.300000e+00 2.325399e+00
3.400000e+00 2.368561e+00
3.500000e+00 2.440000e+00
3.600000e+00 2.490000e+00
3.700000e+00 2.520000e+00
3.800000e+00 2.536432e+00
3.900000e+00 2.577259e+00
4.000000e+00 2.617650e+00
4.100000e+00 2.657614e+00
4.200000e+00 2.697160e+00
4.300000e+00 2.736298e+00
4.400000e+00 2.775036e+00
4.500000e+00 2.813383e+00
4.600000e+00 2.851347e+00
4.700000e+00 2.888936e+00
4.800000e+00 2.926157e+00
4.900000e+00 2.963018e+00
5.000000e+00 2.999526e+00
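
The header lines name the dependent variable and its uncertainty, the independent variable and its resolution, and the column order of the zone data that follows. As a minimal sketch of how such a file might be read (the file name outlier_example.txt and this reader are illustrative assumptions, not part of the tool):

import numpy as np

def read_input(path):
    """Read the key = "value" header lines and the two-column x, y zone data."""
    header = {}
    points = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("ZONE"):
                continue                              # skip blanks and the zone title
            if "=" in line:
                key, _, value = line.partition("=")
                header[key.strip()] = value.strip().strip('"')
            else:
                points.append([float(v) for v in line.split()])
    data = np.asarray(points)
    return header, data[:, 0], data[:, 1]             # x column, y column

header, x, y = read_input("outlier_example.txt")
print(header["DEPENDENT"], header["UNCERTAINTY"])     # -> y 0.01
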
The example shown here is a simple two-dimensional illustration.
The global response surface is the solid line running through the data points.
In this case, a simple visual inspection of the data would suggest that the data point around x=0.9 might be suspicious and deserving of further investigation.
In a high-dimensional space, however, such visual identification is typically not possible.
Here, the controllable stiffness of the response surface can be used to automate the identification of suspicious regions of the data.
Regardless of the dimensionality of the parameter space (i.e., the number of independent variables), a histogram of the differences between the data points and the response surface predictions can be used to flag points that deviate markedly from the fit.
The following plot shows that the response surface is within +/- 0.02 of the actual data for the majority of the data points.
The outlier at x=0.9 is clearly visible, with a difference of -0.12 between the data point and the response surface.
The following plot shows the difference between the y data and the response surface fit at each data point.
Note that, in addition to the point at x=0.9, the points around x=3.6 also seem to warrant further investigation: even though their differences with respect to the response surface are not as large, they do not follow the trend exhibited by the neighboring data points.
The results will change with the stiffness used, but they serve to illustrate the potential of RS techniques for the identification of outliers in N-dimensional data.
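
The procedure can be sketched in code as follows. The response surface technique and stiffness control described in this document are not reproduced here; a SciPy smoothing spline is used as a stand-in, with its smoothing factor s playing the role of the stiffness, and the flagging rule (residuals larger than three times the median absolute residual) is an assumption made for this example rather than the tool's rule. The arrays x and y are those read in the sketch above.

import numpy as np
from scipy.interpolate import UnivariateSpline

def flag_outliers(x, y, stiffness=0.05, cutoff=3.0):
    """Fit a smooth curve through (x, y) and flag points whose residual
    is large compared with the bulk of the residuals."""
    fit = UnivariateSpline(x, y, s=stiffness)   # larger s -> stiffer (smoother) fit
    residuals = y - fit(x)                      # data minus surface prediction
    spread = np.median(np.abs(residuals - np.median(residuals)))
    suspicious = np.abs(residuals) > cutoff * max(spread, 1e-12)
    return residuals, suspicious

residuals, suspicious = flag_outliers(x, y)     # x, y from the reader sketch above
for xi, ri in zip(x[suspicious], residuals[suspicious]):
    print(f"suspicious point at x = {xi:.1f}, residual = {ri:+.3f}")

# The histogram of residuals discussed above; the binning here is arbitrary.
counts, edges = np.histogram(residuals, bins=20)

Because the test operates on the residuals alone, the same flagging step applies unchanged when the response surface is built over many independent variables.
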