NEAR, Inc.: Aspect Technical Background
Data and Rule Representation
Relevant Variable Selection
Return to Overview
Association rules of the form A => B are postulated from the variables in the data, and the strength of each rule is measured by its confidence (how frequently an instance of A also contains an instance of B) and support (how frequently A and B occur together in the data set).
Other measurements of a rule's strength, such as the lift (a measure of correlation), are occasionally used (Han and Kamber, 2001).
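The three measures can be illustrated with a minimal Python sketch. It assumes each sampled location is stored as a set of item strings (a hypothetical encoding, not Aspect's internal format) and uses the standard definition of lift as the confidence divided by the overall frequency of B.

```python
def rule_strength(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent.

    transactions: list of sets of items (hypothetical encoding of sampled locations).
    """
    n = len(transactions)
    a = set(antecedent)
    b = set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_b = sum(1 for t in transactions if b <= t)
    count_ab = sum(1 for t in transactions if (a | b) <= t)

    support = count_ab / n                                # A and B together
    confidence = count_ab / count_a if count_a else 0.0   # B given A
    lift = confidence / (count_b / n) if count_b else 0.0
    return support, confidence, lift


# Illustrative toy samples only, not real data.
samples = [
    {"road:paved", "slope:<15%"},
    {"road:paved", "slope:<15%"},
    {"road:paved", "slope:>=15%"},
    {"road:gravel", "slope:<15%"},
    {"road:gravel", "slope:>=15%"},
]
support, confidence, lift = rule_strength(samples, {"road:paved"}, {"slope:<15%"})
```

A lift above 1 suggests that A and B occur together more often than they would if they were independent.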
Minimum thresholds for the support and confidence are specified by the user for a particular problem and may be modified as the analysis progresses; in particular, a trade-off is often needed between the threshold values and the resulting number of useful rules (Scheffer, 2001).
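The effect of the thresholds on the number of surviving rules can be sketched as follows. The rule names and strength values are made up for illustration, and `filter_rules` is a hypothetical helper, not part of Aspect.

```python
def filter_rules(rules, min_support, min_confidence):
    """Keep only the rules that meet both user-specified thresholds."""
    return [
        (name, support, confidence)
        for name, support, confidence in rules
        if support >= min_support and confidence >= min_confidence
    ]


# (rule, support, confidence): illustrative values only.
rules = [
    ("A => B", 0.40, 0.90),
    ("A => C", 0.10, 0.95),
    ("B => C", 0.25, 0.60),
    ("C => D", 0.05, 0.50),
]

strict = filter_rules(rules, min_support=0.20, min_confidence=0.80)
loose = filter_rules(rules, min_support=0.05, min_confidence=0.50)
```

Here the strict thresholds keep one rule and the loose thresholds keep all four, which is the kind of trade-off the user adjusts during the analysis.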
Aspect handles standard geographic data formats (ESRI® shapefiles or feature classes) and converts their information into a format that can be used for the rule discovery analysis.
Using this intermediate data, Aspect formulates the geographic object and spatial predicates used as components of an association rule.
Techniques are also included for identifying and selecting relevant variables for inclusion in a rule.
These methods are described in more detail below.
Data and Rule Representation
Association rules deal with the relationships between specific conditions or variable instances rather than with variable trends; thus, they are appropriate for data that are nominal or categorical in nature.
A great many geographic data are categorical, for example, distinct features such as structures and water bodies, and descriptive variables such as vegetation types and soil types.
Other variables are numerical, such as elevation, average temperature, and slope.
In an association rule, numerical values are handled by binning them into categories.
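A minimal sketch of binning is shown below. The bin edges and labels for slope are assumptions chosen for illustration; the source does not state which bins Aspect uses.

```python
import bisect


def bin_value(value, edges, labels):
    """Map a numerical value to a category label.

    edges must be ascending, and labels must have len(edges) + 1 entries.
    A value equal to an edge falls into the bin above that edge.
    """
    return labels[bisect.bisect_right(edges, value)]


# Hypothetical slope bins, in percent.
SLOPE_EDGES = [5.0, 15.0]
SLOPE_LABELS = ["<5%", "5-15%", ">=15%"]

flat = bin_value(3.0, SLOPE_EDGES, SLOPE_LABELS)
moderate = bin_value(14.9, SLOPE_EDGES, SLOPE_LABELS)
steep = bin_value(15.0, SLOPE_EDGES, SLOPE_LABELS)
```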
Aspect focuses on processing vector data sets consisting of point, line, or polygon features.
In the current version, raster layers need to be converted to vector data before use.
Any relevant attributes that are associated with these vector features can be used by Aspect within an association rule.
These data are stored in a predicate format that represents the geographic objects and the attributes of interest.
Two main types of geographic object predicates are formed, one for distinct geographic features or categories, called "Feature_at," and one for continuous numerical values, called "Value_at."
Example predicates are
- Feature_at(X, road(surface, paved))
- Value_at(Y, (elevation, 1000))
where X and Y are locations.
A spatial relationship predicate then relates X and Y (described below).
Attribute name/value pairs are included in the geographic object predicates as appropriate.
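One way to hold the two predicate types in code is sketched below. The class and field names are hypothetical; only the printed predicate syntax is taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureAt:
    """A distinct geographic feature, with one attribute name/value pair."""
    location: str
    feature: str
    attribute: str
    value: str

    def __str__(self):
        return f"Feature_at({self.location}, {self.feature}({self.attribute}, {self.value}))"


@dataclass(frozen=True)
class ValueAt:
    """A continuous numerical condition at a location."""
    location: str
    variable: str
    value: str

    def __str__(self):
        return f"Value_at({self.location}, ({self.variable}, {self.value}))"


road = FeatureAt("X", "road", "surface", "paved")
elevation = ValueAt("Y", "elevation", "1000")
```

Printing `road` and `elevation` reproduces the two example predicates in the text.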
A hierarchy of variables is constructed from the input data sets.
The top level consists of a feature name that describes the data layer, for example, "Roads," "Soil Type," or "Vegetation Type."
Multiple data layers can be input.
The second level consists of attribute names.
The third hierarchical level consists of attribute values for distinct features, or the numerical value of a continuous condition (such as slope or elevation).
As an example, for a data layer "Roads" with attribute name/value pairs "Surface Type:Paved," "Surface Type:Gravel," "Width:2 Lanes," and "Width:4 Lanes," the hierarchical levels would be (1) Roads; (2) Surface Type, Width; and (3) Paved, Gravel, 2 Lanes, 4 Lanes.
The user can use the hierarchies to determine the detail of the candidate rules or to limit the number of rules postulated and tested.
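The three-level hierarchy can be sketched as a nested dictionary built from (layer, attribute, value) records. The function names and the record format are assumptions made for this illustration.

```python
def build_hierarchy(records):
    """Build {layer: {attribute: [values]}} from (layer, attribute, value) records."""
    hierarchy = {}
    for layer, attribute, value in records:
        values = hierarchy.setdefault(layer, {}).setdefault(attribute, [])
        if value not in values:
            values.append(value)
    return hierarchy


def variables_at_level(hierarchy, level):
    """List the variables at hierarchical level 1 (layer), 2 (attribute), or 3 (value)."""
    if level == 1:
        return list(hierarchy)
    if level == 2:
        return [(layer, attr) for layer, attrs in hierarchy.items() for attr in attrs]
    if level == 3:
        return [
            (layer, attr, value)
            for layer, attrs in hierarchy.items()
            for attr, values in attrs.items()
            for value in values
        ]
    raise ValueError("level must be 1, 2, or 3")


records = [
    ("Roads", "Surface Type", "Paved"),
    ("Roads", "Surface Type", "Gravel"),
    ("Roads", "Width", "2 Lanes"),
    ("Roads", "Width", "4 Lanes"),
]
hierarchy = build_hierarchy(records)
level_two = variables_at_level(hierarchy, 2)
```

Choosing a coarser level yields fewer, more general candidate variables, while level 3 yields the most detailed ones.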
Spatial predicates describe the relationships between the locations X and Y of each geographic object.
The spatial relationships represent concepts such as overlap, adjacency, disjointness, distance, density, or orientation.
Example predicates are "Overlap(X, Y)," "Adjacent(X, Y)," and "Disjoint(X, Y)."
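As a simplified illustration, the three example predicates can be evaluated on axis-aligned bounding boxes. This is an assumption made to keep the sketch short: real point, line, and polygon geometries would need a GIS geometry library, and the source does not describe how Aspect evaluates these relationships.

```python
# Each box is (xmin, ymin, xmax, ymax).

def overlap(a, b):
    """True if the interiors of the two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def disjoint(a, b):
    """True if the boxes share no point, not even on their boundaries."""
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


def adjacent(a, b):
    """True if the boxes touch along an edge or at a corner without overlapping."""
    return not disjoint(a, b) and not overlap(a, b)


x = (0, 0, 2, 2)
y_overlapping = (1, 1, 3, 3)
y_touching = (2, 0, 4, 2)
y_far = (5, 5, 6, 6)
```

Under these definitions exactly one of the three predicates holds for any pair of boxes.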
Association rules are built from the geographic object predicates and the spatial predicates, for example:
- Feature_at(X, road(surface, paved)) => Value_at(Y, (slope, <15%)) ^ Overlap(X, Y)
To form a rule, the problem variables are classified as primary or secondary.
Candidate rules are formed based on a combination of one primary and one or more secondary variables.
These variables are chosen either manually by the user or automatically by the software.
A rule must contain the primary variable and one of the secondary variables as a minimum.
Additional secondary variables are added to the rule depending on the natural variable dependencies present in the data set (see next section).
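Candidate formation from one primary and one or more secondary variables can be sketched with a simple enumeration. This sketch lists every combination up to a chosen size; it does not model the dependency-based addition of secondary variables described above, and the function and parameter names are hypothetical.

```python
from itertools import combinations


def candidate_rules(primary, secondaries, max_secondary=2):
    """Pair the primary variable with every combination of 1..max_secondary secondaries."""
    candidates = []
    for k in range(1, max_secondary + 1):
        for combo in combinations(secondaries, k):
            candidates.append((primary, combo))
    return candidates


candidates = candidate_rules("road", ["slope", "soil", "vegetation"])
```

With three secondary variables and at most two per rule, this produces three single-secondary and three two-secondary candidates.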
The data layers are sampled at different locations to acquire the data used to compute the support and confidence of the rules.
Point and line features are sampled individually, and polygons are sampled using a frequency proportionate to size.
The sampling scheme is based on the primary variable features.
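A minimal sketch of such a sampling plan follows. It assumes that "sampled individually" means one sample per point or line feature and that polygons receive one sample per fixed unit of area, rounded up; the `area_per_sample` parameter and the feature tuples are hypothetical.

```python
import math


def samples_per_feature(features, area_per_sample):
    """Return {feature name: number of sample locations}.

    features: list of (name, kind, area), with kind in {"point", "line", "polygon"}.
    """
    counts = {}
    for name, kind, area in features:
        if kind in ("point", "line"):
            counts[name] = 1  # assumption: one sample per point or line feature
        else:
            # Polygons: sample count grows in proportion to area, with at least one.
            counts[name] = max(1, math.ceil(area / area_per_sample))
    return counts


# Illustrative features only; areas are in arbitrary units.
features = [
    ("well", "point", 0.0),
    ("road_1", "line", 0.0),
    ("lake", "polygon", 2.5),
    ("forest", "polygon", 10.0),
]
plan = samples_per_feature(features, area_per_sample=2.0)
```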
Relevant Variable Selection
Various methods are used to help identify and select the significant variables that should be included in an association rule.
These include the hierarchical groups, contingency table analysis, and the Apriori algorithm (Agrawal and Srikant, 1994).
A variable list including the hierarchies (described above) is shown to the user (Figure 1), allowing the user to select one or more as candidate primary variables and one or more as candidate secondary variables.
If the user does not wish to specify the exact variables as candidates, a hierarchical level can be chosen instead, and the program will automatically use all variables at that level for the primary or secondary candidates.
The user then has the option to use contingency table analysis or Apriori to further modify the candidate variable lists.
Figure 1. Aspect screen shot showing the selection of the primary and secondary variables.
Contingency table analysis provides a measure of correlation between two categorical variables.
The contingency table results show whether the variable trends have statistical significance (whereas the association rule looks at specific values of the variable only).
The frequency within each cell of the contingency table and its appropriately normalized value are equivalent to the support and confidence measures for a two-variable association rule consisting of the row and column value for that cell.
Contingency table analyses have limited usefulness since they miss any multivariate effects.
However, the results can be used to examine the two-dimensional trends and to guide the selection of variables to include in the association rule analysis.
If a secondary variable candidate is shown to have relatively low correlation with the primary variable, it may be eliminated from consideration.
Conversely, if one of the table cells shows high support and confidence for a two-value rule, an effort should be made to include those values in the postulated rules.
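The link between a contingency table and the two-variable rule measures can be sketched as follows. The chi-square statistic is used here as one common test of association between two categorical variables; the source does not say which statistic Aspect computes, and the counts are illustrative.

```python
from collections import Counter


def contingency_table(pairs):
    """Return cell, row, and column counts for (row value, column value) pairs."""
    cells = Counter(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    return cells, rows, cols


def chi_square(cells, rows, cols, n):
    """Pearson chi-square statistic for independence of the two variables."""
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    return stat


def cell_support_confidence(cells, rows, n, r, c):
    """Support and confidence of the two-value rule r => c, taken from one cell."""
    return cells[(r, c)] / n, cells[(r, c)] / rows[r]


# Illustrative samples: (road surface, slope class).
pairs = (
    [("paved", "low")] * 6 + [("paved", "steep")] * 2
    + [("gravel", "low")] * 2 + [("gravel", "steep")] * 6
)
n = len(pairs)
cells, rows, cols = contingency_table(pairs)
stat = chi_square(cells, rows, cols, n)
support, confidence = cell_support_confidence(cells, rows, n, "paved", "low")
```

The cell count divided by the total gives the support, and the same count divided by its row total gives the confidence of the rule from the row value to the column value.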
Once an idea of the individual correlations is obtained and some of the secondary variable candidates are eliminated, the Apriori algorithm can be used to find frequent combinations of the multiple variables that remain.
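A compact sketch of the Apriori idea is given below: frequent itemsets of size k are joined to form candidates of size k + 1, and a candidate is kept only if all of its size-k subsets are frequent and it meets the minimum support. This is a textbook-style illustration on toy data, not Aspect's implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Illustrative samples only.
transactions = [
    {"road:paved", "slope:low", "soil:loam"},
    {"road:paved", "slope:low", "soil:loam"},
    {"road:paved", "slope:low", "soil:clay"},
    {"road:gravel", "slope:steep", "soil:clay"},
    {"road:gravel", "slope:low", "soil:loam"},
]
frequent = apriori(transactions, min_support=0.4)
```

Each frequent itemset that contains the primary variable is a natural starting point for a multi-variable candidate rule.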
With Aspect, the user can iterate between the hierarchical variable list, the selected primary and secondary variable candidates, the contingency table analysis, and the Apriori algorithm until a final list of candidate association rules is determined.
Once this is accomplished, the user sets the confidence and support thresholds and the strength of each rule is computed and output.
- Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Academic Press, 2001, p. 261.
- Scheffer, T., "Finding association rules that trade support optimally against confidence," Lecture Notes in Computer Science, Vol. 2168, Springer, 2001, pp. 424-435.
- Agrawal, R. and Srikant, R., "Fast algorithms for mining association rules," Proc. 1994 Int. Conf. Very Large Data Bases, Santiago, Chile, pp. 487-499, Sept. 1994.