Data Mining Testing Validation
Data Mining Testing Validation
Validation is the process of assessing how well your mining models perform against real data. It is important that you validate your mining models by understanding their quality and characteristics before you deploy them into a production environment.
This section introduces some basic concepts related to model quality, and describes the strategies for model validation that are provided in Microsoft Analysis Services. For an overview of how model validation fits into the larger data mining process, see Data Mining Solutions.
Methods for Testing and Validation of Data Mining Models
There are many approaches for assessing the quality and characteristics of a data mining model.
Definition of Criteria for Validating Data Mining Models
Measures of data mining generally fall into the categories of accuracy, reliability, and usefulness.
Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been provided. There are various measures of accuracy, but all measures of accuracy are dependent on the data that is used. In reality, values might be missing or approximate, or the data might have been changed by multiple processes. Particularly in the phase of exploration and development, you might decide to accept a certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a particular store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting method. Therefore, measurements of accuracy must be balanced by assessments of reliability.
Reliability assesses the way that a data mining model performs on different data sets. A data mining model is reliable if it generates the same type of predictions or finds the same general kinds of patterns regardless of the test data that is supplied. For example, the model that you generate for the store that used the wrong accounting method would not generalize well to other stores, and therefore would not be reliable.
Usefulness includes various metrics that tell you whether the model provides useful information. For example, a data mining model that correlates store location with sales might be both accurate and reliable, but might not be useful, because you cannot generalize that result by adding more stores at the same location. Moreover, it does not answer the fundamental business question of why certain locations have more sales. You might also find that a model that appears successful in fact is meaningless, because it is based on cross-correlations in the data.
Tools for Testing and Validation of Mining Models
Analysis Services supports multiple approaches to validation of data mining solutions, supporting all phases of the data mining test methodology.
These metrics do not aim to answer the question of whether the data mining model answers your business question; rather, these metrics provide objective measurements that you can use to assess the reliability of your data for predictive analytics, and to guide your decision of whether to use a particular iterate on the development process.
The topics in this section provide an overview of each method and walk you through the process of measuring the accuracy of models that you build using SQL Server Data Mining.Related Topics
Learn how to set up a testing data set using a wizard or DMX commands
Training and Testing Data Sets
Learn how to test the distribution and representativeness of the data in a mining structure
Cross-Validation (Analysis Services - Data Mining) Learn about the accuracy chart types provided. Lift Chart (Analysis Services - Data Mining)
Profit Chart (Analysis Services - Data Mining)
Scatter Plot (Analysis Services - Data Mining)
Learn how to create a classification matrix, sometimes called a confusion matrix, for assessing the number of true and false positives and negatives.
Classification Matrix (Analysis Services - Data Mining)
See Also Data Mining Tools Data Mining Solutions Testing and Validation Tasks and How-tos (Data Mining)