Preparing data for mining
BIRT Analytics supports normalization, scaling, and remapping operations that prepare a data column to meet the conditions required by each data mining algorithm. Preprocessing applies a mathematical operation to the values in a data column. When the values in a column have a distribution that differs from a standard, or normal, distribution, preprocessing the column before applying a data mining algorithm can produce a more useful result. For example, you can compare data sets that have different scales and units by standardizing the data so that it falls in the 0 to 1 range. Similarly, test scores are often calibrated by percentile, with half of all scores falling between the 25th and 75th percentiles.
Figure 6-1 shows the distribution of raw, or non-standardized, data for age and income. Ages fall in the 19 to 93 range, while incomes fall in the 479.79 to 111571.4 range. To compare these distributions, you must standardize the data.
Figure 6-1  
Standardizing data in a column
There are four ways to standardize the data in a column: normalization, linear scaling, logistic scaling, and Softmax scaling.
In each case, a new column is created to contain the standardized data.
Understanding normalization
Normalization rescales the values in a column so that each value in the new column can be compared to values in a standard, normal distribution. The operation calculates the mean and the standard deviation of all values in the column, subtracts the mean from each value, then divides the difference by the standard deviation. The formula is:
y = (x - mean{x1, ..., xN}) / stdv{x1, ..., xN}
Standard deviation shows how much variation there is from the average (mean), or expected value. A low standard deviation indicates that the data points tend to be very close to the mean. A high standard deviation indicates that the data points are spread out over a large range of values.
Figure 6-2 shows normalized data for age and income. The values on the horizontal axis represent the number of standard deviations from the mean. After normalization, the mean is 0 and the standard deviation is 1.
Figure 6-2  
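The normalization formula above can be sketched in a few lines of Python. This is an illustration only, not the BIRT Analytics implementation. The function name normalize and the sample ages are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, because the formula does not say which one applies.

```python
from statistics import mean, pstdev

def normalize(values):
    """Return (x - mean) / standard deviation for each value."""
    mu = mean(values)
    # Assumption: population standard deviation; the source formula
    # does not state whether population or sample is intended.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; standard deviation is 0")
    return [(x - mu) / sigma for x in values]

# Hypothetical ages inside the 19 to 93 range mentioned earlier.
ages = [19, 35, 42, 58, 93]
normalized_ages = normalize(ages)
```

Each result is the number of standard deviations that the original value lies from the mean, so the new column has a mean of 0 and a standard deviation of 1.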
Understanding linear scaling
Standardization by linear scaling is useful when values in a column have the following characteristics:
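The more tightly the data are clustered, the better the result obtained from a linear regression study. Linear scaling uses the minimum and maximum values to map the column to a range suitable for linear regression analysis. Linear scaling supports two options: scaling with the original minimum and maximum values, as shown in Figure 6-3, and scaling with a stretch to the minimum and maximum values, as shown in Figure 6-4.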
The formula is:
y = (x - min{x1, ..., xN}) / (max{x1, ..., xN} - min{x1, ..., xN})
Figure 6-3 shows linear scaling with the original minimum and maximum values for age and income.
Figure 6-3  
Figure 6-4 shows linear scaling with a stretch to the minimum and maximum values for age and income.
Figure 6-4  
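As a rough illustration of the linear scaling formula, the following Python sketch maps a column onto the 0 to 1 range using the observed minimum and maximum. The function name linear_scale and the sample incomes are hypothetical, and the sketch does not attempt to reproduce the stretch option shown in Figure 6-4.

```python
def linear_scale(values):
    """Apply y = (x - min) / (max - min) to every value."""
    lo = min(values)
    hi = max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is 0")
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical incomes; the end points match the range quoted earlier.
incomes = [479.79, 25000.0, 60000.0, 111571.4]
scaled_incomes = linear_scale(incomes)
```

The smallest value maps to 0, the largest maps to 1, and every other value keeps its relative position between them.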
Understanding logistic scaling
Standardization by logistic scaling is used to recode the variable of study for use in a logistic regression. Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable (a dependent variable that can take on a limited number of values) based on one or more predictor variables. The equation used is:
P(n) = 1 / (1 + e^(-n))
where n represents a value in the column. The equation maps each value to a number between 0 and 1, which is the form used by a logistic model.
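The logistic equation can be sketched in Python as follows. This is a minimal illustration, not the product's code. The function name logistic and the sample inputs are hypothetical. The two branches are algebraically equal to P(n) = 1 / (1 + e^(-n)); the split only avoids a floating-point overflow for large negative inputs.

```python
import math

def logistic(n):
    """Return 1 / (1 + e^(-n)) without overflowing for large |n|."""
    if n >= 0:
        return 1.0 / (1.0 + math.exp(-n))
    # For negative n, use the equivalent form e^n / (1 + e^n).
    e = math.exp(n)
    return e / (1.0 + e)

# Hypothetical column values.
column = [-4.0, -1.0, 0.0, 1.0, 4.0]
recoded = [logistic(n) for n in column]
```

An input of 0 maps to 0.5, large negative inputs approach 0, and large positive inputs approach 1.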
Understanding Softmax scaling
Softmax scaling standardization is a nonlinear transformation that reduces data ranges for the values in a column as much as possible. The objective is to achieve the minimum and maximum values asymptotically. In other words, the low-end and high-end values gradually approach the minimum and maximum values without ever reaching them.
If you choose Softmax scaling, you can set the confidence level to 68%, 95%, or 99%. The lower the confidence level is, the shorter the intervals and the greater the probability of error will be. The formula is:
x’ = (x - E(x)) / (λ(σx/2π))
E(x) is the mean of the study variable.
λ is the confidence level.
σx is the standard deviation of the study variable.
Figure 6-5 shows Softmax scaling at 68% for age and income.
Figure 6-5  
Figure 6-6 shows Softmax scaling at 95% for age and income.
Figure 6-6  
Figure 6-7 shows Softmax scaling at 99% for age and income.
Figure 6-7  
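The following Python sketch illustrates one common formulation of Softmax scaling, under several stated assumptions. It is not taken from BIRT Analytics. First, it reads the formula above as x’ = (x - E(x)) / (λ(σx/2π)). Second, it assumes that the transformed value is then passed through the logistic function 1 / (1 + e^(-x’)), which produces the asymptotic approach to 0 and 1 described above. Third, it assumes that the 68%, 95%, and 99% confidence levels correspond to λ values of 1, 2, and 3 standard deviations. The names softmax_scale and LAMBDA_BY_CONFIDENCE are hypothetical.

```python
import math
from statistics import mean, pstdev

# Assumption: confidence level expressed as a number of standard deviations.
LAMBDA_BY_CONFIDENCE = {68: 1.0, 95: 2.0, 99: 3.0}

def softmax_scale(values, confidence=95):
    """Scale values into the open interval (0, 1) around their mean."""
    lam = LAMBDA_BY_CONFIDENCE[confidence]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; standard deviation is 0")
    denom = lam * (sigma / (2 * math.pi))
    scaled = []
    for x in values:
        t = (x - mu) / denom
        # Assumed second step: logistic squashing, written in a form
        # that does not overflow for large negative t.
        if t >= 0:
            scaled.append(1.0 / (1.0 + math.exp(-t)))
        else:
            e = math.exp(t)
            scaled.append(e / (1.0 + e))
    return scaled

# Hypothetical ages inside the 19 to 93 range mentioned earlier.
ages = [19, 35, 42, 58, 93]
softmax_ages = softmax_scale(ages, confidence=95)
```

Under these assumptions, a value equal to the mean maps to 0.5, and a larger λ spreads the near-linear middle of the curve over a wider range of input values.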
How to standardize the data in a column
Figure 6-8  

Copyright Actuate Corporation 2013 BIRT Analytics 4.2