Preparing data for mining

Preparing data for mining

Video tutorials

BIRT Analytics supports normalization, scaling, and remapping operations that prepare a data column to meet conditions required by each data mining algorithm. Preprocessing applies a mathematical operation to values in a data column. Preprocessing a column of data values having a distribution that differs from a standard, or normal, distribution before applying a data mining algorithm can produce a more useful result. For example, you can compare data sets that have different scales and units by standardizing the data so that it falls in the 0 to 1 range. Test scores are often calibrated by percentile, with most scores falling in the 25th to 75th percentile.

Figure 6-1 shows the distribution of raw, or non-standardized, data for age and income. Ages fall in the 19 to 93 range, while incomes fall in the 479.79 to 111571.4 range. To compare these distributions, you must standardize the data.

Figure 6-1

Distribution of raw data for age (left) and income (right)

Standardizing data in a column

There are four ways to standardize the data in a column:

Normalization

Linear scaling

Logistic scaling

Softmax scaling

In each case, a new column is created to contain the standardized data.

Understanding normalization

Normalization calculates the mean of all values in a column. Each value in the new column compares in the same way to values in a standard, normal distribution. The operation subtracts the mean value from each value in the column, then divides the difference by the standard deviation. The formula is:

y = (x - mean{x1, xN}) / (stdv{x1, xN})

Standard deviation shows how much variation there is from the average (mean), or expected value. A low standard deviation indicates that the data points tend to be very close to the mean. A high standard deviation indicates that the data points are spread out over a large range of values.

Figure 6-2 shows normalized data for age and income. The values on the horizontal axis represent the number of standard deviations from the mean. The standard deviation of the mean is 0.

Figure 6-2

Normalized data for age (left) and income (right)

Understanding linear scaling

Standardization by linear scaling is useful when values in a column have the following characteristics:

Many similar values

Very extreme minimum and/or maximum values that are highly concentrated

The more the data are clustered, the better the result obtained by linear regression study. Maximum and/or minimum values are calculated to be suitable to perform linear regression analysis. Linear scaling supports two options:

Use the original minimum and maximum values.

Add a stretch to the minimum and maximum values. This option adds, according to the percentage indicated, a minimum and maximum value before the data is preprocessed.

The formula is:

y = (x - min{x1, xN}) / (max{x1, xN} - min{x1, xN})

Figure 6-3 shows linear scaling with the original minimum and maximum values for age and income.

Figure 6-3

Linear scaling with original minimum and maximum values for age (left) and income (right)

Figure 6-4 shows linear scaling with a stretch to the minimum and maximum values for age and income.

Figure 6-4

Linear scaling with a stretch to the minimum and maximum values for age (left) and income (right)

Understanding logistic scaling

Standardization by logistic scaling is used to recode the variable of study for use in a logistic regression. Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable (a dependent variable that can take on a limited number of values) based on one or more predictor variables. The equation used is:

P(n) = 1/(1 + e-n)

where n represents the values in the column. This equation analyzes the values to form a logistic model.

Understanding Softmax scaling

Softmax scaling standardization is a nonlinear transformation that reduces data ranges for the values in a column as much as possible. The objective is to achieve the minimum and maximum values asymptotically. In other words, the low-end and high-end values gradually approach the minimum and maximum values without ever reaching them.

If you choose Softmax scaling, you can set the confidence level to 68%, 95%, or 99%. The lower the confidence level is, the shorter the intervals and the greater the probability of error will be. The formula is:

x’ = x - E(x) / λ(σx/2π)

where

x represents the values in the column.

E(x) is the average value.

λ is the confidence level.

σx is the standard deviation of the study variable.

π is the number 3.14159…

Figure 6-5 shows Softmax scaling at 68% for age and income.

Figure 6-5

Softmax scaling at 68% for age (left) and income (right)

Figure 6-6 shows Softmax scaling at 95% for age and income.

Figure 6-6

Softmax scaling at 95% for age (left) and income (right)

Figure 6-7 shows Softmax scaling at 99% for age and income.

Figure 6-7

Softmax scaling at 99% for age (left) and income (right)

How to standardize the data in a column

1

In Analytics—Advanced, choose Preprocessing→Normalize Column.

2

Drag the column you want to standardize from My Data and drop it in Column in Normalize column, shown in Figure 6-8.

3

In New column, type the name of the new column.

4

Choose Normalize, or choose one of the scaling algorithms: linear scaling, logistic scaling, or SoftMax scaling. If you choose linear scaling or SoftMax scaling, you can customize the scaling.

5

Choose Create. The column appears in My Data.

Figure 6-8

Standardizing the data in a column

Additional Links:

Forums |

Blogs |

Support |

Give Us Feedback |

More...

BIRT Analytics 4.2