Understanding Clustering
Clustering enables you to organize data based on variables you specify. Usually, the clustering algorithm divides the data into segments whose records have the largest number of attributes in common. In other words, clustering shows which records are similar enough to belong to the same group and how the groups differ from one another.
After BIRT Analytics applies this algorithm, a new field is created in the selected table to group records into a specified number of clusters (N). Because each record is given a value for the clustering, you can see a count of records in each cluster in the Data Tree’s Discrete Values view. For example, customers grouped in the same category or cluster can have common demographic features.
You must use continuous variables because clustering calculates the distance between values to set up a group. In this context, continuous means that a numeric field contains many discrete values. Categorical variables, which are fields with few discrete values such as gender or occupation, do not work.
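The following minimal Python sketch illustrates why distance needs continuous values. It assumes Euclidean distance, which is a common choice for clustering; the text above does not state which distance measure BIRT Analytics uses. The field names and sample values are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two records of numeric field values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (age, annual income in thousands)
customer_a = (30, 40.0)
customer_b = (33, 44.0)

# Numeric values can be subtracted, so a distance is well defined.
print(euclidean_distance(customer_a, customer_b))  # 5.0
```

A categorical value such as an occupation name has no meaningful subtraction, so no comparable distance can be computed for it.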
To set up a clustering model, create a training process.
How to set up a training process
1 Choose Parameters and specify the following:
*Domain: The segment of data from the database. All linked tables are automatically added.
*Confidence level: The size of the representative sample used to create the groups.
*Clusters: The number of groups.
*The attributes or categories used to create the groups. Only fields that contain continuous values are available. To add a field, drag it from the list on the left to the area on the right.
2 Choose Train.
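Conceptually, training with a given number of clusters can be sketched as a generic k-means loop. This is an illustrative sketch only, not the BIRT Analytics implementation: the function names, the choice to seed centroids with the first records, and the sample data are all assumptions, and the Confidence level parameter is not modeled.

```python
import math

def nearest(record, centroids):
    """Index of the centroid closest to the record (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(record, centroids[i]))

def train_kmeans(records, n_clusters, max_iterations=100):
    """Generic k-means sketch; returns (centroids, labels)."""
    # Assumption: seed the centroids with the first n_clusters records.
    centroids = [tuple(r) for r in records[:n_clusters]]
    labels = []
    for _ in range(max_iterations):
        # Assign every record to its nearest centroid.
        labels = [nearest(r, centroids) for r in records]
        # Move each centroid to the mean of the records assigned to it.
        new_centroids = []
        for c in range(n_clusters):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, labels

# Hypothetical records: (age, annual income in thousands)
records = [(25, 20.0), (27, 22.0), (60, 90.0), (62, 94.0)]
centroids, labels = train_kmeans(records, n_clusters=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(26.0, 21.0), (61.0, 92.0)]
```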
How to use the results
When training finishes, Results lists all the groups that have been created, the records in each group, and the mean of each attribute used to set up the groups. Taken together, the attribute means of a group define the centroid of that group.
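As an illustration of what such a results list contains, the sketch below computes the record count and the per-attribute means for each group. The function name, records, and cluster labels are hypothetical and stand in for the output of a finished training run.

```python
from collections import defaultdict

def summarize_clusters(records, labels):
    """Return {cluster: (record_count, per-attribute means)}."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    summary = {}
    for label, members in sorted(groups.items()):
        # The tuple of attribute means is the centroid of the group.
        means = tuple(sum(col) / len(members) for col in zip(*members))
        summary[label] = (len(members), means)
    return summary

# Hypothetical records (age, annual income in thousands) and their clusters
records = [(25, 20.0), (27, 22.0), (60, 90.0), (62, 94.0), (64, 92.0)]
labels = [0, 0, 1, 1, 1]

summary = summarize_clusters(records, labels)
for label, (count, means) in summary.items():
    print(label, count, means)
```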
1 Review the results. After you save the cluster, you cannot train it again.
2 Save the trained cluster to My Folders, to an existing folder, or to a newly created one.
3 Select the saved cluster in My Folders. Right-click and choose Open.
4 On K-Means, drag a new segment and drop it in Domain.
5 Type the new target column name.
6 Choose OK. The new column appears in the selected table.
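Conceptually, applying a saved cluster to a new segment amounts to labeling each record with its nearest centroid and storing that label in the new column. The sketch below assumes Euclidean distance; the function name, centroid values, field names, and target column name are hypothetical, and the product's internal behavior may differ.

```python
import math

def assign_clusters(rows, centroids, fields, target_column):
    """Add target_column to each row, holding the index of the nearest centroid."""
    for row in rows:
        point = [row[f] for f in fields]
        row[target_column] = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
    return rows

# Hypothetical centroids from a previously trained and saved cluster
saved_centroids = [(26.0, 21.0), (61.0, 92.0)]

# Hypothetical new segment dropped in Domain
new_segment = [
    {"age": 29, "income": 25.0},
    {"age": 58, "income": 88.0},
]

assign_clusters(new_segment, saved_centroids, ["age", "income"], "cluster")
print([row["cluster"] for row in new_segment])  # [0, 1]
```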