Data Mining, Cluster Analysis, Market-Basket Analysis,
and Decision Trees
Kroenke defines data mining as "the application of statistical techniques to find patterns and relationships among data for classification and prediction."1 There are two types of data mining – unsupervised and supervised data mining.
Cluster analysis and decision trees are common methods of unsupervised data mining.
Supervised data mining, as the name suggests, is the opposite of unsupervised data mining. Before any analysis occurs, a model and/or theory about the data is drawn up.
Popular methods of supervised data mining include regression analysis and neural networks.
Cluster analysis is a form of unsupervised data mining that uses "statistical techniques [to] identify groups of entities that have similar characteristics."1 There are many techniques and algorithms used to group various types of data together that qualify as cluster analysis, such as connectivity based clustering, centroid-based clustering, distribution-based clustering, and density-based clustering. Cluster analysis is commonly used to identify customer demographics by identifying similar purchasing behavior based on criteria such as age, ethnicity, and economic status. It is also used in the fields of biology, medicine, internet technology, computer science, and even crime analysis to search for conclusions to various problems.
An example of a density-based cluster analysis from Wikipedia.
Kroenke defines market-basket analysis as "a data-mining technique for determining sales patterns."1 It is often used to identify what products consumers are likely to buy based on their purchasing habits. This can be used both by stores and supermarkets when considering where to place products in relation to other products, or what additional products a salesman might try to push onto a customer. This concept is called cross-selling. With this method, the probability that two products will be sold together is drawn from a data sample and a probability of the products being purchased together, called support, is drawn from the data. Conditional probabilities of a product being purchased if another is purchased, known as confidences, can then be drawn from the supports.
"A decision tree is a hierarchical arrangement of criteria that predict a classification or value."1 Decision trees can easily help visualize the probabilities of events occurring based on the data collected. These probabilities can then be used to make educated business decisions based on the sample data trends.