I ran track in college. Pole vault was my event, along with the 4x4 relay whenever coach wanted to punish me for something. As you might assume, my training was very different from that of a thrower or distance runner, but we all shared common goals: to jump higher, throw farther, run faster. To get there, each of our practices was focused on bettering every aspect of our event.
At Othot, our “practices” are spent bettering every aspect of an analysis. This preparation time is better known as data pre-processing. Data pre-processing is the act of carefully screening the data to avoid misleading results and to draw the most valuable information from the analysis. As with track and field training, pre-processing is not a single routine that can be generalized; it involves using various methods specific to the type of analysis and to the data itself.
Let’s first look at a classification analysis, which aims to predict a categorical target. A common type of algorithm used in classification is the decision tree, a model of decisions and their possible consequences that shows the path by which each decision leads to the end result. Data used in this type of analysis could initially be categorical or numerical. Think about yearly income, for example, which is a continuous numerical variable. If you have a dataset with 1000 different customers, you could potentially have 1000 different yearly income levels. Including a course of action in the decision tree for each of the 1000 income levels would unnecessarily complicate the model (especially if multiple variables share this property). The solution in this scenario is to partition the income values into a smaller number of ranges, because even if the income values are not exactly the same, some of them are surely close to one another. This captures the spread of the income values while creating a discrete variable that keeps the decision tree manageable.
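To make the idea of partitioning concrete, here is a minimal sketch of equal-width binning in plain Python. The income figures, the choice of three ranges, and the helper names `make_bin_edges` and `assign_bin` are all made up for illustration; equal-width ranges are only one of several reasonable ways to choose the cut points.

```python
from bisect import bisect_right


def make_bin_edges(values, n_bins):
    """Return the interior cut points that split the data into n_bins equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Return the 0-based index of the range that the value falls into."""
    return bisect_right(edges, value)


# Hypothetical yearly incomes, for illustration only.
incomes = [18_000, 32_000, 35_500, 51_000, 52_250, 78_000, 120_000]

edges = make_bin_edges(incomes, n_bins=3)
income_bins = [assign_bin(x, edges) for x in incomes]

print(edges)        # cut points between the three ranges
print(income_bins)  # one range index per customer
```

Seven distinct income values become three range labels, so the tree has three possibilities to consider for this variable rather than seven.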
Survey data is a good candidate for a classification analysis. Assume a survey asks for a person’s single favorite restaurant in the area. Ideally the survey would list a fixed number of restaurants to choose from, but some surveys use a yes-or-no format instead. This format would list, let’s say, ten restaurants and ask of each one whether it is your single favorite, to which a person would answer yes or no. Each respondent therefore produces nine “no” responses and one “yes” response. Here we have ten variables, most of them redundant “no” responses, that carry no more information than a single variable would. The pre-processing solution in this instance is to collapse the data into one variable, with the restaurant names as the responses.
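Collapsing the yes-or-no columns can be sketched in a few lines of plain Python. The restaurant names, the responses, and the helper `collapse_favorite` are hypothetical. The sketch also assumes that a row without exactly one “yes” should be flagged as missing rather than guessed at.

```python
def collapse_favorite(row, restaurants):
    """Return the one restaurant answered 'yes', or None if there is not exactly one."""
    chosen = [name for name in restaurants if row.get(name) == "yes"]
    return chosen[0] if len(chosen) == 1 else None


# Hypothetical restaurants and survey rows (three instead of ten, to keep it short).
restaurants = ["Restaurant A", "Restaurant B", "Restaurant C"]
responses = [
    {"Restaurant A": "no", "Restaurant B": "yes", "Restaurant C": "no"},
    {"Restaurant A": "yes", "Restaurant B": "no", "Restaurant C": "no"},
    {"Restaurant A": "no", "Restaurant B": "no", "Restaurant C": "no"},  # no favorite given
]

favorites = [collapse_favorite(row, restaurants) for row in responses]
print(favorites)
```

Each respondent now has one categorical value, and rows that break the “single favorite” rule stand out as `None` so they can be reviewed.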
Another common type of analysis is regression analysis, which focuses on the relationship between a dependent variable and one or more independent variables. Scatter plots are a useful tool in linear regression, allowing you to plot a single independent variable against the dependent variable. Take, for example, a travel agency that wants to predict how much money a customer will spend on vacations for the year using 20 independent variables, including yearly household income and number of children. If a large number of customers have a household income clustered around $50,000, it may be difficult to identify trends in the graph. A logarithmic transformation, which replaces each value in the data with its logarithm for use in the analysis, can make patterns in skewed data more visible than they are in the raw data.
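As a rough illustration of the logarithmic transformation, the sketch below takes the base-10 logarithm of a handful of made-up household incomes. The base is an arbitrary choice here, and the transformation only works for values greater than zero.

```python
import math

# Hypothetical household incomes: most cluster near $50,000, with two far larger values.
incomes = [48_000, 50_000, 51_000, 52_000, 250_000, 1_000_000]

# Replace each value with its base-10 logarithm (valid only for positive values).
log_incomes = [math.log10(x) for x in incomes]

raw_range = max(incomes) - min(incomes)
log_range = max(log_incomes) - min(log_incomes)

print([round(v, 2) for v in log_incomes])
print(raw_range, round(log_range, 2))
```

On the raw scale the two large incomes stretch the axis so far that the cluster near $50,000 collapses into a single blob. On the log scale the whole dataset spans less than two units, and the order of the values is unchanged.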
Note that household income and number of children are clearly measured on different scales. We want variables to be measured on the same scale so that we can interpret each one’s influence on the dependent variable relative to the others. Normalizing each variable, generally to a range between 0 and 1, solves this problem and yields more useful information from the analysis.
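One common way to bring variables onto a 0-to-1 range is min-max scaling, sketched below in plain Python. The data and the helper `min_max_scale` are made up for illustration. Mapping a constant column to 0.0 is an assumption made here only to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customers: household income in dollars and number of children.
incomes = [30_000, 50_000, 80_000, 130_000]
children = [0, 1, 2, 4]

scaled_incomes = min_max_scale(incomes)
scaled_children = min_max_scale(children)

print(scaled_incomes)
print(scaled_children)
```

After scaling, both variables run from 0 to 1, so a one-unit change means “from the smallest observed value to the largest” for each of them.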
I have mentioned only a few data pre-processing techniques here, but truly the methods are endless. Pre-processing is a crucial component of the data mining process, but this is not to say that the analysis itself is not important. In track and field, your performance on the day of competition is important, but how your “practices” are spent will determine how successful you are.