The Art of Data Science

Much of what I do at Othot is focused on algorithms and data. Implementing the appropriate machine learning algorithms is important for building models and generating predictions. At the same time, it is important to have sufficient data when building these models. Both are key components of predictive analytics, but how the two interact is where the real beauty of data science lies.

Algorithms aren’t used only for building models. In fact, a big part of data science involves using algorithms to transform data before it is run through machine learning models. Initially this involves preprocessing the data (e.g. fixing inconsistencies and errors, removing outliers, etc.). Using algorithms allows us to automate this process quickly.
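As a rough illustration of what automating this kind of preprocessing can look like, here is a minimal Python sketch. The function names, the choice to upper-case labels, and the 1.5 x IQR outlier rule are my own assumptions for the example; the right cleanup rules always depend on the data set at hand.

```python
import statistics


def normalize_label(value):
    """Collapse whitespace and case differences so that variants such as
    ' ohio ' and 'Ohio' become the same value."""
    return " ".join(value.split()).upper()


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles.

    The IQR rule is one common convention, used here as an assumption;
    it is not the only way to define an outlier.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical example data
states = [" ohio ", "Ohio", "OHIO"]
clean_states = [normalize_label(s) for s in states]

scores = [10, 12, 11, 13, 12, 11, 95]
clean_scores = remove_outliers_iqr(scores)  # drops the extreme value 95
```

Once rules like these are written down as functions, they can be applied to every new batch of data without manual review.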

Next, the data needs to be prepared appropriately for the type of model being built. This requires a strong understanding of what the variables in the data set represent and how they will be handled in each specific model. Here, we can again use algorithms to automate formatting our data properly (e.g. discretizing numeric variables, appropriately handling missing data, etc.).
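To make these two preparation steps concrete, here is a small Python sketch. It assumes missing values are represented as None, fills them with the median, and bins numbers using hand-picked boundaries; the names, the median strategy, and the bin edges are all illustrative choices rather than a recommendation for any particular model.

```python
import bisect
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def discretize(values, edges, labels):
    """Map each number to the label of the bin it falls into.

    `edges` must be sorted, and `labels` needs one more entry than `edges`.
    A value equal to an edge goes into the higher bin.
    """
    return [labels[bisect.bisect_right(edges, v)] for v in values]


# Hypothetical example: ages with one missing value
ages = [18, None, 35, 62]
filled = impute_median(ages)  # the missing age becomes 35
groups = discretize(filled, edges=[30, 50], labels=["low", "mid", "high"])
```

Whether to impute, drop, or flag missing values, and where to place bin boundaries, depends on what each variable represents and on how the chosen model treats it.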

Some of the above processes may seem like structured transformations, but algorithms also allow us to interact creatively with data. A great example of this is feature engineering, which involves deriving additional features from an original data set. Suppose we have an address for each individual record in a data set. Having city and state variables can give us a general idea of location, since we may know that Ohio is closer to Pennsylvania than California is. But how can we make sure that these insights are accounted for in the actual model? We can derive a distance variable for each address (for instance, the distance to a fixed point of interest), which gives the model a value that quantifies this understanding. Once we have identified these types of transformations, algorithms allow us to generate the additional features quickly.
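Here is one way the distance variable described above might be derived, sketched in Python. It assumes each address has already been geocoded to a latitude and longitude (a step not shown here) and measures great-circle distance to a single reference point with the haversine formula. The field names, the reference point, and the approximate coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance_feature(records, ref_lat, ref_lon):
    """Return copies of the records with an added 'distance_km' field."""
    return [
        dict(r, distance_km=haversine_km(r["lat"], r["lon"], ref_lat, ref_lon))
        for r in records
    ]


# Hypothetical records with approximate, already-geocoded coordinates
records = [
    {"state": "OH", "lat": 39.96, "lon": -83.00},
    {"state": "CA", "lat": 34.05, "lon": -118.24},
]
# Hypothetical reference point in Pennsylvania (approximate coordinates)
with_distance = add_distance_feature(records, ref_lat=40.44, ref_lon=-80.00)
```

With this feature in place, the model no longer has to infer from state labels alone that the Ohio record is much nearer to the reference point than the California record; the number says so directly.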

Using a strong set of algorithms to transform a data set can not only improve the data, but can also greatly improve the overall success of a model. Predictive analytics is not only the data or the science; it’s the art of how the two interact.