Turning data science into a product is not an easy task, as evidenced by the many data science consultancy companies and the still relatively few data science product companies. Data science applied to one specific problem and data set is already a highly complex task that usually requires many steps of data preprocessing, the ability to make appropriate modeling assumptions, domain-specific knowledge, and machine learning and software engineering expertise (as well as avoiding all kinds of looming pitfalls that can easily invalidate the results). Taking that one solution and turning it into a more general product adds another layer of complexity that can be extremely challenging. (Note that I will only cover the data science aspects of creating a product. There are many more necessary and difficult steps, involving user interface design, web development, database design, security, etc., that I won’t discuss here.)
At Othot we designed a domain-specific language to address this problem: the Othot Data Science Language, ODSL for short. ODSL captures the sequence of steps that turn our customers’ data into actionable insights. ODSL scripts are written in a custom JSON format and the interpreter is written in Python. The first step in most ODSL scripts is to load a customer’s training data set. The data is parsed and semantic information is automatically added, such as whether a variable is an identifier, a date, a range, etc. Personally, I really like the detection of range variables because it is a huge time saver. For example, if the data contains salary ranges such as “$50,000 – $60,000”, “$20,000 – $30,000”, “under $10,000” and “$250,000+”, the algorithm will automatically standardize the ranges (by creating an ordinal variable) and will correctly identify “under $10,000” as the lowest range and “$250,000+” as the highest (note that this cannot be achieved just by sorting the labels). Another one of my favorites is an algorithm that automatically detects and corrects typos in the input data, e.g., “Pitsburgh” will be automatically corrected to “Pittsburgh”. Finally, there are also HIQ™ definition files (HIQ™ stands for High Impact Question™) that contain information about a specific domain, such as higher education, and these are used to validate input data and attempt to fix it when it is incorrect. All of this results in higher quality training data which, in turn, leads to higher performing models.
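To make the range and typo examples concrete, here is a minimal sketch in Python. It is not Othot’s implementation, which this post does not show: the function names (`parse_range`, `to_ordinal`, `correct_typo`), the list of known cities, and the similarity cutoff are all assumptions made for illustration, and the typo matching simply uses the standard library’s `difflib`.

```python
import difflib
import re

# Hypothetical prefixes that mark a range with no lower bound.
UNDER_PREFIXES = ("under", "below", "less than")


def parse_range(label):
    """Turn a range label into (low, high) bounds; open ends become +/- infinity."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    text = label.strip().lower()
    if len(numbers) == 1 and text.startswith(UNDER_PREFIXES):
        return (float("-inf"), numbers[0])
    if len(numbers) == 1 and text.endswith("+"):
        return (numbers[0], float("inf"))
    if len(numbers) == 2:
        return (numbers[0], numbers[1])
    raise ValueError(f"unrecognized range label: {label!r}")


def to_ordinal(labels):
    """Map each distinct range label to its rank when ordered by numeric bounds."""
    ordered = sorted(set(labels), key=parse_range)
    return {label: rank for rank, label in enumerate(ordered)}


def correct_typo(value, vocabulary, cutoff=0.85):
    """Replace value with its closest vocabulary entry, if one is similar enough."""
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


salaries = ["$50,000 – $60,000", "$20,000 – $30,000", "under $10,000", "$250,000+"]
ranks = to_ordinal(salaries)

# Assumed vocabulary; a real system would need a much larger reference list.
cities = ["Pittsburgh", "Philadelphia", "Harrisburg"]
fixed = correct_typo("Pitsburgh", cities)
```

Sorting the raw strings would order them character by character, which is why the sketch parses the numeric bounds first and ranks on those. The cutoff is a judgment call: set it too low and valid but unfamiliar values may be "corrected" into something else.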
But this is really just the tip of the iceberg. After loading and interpreting the data, many additional preprocessing steps can be performed automatically, such as normalization, standardization, discretization, calculating distances based on address information, removing outliers, one-hot encoding, deducing new columns, calculating similarity/distance matrices, etc. There is too much to cover in just a single blog post, but one of the more exciting features that we are actively developing is the ability to (fully automatically) combine customer data with many sources of external data (e.g., weather data). Last but not least, we also have a comprehensive collection of machine learning algorithms at our disposal (ranging from classification to regression to clustering) that create models during the training phase and make predictions during deployment. The preprocessing steps can depend on the type of algorithm, e.g., for clustering a similarity matrix is automatically constructed, and ODSL can run algorithms in parallel to find the optimal solution in a given domain.
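To give a feel for how a JSON script can drive a sequence of preprocessing steps from Python, here is a small sketch. The actual ODSL format is not shown in this post, so everything below is hypothetical: the `steps` and `op` keys, the operation names, the column names, and the toy rows are invented for illustration only.

```python
import json
import statistics

# A made-up script in a JSON layout; this is NOT the real ODSL syntax.
SCRIPT = """
{
  "steps": [
    {"op": "standardize", "column": "distance_km"},
    {"op": "one_hot", "column": "state"}
  ]
}
"""


def standardize(rows, column):
    """Rescale a numeric column to zero mean and unit variance."""
    values = [row[column] for row in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return [{**row, column: (row[column] - mean) / stdev} for row in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# The interpreter is a lookup table from operation name to Python function.
OPERATIONS = {"standardize": standardize, "one_hot": one_hot}


def run_script(script_text, rows):
    """Apply each step of the script to the rows, in order."""
    script = json.loads(script_text)
    for step in script["steps"]:
        params = {key: value for key, value in step.items() if key != "op"}
        rows = OPERATIONS[step["op"]](rows, **params)
    return rows


# Toy rows, invented for this example.
rows = [
    {"distance_km": 10.0, "state": "PA"},
    {"distance_km": 20.0, "state": "OH"},
    {"distance_km": 30.0, "state": "PA"},
]
result = run_script(SCRIPT, rows)
```

Keeping the script as data rather than code is what makes it easy to store, rerun, and vary one step at a time.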
Finally, I want to emphasize two important benefits that ODSL brings us. The first one is reproducibility, as we store all of our scripts and can rerun them at any time. If there is a particular question about a given customer data set, we simply run the scripts again and can reanalyze the results and dig deeper if necessary. The second advantage is experimentation. Suppose we change one of the preprocessing steps and want to assess its efficacy. This is just a matter of running the ODSL scripts on a representative sample of customer data with the new feature enabled and with the feature disabled, and then evaluating the results for all the algorithms. Of course, it is possible that some algorithms may benefit and others do not, in which case we can enable the new feature conditionally.
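The experimentation loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production setup: the two toy scorers (a leave-one-out nearest-neighbor classifier and a one-split rule), the `run_experiment` helper, and the six-row data set are all invented here, with standardization standing in for "the new feature".

```python
import math
import statistics


def standardize(rows):
    """The preprocessing feature under test: zero mean, unit variance per column."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    stdevs = [statistics.pstdev(column) or 1.0 for column in columns]
    return [
        tuple((value - mean) / stdev for value, mean, stdev in zip(row, means, stdevs))
        for row in rows
    ]


def nearest_neighbor_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others, key=lambda j: math.dist(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def stump_accuracy(rows, labels):
    """Accuracy of a single split on the second feature at its mean."""
    threshold = statistics.fmean(row[1] for row in rows)
    predictions = [int(row[1] > threshold) for row in rows]
    return sum(p == y for p, y in zip(predictions, labels)) / len(rows)


# Hypothetical stand-ins for the real collection of algorithms.
ALGORITHMS = {"nearest_neighbor": nearest_neighbor_accuracy, "stump": stump_accuracy}


def run_experiment(rows, labels, feature):
    """Score every algorithm with the feature disabled and enabled."""
    report = {}
    for name, score in ALGORITHMS.items():
        off = score(rows, labels)
        on = score(feature(rows), labels)
        # Enable the feature only for algorithms that actually improve.
        report[name] = {"off": off, "on": on, "enable": on > off}
    return report


# Toy data: the first column has a large scale but carries no signal.
rows = [(1000, 0.10), (5000, 0.20), (9000, 0.15),
        (1200, 0.90), (5200, 0.80), (8800, 0.85)]
labels = [0, 0, 0, 1, 1, 1]
report = run_experiment(rows, labels, standardize)
```

On this toy data, standardization helps the distance-based scorer and leaves the split-based one unchanged, which is exactly the situation where enabling a feature conditionally makes sense. A real evaluation would of course use held-out data and more than one metric.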
I do not think it is a coincidence that reproducibility and experimentation turned out to be (admittedly emergent) benefits of ODSL that came about through automation; they are fundamental to science and therefore fundamental to data science!