Mark Voortman
One question that arises naturally is what quantity and quality of data are necessary to build effective models. For example, at a recent Othot seminar, an attendee commented that his IT department said he didn’t have enough data to build predictive and prescriptive models.
In this blog post, we look at that question in more detail and offer five lessons that provide guidance on, and insight into, collecting data for building models.
There are a lot of platitudes about the importance of data, such as “garbage in – garbage out” or “data is the new oil of the digital economy.”
As data scientists, we can certainly relate, as data is the foundation for every model we build. One could even go so far as to say that a stronger focus on a data-centric approach could deliver better results than a model-centric approach.
That is not to say that modeling is not important. In fact, developing performant models is extremely important and is something we specialize in at Othot. But once you have very performant models, the best way to get even better insights and results is to gather more data or improve its quality.
Before determining if enough data is available to build predictive/prescriptive models, it's helpful to understand the types of data that typically help inform these models: demographic and personal data, behavioral data, and interventional data.
Demographic and personal data is descriptive of the individual, such as address, academic performance, interests, diversity data, etc. Most of these variables are stored in an institution’s SIS, CRM, and LMS and tend to be more readily accessible than other data types. We also combine your data with data from external sources like the census.
Behavioral data captures the actions students are taking, such as visiting campus, attending an online webinar, filing a FAFSA, etc.
Interventional data captures how students engage with an institution, often represented through students’ responses to email, phone, and sometimes website interactions.
Collecting and incorporating data in all three of these categories is good but not necessary for building performant models. Including an array of demographic and personal information sets a good data foundation for the machine learning algorithms. These variables tend to be mainly predictive rather than prescriptive but are still helpful in understanding the population of records and the characteristics that drive these individuals to enroll, retain, graduate, etc.
When available, supplementing this foundation with behavioral and/or interventional data can improve performance and generate insights that tend to be more actionable, especially through the introduction of prescriptive features. Interventional data can be predictive, but it is also a more difficult type of data to incorporate. For Othot partners who have not been able to incorporate it, the models have still performed well.
Finally, it is important to emphasize that we can help retrieve, combine, and clean data. For example, while your data may be in disparate systems, we have developed tools that can automatically merge this data once it is set up.
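To make the idea of combining data from disparate systems concrete, here is a minimal sketch in Python. It is an illustration only, not Othot's tooling: the `student_id` key, the field names, and the `merge_by_id` helper are all hypothetical, and it assumes every system shares a common identifier.

```python
def merge_by_id(*sources):
    """Combine rows from several systems into one record per student.

    Assumes each row carries a shared "student_id" key (a hypothetical name).
    If two systems report the same field, the later source overwrites the earlier one.
    """
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["student_id"], {}).update(row)
    return merged

# Hypothetical extracts from two systems.
sis = [{"student_id": "S1", "gpa": 3.6}, {"student_id": "S2", "gpa": 3.1}]
crm = [{"student_id": "S1", "campus_visit": "yes"}]

combined = merge_by_id(sis, crm)
```

In practice, identifiers often differ between systems, so matching records usually takes more care than this sketch suggests.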
Based on our experience working with many partners, we have developed a data map to guide the process and determine which variables are essential to collect and which variables tend to be less impactful.
In addition to considering the various types of data, it's also important to recognize both the quantity and quality of the data and features.
Quantity is important up to a point. It’s quite possible, even likely, that 25 demographic/personal variables capture about as much information as 100 variables would.
For example, a variety of variables capture an individual’s location, including city, state, zip, country, county, geomarket, longitude, latitude, etc. Each of these variables represents very similar pieces of information, and once a few items have been incorporated, it becomes less important to introduce additional items.
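One simple way to see this overlap is to check whether one location variable is fully determined by another. The sketch below is a rough illustration in Python; the records, field names, and the `is_determined_by` helper are made up for the example.

```python
from collections import defaultdict

def is_determined_by(rows, source, target):
    """Return True if every value of `source` maps to exactly one value of `target`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[source]].add(row[target])
    return all(len(values) == 1 for values in seen.values())

# Hypothetical records; the field names are illustrative only.
records = [
    {"zip": "15222", "city": "Pittsburgh", "state": "PA"},
    {"zip": "15213", "city": "Pittsburgh", "state": "PA"},
    {"zip": "44113", "city": "Cleveland", "state": "OH"},
    {"zip": "15222", "city": "Pittsburgh", "state": "PA"},
]

# In this sample, knowing the zip code already tells you the state...
zip_gives_state = is_determined_by(records, "zip", "state")
# ...but knowing the state does not tell you the zip code.
state_gives_zip = is_determined_by(records, "state", "zip")
```

When a variable is fully determined by one already in the model, adding it is unlikely to contribute much new information.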
The same thing can be said for years of data. In an ideal situation, 2-3 years of data would be used to build models, but one year of data can be sufficient.
Quality of data is always important. Collecting data consistently for individuals in any given year and maintaining the same or similar practices for multiple years contributes to the quality of the data. If data is not collected in the same manner for all individuals, then the usefulness of the data is compromised, and variables may not be included in the models.
It's inevitable that the way data is recorded will change to some degree over time, usually due to migrating between systems, collecting data in different or better ways over time, or starting to record new variables that were not previously captured. Therefore, understanding the consequences of data changes on predictive modeling can help mitigate the issues that could arise when these changes or migrations occur. This is also something Othot can help with.
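A quick check for this kind of drift is to compare how often a variable is filled in from year to year. The following is a minimal sketch in Python, assuming each record carries a `year` field; the data and the `fill_rate_by_year` helper are hypothetical.

```python
from collections import defaultdict

def fill_rate_by_year(rows, variable):
    """Fraction of records per year that have a non-empty value for `variable`."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for row in rows:
        totals[row["year"]] += 1
        if row.get(variable) not in (None, ""):
            filled[row["year"]] += 1
    return {year: filled[year] / totals[year] for year in sorted(totals)}

# Hypothetical records: the visit field is recorded less consistently in 2022.
rows = [
    {"year": 2021, "campus_visit": "yes"},
    {"year": 2021, "campus_visit": "no"},
    {"year": 2022, "campus_visit": ""},
    {"year": 2022, "campus_visit": "yes"},
    {"year": 2022, "campus_visit": None},
    {"year": 2022, "campus_visit": None},
]

rates = fill_rate_by_year(rows, "campus_visit")
```

A sharp drop in the fill rate between years, as in this made-up sample, is a hint that collection practices changed and that the variable deserves a closer look before it goes into a model.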
So even if the quantity and quality of data are compromised, in most cases, effective models can be built as a starting point, after which additional data collection will enhance these models.
Let’s turn to data collection next.
Even if your data quantity and quality are not what you would like them to be, Othot can help you determine what other data to collect, how to go about collecting it, and how to improve your current practices for collecting and storing data.
Due to our extensive experience working with our partners, we know what data can make a big difference for your institution and the problem you are trying to solve. All the variables mentioned above are candidates for collection, and which additional variables are most effective to collect depends on the variables you are currently collecting.
Also, note that tracking data transactionally is always preferred over aggregated data because once data is aggregated, you can’t go back to the raw data.
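The difference is easy to illustrate. In the Python sketch below, a hypothetical transactional log (one row per event, with made-up field names) can be summarized in several different ways, whereas a stored total per student could not be broken back down by month or by event type.

```python
from collections import Counter

# Hypothetical transactional log: one row per event.
events = [
    {"student_id": "S1", "date": "2023-09-14", "type": "email_open"},
    {"student_id": "S1", "date": "2023-10-02", "type": "campus_visit"},
    {"student_id": "S2", "date": "2023-10-05", "type": "email_open"},
    {"student_id": "S1", "date": "2023-11-20", "type": "email_open"},
]

def count_events(events, key):
    """Count events grouped by whatever `key` extracts from each event."""
    return Counter(key(e) for e in events)

# The same raw log supports many different aggregations.
per_student = count_events(events, lambda e: e["student_id"])
per_month = count_events(events, lambda e: e["date"][:7])  # "YYYY-MM"
per_student_type = count_events(events, lambda e: (e["student_id"], e["type"]))
```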
Another point worth mentioning, and one that is critical for the models to make realistic predictions and prescriptions, is to never change or remove existing data.
For example, if you offer a prospective student financial aid, but they decide to enroll elsewhere, then don’t remove the offered aid from the data altogether. It tells you something about how effective that amount of aid is for that particular type of student. This may sound somewhat obvious, but in practice, this is something that we see happen regularly.
So just following the simple rule that data should not be changed or removed once collected will be very helpful. There are some examples where changes are permitted, such as changing a visit from no to yes, but even in those cases, it may make more sense to track data transactionally.
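One way to follow this rule is to append new facts instead of overwriting old ones. The Python sketch below is a simplified illustration under that assumption; the log structure, field names, and helper functions are hypothetical.

```python
# Hypothetical append-only log: new facts are added, old ones are never edited.
log = []

def record(student_id, field, value, date):
    """Append a new fact; earlier entries are never changed or deleted."""
    log.append({"student_id": student_id, "field": field, "value": value, "date": date})

def entries_for(student_id, field):
    """All logged entries for one student and field, in insertion order."""
    return [e for e in log if e["student_id"] == student_id and e["field"] == field]

def latest(student_id, field):
    """Most recent value of a field (by ISO date), or None if never recorded."""
    matches = entries_for(student_id, field)
    return max(matches, key=lambda e: e["date"])["value"] if matches else None

def history(student_id, field):
    """Every recorded (date, value) pair for a field."""
    return [(e["date"], e["value"]) for e in entries_for(student_id, field)]

record("S1", "aid_offered", 12000, "2023-02-01")
record("S1", "visited", "no", "2023-02-01")
record("S1", "visited", "yes", "2023-03-15")   # the change is appended, not overwritten
record("S1", "enrolled", "no", "2023-05-01")   # the aid offer stays in the log
```

Here the current visit status is still available through `latest`, while the earlier "no" and the aid offer to a student who enrolled elsewhere both remain available for modeling.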
The bottom line is that regardless of your specific data situation, we will work with you and use your data to achieve your objective. There are many different types of data, and quantity and quality matter, but we can create effective models and guide additional data collection in almost all cases.
As is widely known among data scientists, but perhaps somewhat underappreciated in general, the collection of data is a journey and an ongoing process that never ends. But regardless of your starting point, just get started. It is always possible to improve and enhance data collection processes to improve models and insights. More and better data has other positive side effects as well because every decision will be more informed.
We think this blog post addresses the concern raised at the beginning. While there are situations where it is very difficult to help, such as when there is no data at all, in most cases we can build initial models that are effective and help guide you in collecting additional data. Which data to collect will depend on the data you are already collecting.
Let no one discourage you from getting started!
Contact us. We can help!
Ashton Black, Data Scientist at Othot, co-authored the post.