The term “unstructured data” seems to be getting a lot of attention nowadays in the data science realm. It may be a fairly new term, but it describes something that has been around since the beginning of time.
Before we get too far into structured and unstructured data, we should have a bit of a philosophical conversation about data. Data is information. Information is a set of facts about things. The statement, “Today I am wearing a white shirt,” is information. The previous sentence, as written, is data. Humans convert information to knowledge. By reading the sentence about the color of the shirt I’m wearing, you converted that data to information and now have knowledge that I’m wearing a white shirt today. (I realize you might not be reading this on the day I am writing it, but some number of days later, which would make that data no longer valid. This is the time dimension of data, which could be a blog post all its own…)
Information can be conveyed as data in a number of ways. With regard to the color of the shirt I’m wearing today, the sentence in the previous paragraph is one way. An attached photo of me would be another, a Twitter post by me about my shirt color is another, and the footage from the security cameras in the parking lot is yet another. If you are in search of the knowledge of what color shirt I’m wearing today, there are a number of possible data sources you could use.
Let’s continue with my shirt example. Let’s imagine that the guy who sits by me every day determines that there is value to him in knowing what color shirt I wear every day (as creepy as that sounds…). To record this information, he opens up a spreadsheet application and makes one column titled “Day” and another column titled “John’s Shirt Color”. After a week, he would have several days’ worth of data. Using the spreadsheet application, which is a set of computer programs that analyze data, he can build a chart of how often each color occurs (e.g. a bar chart with color on the x-axis). The computer program can take the data entered each day and easily process it into a chart of aggregated data. He used data and a computer program to create additional information, namely my shirt color preference, at least for the measured time period. Now the creepy guy next to me has knowledge of my shirt color preference, hopefully to be used by him when my birthday next arrives. The data he collected is considered structured data.
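To make the spreadsheet step concrete, here is a minimal Python sketch of the same aggregation. The week of observations is made up for illustration, and the function name `color_counts` is hypothetical; the point is only that data with a “Day” column and a “Shirt Color” column can be counted by a program with almost no effort.

```python
from collections import Counter

# Hypothetical week of observations, mirroring the two spreadsheet
# columns "Day" and "John's Shirt Color". The values are invented.
observations = [
    ("Monday", "white"),
    ("Tuesday", "blue"),
    ("Wednesday", "white"),
    ("Thursday", "gray"),
    ("Friday", "white"),
]

def color_counts(rows):
    """Count how many days each shirt color was worn."""
    return Counter(color for _day, color in rows)

counts = color_counts(observations)
favorite, days_worn = counts.most_common(1)[0]

# A crude text version of the bar chart: one bar per color.
for color, n in counts.most_common():
    print(f"{color:<6} {'#' * n}")
print(f"Most-worn color: {favorite} ({days_worn} days)")
```

Because every row has the same two fields, the program never has to guess where the color is. That predictability is what makes the data structured.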
Now, let’s redo our previous thought experiment. Instead of using a spreadsheet to record my daily shirt color, he snaps a photo (even creepier!). He takes the digital file of the photo and puts it on his computer. Over time he builds up a record, in photographs, of my shirt color. The interesting thing is that each photo holds several other pieces of data about me that day, such as my hair length, whether I shaved, and possibly even my disposition for the day based on my facial expression. There is a lot of data in one photo. The problem for him now, in determining my birthday present, is this: how does he aggregate this data using a computer program like the spreadsheet application? There is no real structure to this data for the spreadsheet to aggregate. It is digital data; open the file up and you’ll see a bunch of crazy characters that another computer program can convert to color pixels to make a photo. However, for the problem at hand, the photo does not hold my shirt color as structured data; rather, as we keep hearing over and over again, it is unstructured data.
I know what you’re thinking. He could open each photo and build that same spreadsheet as in the first thought experiment, and of course you are exactly correct (if it is possible to be exactly correct). He can use the most sophisticated computer known to humankind, the human brain, to take the unstructured data of the photo, convert it manually to structured data in the spreadsheet, build that graph, and buy my present. However, let’s expand the data set. Let’s say we wanted to know the color of shirt worn every day by every person in the city of Pittsburgh, and all we had were photos of everyone. You’d better pack a lunch, because it would take you all day and then some to convert those unstructured photos to the structured color-by-day spreadsheet using only your brain. Instead, just think how powerful it would be to have a computer able to take the unstructured photo data and automatically do the work of converting it to structured data. If I were marketing shirts to Pittsburgh, outside of black and gold, I could learn what the best colors are to present to that market. That could be very valuable.
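As a toy illustration of that conversion, here is a rough Python sketch that turns pixels into a spreadsheet-style row. It rests on big assumptions: the photo is already decoded into a grid of RGB values, we already know which region of the photo is the shirt, and the small `PALETTE` of named colors is invented for the example. Finding the shirt in a real photo is the genuinely hard part, and this sketch does not attempt it.

```python
# Hypothetical palette of named colors as RGB values (0-255).
PALETTE = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "gold": (255, 200, 0),
    "blue": (0, 0, 255),
    "red": (255, 0, 0),
}

def average_color(pixels):
    """Average a list of (r, g, b) pixels channel by channel."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def nearest_color_name(rgb):
    """Return the palette name closest to rgb (squared RGB distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name]))
    return min(PALETTE, key=dist2)

def shirt_color(photo, region):
    """Name the average color inside region = (top, bottom, left, right)."""
    top, bottom, left, right = region
    pixels = [photo[r][c] for r in range(top, bottom)
              for c in range(left, right)]
    return nearest_color_name(average_color(pixels))

# A made-up 4x4 "photo": two rows of face, two rows of off-white shirt.
face, shirt = (210, 170, 140), (240, 240, 235)
photo = [[face] * 4, [face] * 4, [shirt] * 4, [shirt] * 4]

# Assumed shirt location: rows 2-3, all four columns.
row = ("Monday", shirt_color(photo, (2, 4, 0, 4)))
print(row)
```

The output is one structured row, day plus color name, which is exactly what the spreadsheet in the first thought experiment needed. Doing this for every photo of everyone in Pittsburgh is then a loop rather than a very long lunch.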
So, finally, to the title of this blog. What is all the fuss about? Well, thanks to recent innovations in computer science, the ability to convert massive amounts of unstructured data into structured data that data science models can use to provide information, and eventually knowledge, is HERE. And in most cases it doesn’t take the budget of a space program to use these abilities (more on this in future blogs from OThot). So the “fuss” is that all these sources of unstructured data, in photos, Twitter posts, Word documents, digital books, and movies on the internet, could be converted to structured data. Just let that sink in. Wow. Got to go, I’m moving my desk away from the creepy guy sitting next to me…