How does OThot bring data to your doorstep? How does a tweet, for example, get to the displays and results we provide? Let’s put on our imagination caps and pretend a grain of sand is a tweet. We will follow it from inception: in the time it takes you to compose your tweet, a grain of sand has been carved from a rock and washed up on some beach.
Turns out, this is a fairly apt metaphor. According to Domo, every minute in 2014: Facebook users share nearly 2.5 million pieces of content, Twitter users tweet nearly 300,000 times, Instagram users post nearly 220,000 new photos, YouTube users upload 72 hours of new video content, Apple users download nearly 50,000 apps, and email users send over 200 million messages.
Still, how big is “Big Data”, really? It’s difficult to say exactly: first, of course, because of the size of the number in question, but also because technology companies are notoriously secretive about their data. For example, Google releases very little information about its data centers. However, Randall Munroe, creator of the comic xkcd, explored the question of private data stores in his “What if?” series (4). He estimated, with his signature charm, that Google stores about 15 exabytes, or 1.5 x 10^19 bytes.
Of course, that is only a fraction of the digital universe. The EMC-sponsored IDC study, “Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, found that, in 2012, there were 2.8 ZettaBytes (ZB) of data in the digital universe (1). IDC projected that this would grow to 40 ZB by 2020. (A ZettaByte is a mythical creature with horns made of... er, wait; a ZettaByte is 10^3 exabytes, or 10^21 bytes.)
That seems like a lot, what with its fancy scientific notation and all, but, for the sake of our mushy human brains, let’s reframe it: Munroe estimated that if Google’s data were stored on punched cards, as data had been around the time your great-grandfather was worried about polio, it would be enough to “[cover] New England, to a depth of about 4.5 kilometers”, which, as he notes, is “three times deeper than the ice sheets that covered the region during the last advance of the glaciers”.
While you’re still googling what a punched card is (adding more hypothetical punched cards to some remote server): according to the University of Hawaii, there are only an estimated 7.5 x 10^18 grains of sand on all the beaches on Earth (2). (Though, following my Spring Break, there are at least a dozen more in my shoe, and they do seem to be multiplying somehow.) For kids keeping score at home, on a typical night we can see only a few thousand stars in the sky, but with eyes like the Hubble Space Telescope, we might be able to capture a good chunk of the estimated 10^24 stars in the observable universe. So, when laying your children down to sleep, or when gazing into your lover’s eyes, reassure them: I love you more than all the bytes of data predicted to exist in the next decade or so.
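For anyone who wants to keep score at home with a keyboard, here is a minimal Python sketch that lines up the figures quoted above. The numbers are the ones cited in this post; the variable names and the choice to measure everything in "beaches' worth of sand" are just for illustration.

```python
# Figures quoted in the post. Names are illustrative, not from any library.
EXABYTE = 10**18
ZETTABYTE = 10**21

google_bytes = 15 * EXABYTE        # Munroe's estimate of Google's storage
sand_grains = 75 * 10**17          # 7.5 x 10^18 grains on Earth's beaches
universe_2012 = 28 * 10**20        # 2.8 ZB, per the IDC study
universe_2020 = 40 * ZETTABYTE     # IDC projection for 2020
stars = 10**24                     # estimated stars in the observable universe

quantities = {
    "Google's data (bytes)": google_bytes,
    "Digital universe, 2012 (bytes)": universe_2012,
    "Digital universe, 2020 (bytes)": universe_2020,
    "Stars in the observable universe": stars,
}

# Express each quantity as a multiple of all the sand on all the beaches.
for label, value in quantities.items():
    print(f"{label}: {value:.1e}, or {value / sand_grains:,.0f}x the sand")
```

By these figures, Google alone holds about two bytes for every grain of sand, and the stars still outnumber the projected 2020 bytes by a factor of 25.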
Like sands through an hourglass, this is the data of our lives…
So our robot overlords have all this information, beaches and deserts of data, living in Skynet somewhere. Now what? Well, there is so much data available, in fact, that the trick is sifting through the muck. The IDC study also found that, “despite the unprecedented expansion of the digital universe due to the massive amounts of data generated daily by people and machines…only 0.5% of the world’s data is being analyzed”.
To complicate matters, as John Abbatico mentioned in his OThot blog post, much of this data is unstructured (3). And if you think “unstructured” doesn’t sound all that bad, here’s some raw binary data:
“00000a0: 6d65 3233 3a6d 6564 6961 7769 6b69 2d31  me23:mediawiki-1
00000b0: 2e31 352e 312e 7461 722e 677a 3132 3a70  .15.1.tar.gz12:p
00000c0: 6965 6365 206c 656e 6774 6869 3332 3736  iece lengthi3276
00000d0: 3865 363a 7069 6563 6573 3636 3230 3ae9  8e6:pieces6620:.
00000e0: b08a 7ef8 00f8 d8b4 a53e 15e3 6bd6 e2c4  ..~......>..k...
00000f0: a7e4 1aa6 c67f 7106 cd3e 1672 decc b5c7  ......q..>.r....
0000100: 455c a86d 4751 379a f59f 3665 1e8c 128a  E\.mGQ7...6e....
0000110: dec4 e670 ca0f e960 353b 48fe 3dfb c455  ...p...`5;H.=..U
0000120: f940 e102 13d6 8385 1655 4642 3e83 060b  .@.......UFB>...
0000130: 585f d353 2ef2 07ff d9e3 aeb6 7329 2192  X_.S........s)!.
0000140: e0a9 7d75 390f 3c16 def6 d806 469e af64  ..}u9.<.....F..d”
Did you get all that? We have the tools that can decipher this mess, but first we must access it, which is its own hurdle. Much of the world’s data exists on privately-restricted databases, or else it is obfuscated by outdated and fickle APIs.
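As a small taste of the "deciphering" part, here is a hedged Python sketch of how the right-hand column of a hex dump like the one above can be derived from the hex digits: each pair of digits is one byte, printable ASCII bytes are shown as characters, and everything else becomes a dot. The function name `printable_column` is made up for this example, and it assumes you pass in only the hex groups from a row, not the offset or the text column.

```python
def printable_column(hex_groups: str) -> str:
    """Render the bytes in a row of hex groups the way a hex dump's text column does."""
    # Remove the spaces between groups, then turn each pair of hex digits into a byte.
    data = bytes.fromhex(hex_groups.replace(" ", ""))
    # Printable ASCII (space through tilde) is shown as-is; anything else as a dot.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


# Two rows copied from the dump above: one readable, one not so much.
print(printable_column("6d65 3233 3a6d 6564 6961 7769 6b69 2d31"))  # me23:mediawiki-1
print(printable_column("b08a 7ef8 00f8 d8b4 a53e 15e3 6bd6 e2c4"))  # ..~......>..k...
```

The first row decodes to something a human can squint at; the second is mostly dots, which is roughly what "unstructured" feels like from the inside.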
Still, some of these data, like sand, trickle through. Even then, they may not be particularly useful. For example, we may very well be able to find through some internet magic that I have recently tweeted about how to Photoshop a shark’s head onto a bear’s body (10). In fact, you may even have access to my entire tweeting history. Geez, you think to yourself, this kid really enjoys Photoshopping shark heads onto things that are not otherwise sharks (8) (9); maybe he should get out, maybe he should get some other hobbies. Maybe some friends. (You don’t know me. You don’t know my life.) Anyway, all of this information is potentially unhelpful when deciding how likely I am to buy, for example, tickets to the zoo and aquarium (though, as it turns out, the correlation is very high in this case).
We need some systematic process to comb through this massive collection of sand (something something Spaceballs reference). We must identify what is relevant and what is accessible, and we must aggregate it. Dr. Mark Voortman’s (7) and Ashton Stewart’s (6) posts delve more into this process, so I’ll just pick up where they left off: after some incantations and blood sacrifice, we have the proper and relevant information, dumped into massive sandboxes.
Well, it’s still not very helpful. Sure, our magic wands have been able to generate the most relevant information; in fact, they can print out results in decimals and p-values and percentages, but human beings need to process that information! Let’s take a time machine back to the first few paragraphs in this post. When I lobbed all those scientific notations around, did it register at all? What about the punched cards? A little difficult to understand, because it was so far-fetched. The stars in the sky and the sand were surely more evocative; you’ve seen the sky and you’ve been to a beach, so you could begin to wrap your mind around it.
Likewise, in a sea of sand, we must build sandcastles. Just like in She’s All That, or Can’t Buy Me Love, or that girl from The Breakfast Club, or Hunk, or any of the apparently endless Hollywood efforts to pander to teenage insecurities: data is nerdy, it’s ugly, it’s square. We need to make it cool, popular, and sexy.
We need to make it Fonzie.
It’s our job to sit data down in the chair, take a little bit off the top, shave its beard, slap some lipstick on, and make it easy for human eyeballs to love. Using the latest and greatest in web-based data-rendering technologies, we can replicate things like real-world physics and material design to generate impactful visualizations.
At the end of the day, we get something like this:
That’s the power of presentation, and that is the power of OThot.