Last month my wife and I headed down to the Gulf Coast for a long weekend.  The week before we were to leave I pulled up the forecast on my phone to see what we could expect.  Good news for us – sunshine and moderate temperatures were on the way. That’s when it hit me.

When did weather forecasting get so good? It wasn’t very long ago that weather apps showed five days out, a week at most, and no one cared because anything beyond a couple of days was highly suspect.  But now fifteen-day forecasts are the norm, and AccuWeather will gladly show you twenty-five if you want to contribute a few bucks to their bottom line by upgrading to “platinum”.

That’s flexing some real forecasting muscles!  We’ve made impressive strides at predicting the weather, from nailing the expected high temperature to forecasting the path of severe storms.  It’s not perfect, of course, depending on the location and time of year, but overall it’s clear that long-range accuracy has significantly improved.

Today, we can guess the high temperature three days into the future as accurately as we could predict tomorrow’s weather just 10 years ago, according to Eric Floehr, founder of ForecastWatch. And our five-day forecasts are, on average, as skilled as 2005’s three-day predictions.


How did that happen? It’s all about the data. Big data.  

Weather observations have exploded in the past decade in a very big way.  The World Meteorological Organization has invested large sums of money to install and maintain a growing number of ground-based observing networks, while NASA and NOAA have launched an armada of satellites and instruments over the past 10 years.  The GOES satellites alone, from GOES-13, -14, and -15 of the GOES-N series to GOES-16, the first of the GOES-R series, launched in late 2016 and operational since 2017, send torrents of data on atmospheric and surface conditions back to Earth for analysis.

The analytics get the publicity – we always hear the meteorologists talk about the “models” – but the data is what drives the accuracy. Everyone acknowledges it all starts with lots and lots of good-quality data.  Good data, and lots of it, and we get a good weekend forecast. Bad data, or not enough data, and we’d better pack both an umbrella and sunscreen just to be safe.

To a marketer, that sounds eerily familiar when you consider that more than 90% of all today’s data was created in the past couple of years. That includes all the sources of consumer data that we mine for insights today. Our “satellites” are found in the booming sources of consumer data from social, online, offline, and mobile – the billions of data points about characteristics, behaviors, connections, movements, opinions, preferences, and more.  It’s all there to help us understand our customers, predict their needs and wants, and serve them better by making the best possible business decisions every day.

We don’t use NASA satellites to get our big data – different problems take different data points – but the challenge of predicting the weather accurately sits in the same solution space as the challenge of predicting which products, services, and messages specific consumers need and will respond to best.  And just like the weather, understanding and predicting consumer behavior starts with the data. We need lots of good data, and we need to make sure it stays good, or the bad data will foul our predictions – and sink our business.

The “Vs” of Big Data

Whether you’re a NASA scientist or a data scientist working with consumer data, like those of us here at SpotRight, you need to understand and keep a close eye on the key characteristics of big data in order to maintain top-level quality.  Often those characteristics are listed by the “V” words – the “Vs” of big data. Some experts talk about the 5 Vs of big data, while others fold in additional characteristics to come up with 8, 9, or even 10. There isn’t a magic number, but for the concept of quality we at SpotRight focus on 8 characteristics in the big data we collect, prepare, and analyze for insights and audiences.  The “8 Vs” we concentrate on to ensure world-class quality are: volume, velocity, variety, variability, veracity, validity, volatility, and visualization.

Some like to include vulnerability and value in that list, but in my opinion they relate less to the quality of the data than to the quality of its use. Vulnerability is very important, but it affects corporate and private risk rather than data quality.  Value is really the end goal – business value. That’s not a characteristic of the data alone but of the analytics and systems employed to take advantage of it, along with how “democratic” the data is – how accessible the data and its insights are, and how useful they are to decision makers across your business.

A Deeper Look at the Big Data Vs 

Volume is probably the best known characteristic of big data. But volume alone doesn’t define big data, as aggregators of static historical data have been dealing with large volumes for decades. It’s the combination of volume and velocity that gives rise to what we call big data.  

Velocity refers to the speed at which data is being generated, produced, created, or refreshed.  To give you an idea of what that means in practical terms, SpotRight maintains a consumer graph with billions of characteristics and connections – the kind of volume needed to deliver effective analytic results – and the data behind it is generated so quickly that the graph requires constant updates just to keep up.

A quick look at internetlivestats.com gives us an idea of what I mean by velocity.  As I write this, in mid-afternoon, over 460 million tweets have been sent so far today, along with 3.9 billion Google searches, 3.7 million blog posts (I’ll be adding 1 to that number shortly), and 48.9 million photo uploads – and there are 339.9 million active Twitter users worldwide.

Volume and velocity demonstrate both the immense potential and the staggering challenge of big data – how do you keep up with the changes in the data while trying to figure out what it’s telling you?

Variety refers to the seemingly infinite forms that data takes. Those forms fall into three high-level categories: structured, semi-structured, and unstructured. Whether it’s audio, image, or video files, social media updates, log files, click data, machine and sensor data, or other text formats, it is typically some combination of the three.  Ingesting that data at the volume and velocity found in the consumer data world requires close attention to both the key attributes and the expected uses of the data. It’s critical that the analytic engines are able to reach the descriptive and predictive data points in ways that make it possible to effectively identify trends and predict future conditions and events.
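
To make that a bit more concrete, here’s a toy sketch of what reaching the right attributes across those three categories can look like at ingestion time: a structured CSV row, a semi-structured JSON event, and an unstructured text post, each reduced to the data points the analytics actually need. The formats, field names, and values are invented for the example.

```python
import csv
import io
import json
import re

def from_csv_row(line, header):
    """Structured: a delimited row with a known schema."""
    return dict(zip(header, next(csv.reader(io.StringIO(line)))))

def from_json_event(payload):
    """Semi-structured: nested JSON whose fields can vary by source."""
    event = json.loads(payload)
    return {"user_id": event.get("user", {}).get("id"),
            "action": event.get("type")}

def from_text_post(post):
    """Unstructured: free text; pull out whatever signal we can (here, hashtags)."""
    return {"hashtags": re.findall(r"#(\w+)", post)}

print(from_csv_row("123,seattle,premium", ["user_id", "city", "tier"]))
print(from_json_event('{"user": {"id": 123}, "type": "click"}'))
print(from_text_post("Loving the new trail shoes! #running #gearhead"))
```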

The next few “Vs” relate to the heart of good vs. bad data.  

Variability has three parts:

  1. The number of dimensions in the data caused by disparate data types and sources
  2. The inconsistent speed of updates to the database or graph
  3. The number of inconsistencies in the data  

Variability is a big culprit for bad data, and needs to be carefully managed by anyone collecting big data.  Build anomaly and outlier detection methods to find and fix inconsistencies between data points, and be sure that the data dimensions and update frequencies are each specifically accounted for in the data onboarding and preparation systems.
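
Here’s a minimal sketch of the kind of outlier check that can catch one common class of inconsistency before it reaches the graph. The field, values, and fences are hypothetical; a real onboarding pipeline would layer per-source rules and update-frequency checks on top of something like this.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].
    A simple, robust first pass; per-source rules and seasonal
    baselines would normally sit on top of this."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical household-income values merged from two sources,
# one of which appears to report in cents rather than dollars.
incomes = [48_000, 52_000, 54_000, 56_000, 58_000,
           60_000, 61_000, 65_000, 5_800_000]
print(iqr_outliers(incomes))  # [5800000]
```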

Veracity is the level of trust in the data.  NASA scientists have it easy – so to speak – since all they have to do is deal with rocket science. That’s nothing compared to social media! Kidding aside, it’s one of the unfortunate characteristics of big data that as any or all of the properties we’ve been talking about increase, confidence in the data drops.  Users rightfully question whether the data comes from reliable sources, how meaningful it is for the analytics, and what the context of the data is.

Be sure that you understand, or can get answers to, the questions that underpin the data’s veracity.  Businesses that provide you with analytic results need to be able to clearly tell you about their sources and data management methodologies, and explain why those make sense for the analytics you’re using.

Similar to veracity, validity refers to how accurate and correct the data is for its intended use. Data must be cleansed, organized, and made accessible before any analysis can be performed on it.  Data scientists working manually on data – for example, accessing a big Hadoop dataset through Solr – will typically spend 60% of their time preparing the data and only 40% actually applying analytics.  It’s hard to democratize data when it takes more than half of your key staff’s time just to get it ready to work on, let alone make sense of, and when it can only be understood by those with data science skills.  At SpotRight we’ve invested in data governance practices and systems that eliminate that waste and provide consistent data quality, common definitions, useful metadata, and valid data and analytics – with the speed it takes to ensure true widespread access and use of the insights for all users, not just the data scientists.  Our goal is to make data truly democratic, delivering results and business value throughout the organization. You can learn more about how to do that here.
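
As a simple illustration of the validation side of that work, here’s a sketch that checks incoming records against a handful of rules before they’re allowed into the analytic store. The fields and rules are made up for the example and stand in for much richer governance logic.

```python
from datetime import datetime

def is_iso_date(value):
    """True if the value parses as an ISO-8601 date/datetime string."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

# Hypothetical rules: field -> predicate that must hold for a record to be valid.
RULES = {
    "email":      lambda v: isinstance(v, str) and "@" in v,
    "age":        lambda v: isinstance(v, int) and 13 <= v <= 120,
    "updated_at": lambda v: isinstance(v, str) and is_iso_date(v),
}

def failed_fields(record):
    """Return the fields that fail validation; empty means the record is valid."""
    return [field for field, rule in RULES.items()
            if field not in record or not rule(record[field])]

records = [
    {"email": "pat@example.com", "age": 34, "updated_at": "2018-03-02"},
    {"email": "not-an-email",    "age": 34, "updated_at": "2018-03-02"},
]
clean = [r for r in records if not failed_fields(r)]
print(len(clean), "clean record(s); second record fails on:", failed_fields(records[1]))
```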

With the velocity and volume of big data, its volatility must be carefully considered. Clearly data isn’t valuable forever, but at what point does it become irrelevant or simply wrong? Retention rules have to apply to each source, and in some cases each data type, independently, so that storage and availability policies reflect a thoughtful consideration of exactly how long the data must remain readily accessible for the valuable analytics you’re depending on.
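
One way to express those rules is a per-source retention window that ages records out on a schedule matched to how quickly each feed goes stale. The sources and windows below are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows: how long a record from each source stays useful.
RETENTION = {
    "social_posts":     timedelta(days=90),
    "web_clicks":       timedelta(days=30),
    "purchase_history": timedelta(days=730),
}

def is_stale(record, now):
    """A record is stale once it outlives its source's retention window.
    Unknown sources fall back to the shortest window, to be safe."""
    window = RETENTION.get(record["source"], min(RETENTION.values()))
    return now - record["observed_at"] > window

now = datetime(2018, 6, 1)
click = {"source": "web_clicks", "observed_at": datetime(2018, 4, 1)}
print(is_stale(click, now))  # True: a click from two months ago has aged out
```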

Finally, visualization ensures the data tells a story, and that its story can be easily understood by users of different backgrounds and business functions. But how do you visualize the data so that it is easy to use? The key is to make sure the charts you choose are simple enough to tell their story at a glance, while containing enough behind-the-scenes complexity to make that story important and useful.

In many ways visualization is as important to the democratization of data as widespread access is.  If the complexity of the data covers up the message, the result may look pretty but lead to bad choices and misguided decisions.  Sometimes a tree map or a circular network diagram is the only way to visualize big data, but if the analytics can yield results easily understood through simple bar or line charts, you may find more people using the data more often to make more of their decisions. Devote the time to figure that part out, or invest in tools built by people who have done it well, to optimize the use of big data in your everyday decision making.
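
As a quick sketch of the “simple chart first” idea, here’s a bar chart built with matplotlib on made-up audience numbers: one chart, one message, readable at a glance.

```python
import matplotlib.pyplot as plt

# Hypothetical audience segments and their campaign response rates (illustrative only).
segments = ["Outdoor fans", "Home chefs", "Gamers", "Travelers"]
response_rate = [4.2, 3.1, 2.4, 1.8]  # percent

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(segments, response_rate)
ax.set_ylabel("Response rate (%)")
ax.set_title("Which segments respond best to the spring campaign?")
plt.tight_layout()
plt.show()
```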

Friends, Romans, Countrymen – Lend Me Your Data

Give me data – more and more data. That’s the cry of the analyst, who understands how to tell good data from bad, knows the importance of using only good data for analysis, and can cull out the bad.  But for the rest of us, more data can be good or it can be very, very bad. The apps and tools we use to pull insights from big data need to be very, very good at keeping the 8 Vs in shape; if they are, we can use the results they give us freely and without worry.  Put that kind of tool in your team’s hands and you’ve taken a huge step toward the democratization of the data you depend on.

In my next blog I’ll explore how AI is transforming the analytics we use today, in terms of both data exploration and decision-making. Data science and machine learning are helping us to unearth a goldmine of information which has far too often been lost in massive volumes of data.  

I’ll also explore the importance of people to the process. In analytics, where context is key, humans must remain at the heart of any data democratization approach, or inaccessibility is simply replaced by a lack of flexibility and comprehension.