Gray data – what is it and why does it matter?

Dec 22, 2015 by Romulo Melillo

Gray… what?
You have almost certainly heard of dark data. Dark data is generally defined as data being collected during various operational activities – but not used. During our bi-weekly discussions with Gartner, it became clear to me that there is quite a gap between Active Data (email, documents, reports and so on) and Dark Data – data that is not used regularly by a business. In fact, it is often clear where ownership of both Active Data and Dark Data lies – with the customer.

Gray data is metadata held in the cloud which has been generated during the course of managing or manipulating normal data. It is characteristically difficult to access and export, and, unlike other data, ownership is not clear should a customer choose to migrate from or terminate the cloud service.

To take an example, an email contains intellectual property belonging to the originator, along with physical data stored on the server that makes up that email. It is clear where ownership of the actual email resides – with the customer. But when something is done to that email (such as it being ‘liked’, or processed in some way by a third party), additional metadata about it is generated.

Such metadata clearly exists and, because it contains information about the behavior of the individual or organization that generated the original item, questions arise around who actually owns that data. If this metadata is generated by and stored on a cloud service provider’s system, who really owns it? The user? The customer? The cloud vendor? It’s an issue that is not easy to resolve. This is gray data.
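
For readers who think in code, here is a minimal sketch of that distinction. It is purely illustrative: the class names, fields and the is_gray rule are assumptions made up for this post, not how Office 365 or any other cloud service actually models its data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    """Active data: the content and its owner are unambiguous."""
    subject: str
    body: str
    owner: str  # the customer who originated the email


@dataclass
class DerivedMetadata:
    """Hypothetical record created when something is done to an item."""
    action: str                  # e.g. "liked" or "archived"
    actor: str                   # who or what performed the action
    stored_by: str               # whose system holds the record
    owner: Optional[str] = None  # None means ownership is unresolved


def is_gray(meta: DerivedMetadata, customer: str) -> bool:
    """Assumed rule: gray means held outside the customer with no agreed owner."""
    return meta.stored_by != customer and meta.owner is None


email = Email(subject="Q4 plan", body="draft text", owner="customer")
like = DerivedMetadata(action="liked", actor="colleague", stored_by="cloud_vendor")
receipt = DerivedMetadata(
    action="archived", actor="admin", stored_by="customer", owner="customer"
)

# Only the 'like' qualifies: it lives on the vendor's system and nobody has claimed it.
gray_records = [m for m in (like, receipt) if is_gray(m, email.owner)]
```

The email itself never appears in gray_records. Only the record generated about it does, which is exactly where the ownership question starts.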

How is gray data generated?
Let’s begin with some simple – and innocuous – examples of gray data that you’re likely to have come across. Microsoft recently switched on the Clutter feature in Office 365. Clutter automatically filters what it suspects are low-priority emails out of your inbox in order to make you more productive. Your administrators can modify its settings, but the point is that Clutter is intended to take decisions for you. It is generating gray data.

There’s been a lot in the news about Safe Harbor and Facebook recently. Facebook users can, of course, request a dump of all their data for review. You can theoretically unlike all those things you once claimed to ‘like’ years ago, though it’s not easy. Conceptually we’re all struggling with the right to be forgotten, and somewhere there’ll be metadata pointing to you. Who owns that? Is it you? Facebook? The person or company that posted what you liked? Do you own ‘your’ likes? Is it exportable?

Let’s bring those two examples together. Have you heard about Microsoft bringing a ‘like’ button to emails? If you’re using Exchange Online, Outlook will let you ‘like’ specific emails and @mention other people. Outlook will notify you if someone likes one of your emails, and Microsoft claims that @mentions will be useful in grabbing attention in longer chain messages. These ‘likes’ and so on generate more gray data. Who owns it? Where is it even stored, and is it exportable? We’ll explore that in a future blog.

The implications of gray data
We briefed Gartner’s research director for data archiving and information governance on our view of gray data during one of our regular calls. Whilst gray data generated by the large vendors is unlikely to be nefarious or malicious, our point is that whenever a cloud provider offers tenancy or supplies services, new data ownership issues are raised that have barely begun to be addressed.

The characteristics of gray data include being:

  • Unregulated and unlegislated.
  • Out of scope in legal contracts with your service provider.
  • Unclassified.

Before the cloud, it was much more obvious what you owned. As an organization, you managed your emails on-premises. You owned the inboxes. When you sent an email, that email remained your property. Nothing was profiling your staff’s inboxes, making automated decisions on your behalf, and propagating that information to persons unknown. And, because you owned the hardware that all your infrastructure was running on, you typically owned everything in it.

Can gray data be migrated?
Every cloud service you might use is generating gray data to some degree, and right now it’s generally used for nothing more than driving user experience or other efficiency processes, and maybe some advertising. It’s also pretty secure (otherwise it would be easier for you to retrieve it). But more and more gray data is being retained, and over time it will get aggregated, filtered, and analyzed. What happens when that gray data becomes useful, even central to your Active Data, and you are unable to migrate it?

You’ll have signed a contract with your cloud provider that specifies who is responsible for your data. The trouble is that the rules of the game keep changing – and fast. What might have been an unthinkable practice at the outset might be mainstream in a year or two. The regular algorithmic changes you see rolled out by the Facebooks and Googles of this world, which in turn generate more gray data, are just the tip of the iceberg.

And one of the great benefits of services like Office 365 is portability. You are free to terminate your contract and migrate everything to a new service provider if you choose. But can you really get back everything? Does your organization need a mechanism for managing what the cloud provider might retain? What is the provider allowed to retain after migration, and who is checking that it complies with those requirements?

As social media concepts become more deeply embedded in the business world, these are questions that need to be addressed. And as email archive migration specialists, we at Quadrotech are taking a close interest in how things evolve.

In a future post, we’ll take a look at some possible solutions for managing gray data.