The Small Data revolution is the real revolution

The growing capacity of individual users to process smaller data packages will drive the future of computing.

"What matters is therefore the amount of data that an average data geek can handle on their own machine - their own laptop," writes Rufus Pollock [EPA]

There is a lot of talk about Big Data at the moment. But these discussions miss a much bigger and more important picture: the real opportunity is not around Big Data but around Small Data. Not centralized “big iron” but decentralized data wrangling.

Not “one ring to rule them all” but “small pieces loosely joined”. I would argue that the next decade belongs to distributed models, not centralized ones; to collaboration, not control; and to Small Data, not Big Data.

The real revolution

Big Data smacks of the centralization fads we’ve seen in each computing era. The notion that there is more data than we can process – something that has, no doubt, been true every year since computing began – has been dressed up as the latest trend, complete with its own technology must-haves.

Meanwhile, a much more important story, the real revolution, is being overlooked: the mass democratisation of the means of access, storage, and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of Small Data.

What do I mean by Small Data? Let’s define it crudely as: the amount of data you can conveniently store and process on a single laptop or server.

Why a laptop? What’s interesting and new is the democratisation of data and the associated possibility of a large-scale, distributed community of data wranglers working collaboratively. What matters is therefore the amount of data that an average data geek can handle on their own machine – their own laptop.

Advances in computing, storage, and bandwidth have far bigger implications for Small Data than for Big Data. As should be clear from the above definition and from any recent history of computing, small and big are relative terms that change as technology advances.

In 1994, for example, a terabyte of storage cost several hundred thousand dollars. Today it costs less than a hundred dollars. Yesterday’s big is today’s small.

Recent technological advances have dramatically expanded the realm of Small Data, the kind of data that an individual can handle on their own hardware. In relative terms, this expansion has been far greater than the corresponding expansion of Big Data.

Today, working with significant datasets – datasets containing tens of thousands, hundreds of thousands, or even millions of rows – can be a mass-participation activity done by individuals or small organizations.
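
To make this concrete, here is a minimal sketch – in Python, using only the standard library – of the kind of analysis an individual can run on an ordinary laptop. The file and column names (spending.csv, department, amount) are purely illustrative; the point is simply that a few million rows stream comfortably through a single machine.

    import csv
    from collections import defaultdict

    # Sum spending per department from a CSV with a few million rows.
    # A file of this size streams through one laptop; no cluster required.
    totals = defaultdict(float)
    with open("spending.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["department"]] += float(row["amount"])

    # Print departments from largest to smallest total.
    for department, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{department}: {total:,.2f}")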

Some of the strongest evidence for this comes, in fact, from the realm of supposedly Big Data, where it turns out that much of what gets framed as Big Data isn’t actually so big.

For example, a paper by a team at Microsoft Research, aptly titled “Nobody ever got fired for buying a cluster”, showed that even at firms like Microsoft, Yahoo, and Facebook, most data analysis jobs could run on a single machine.

This means that, by our definition, most of this analysis is actually working with small – or at most “medium-sized” – data (and imagine what will count as small in a couple of years, when storage and computing power have doubled again).

The situation of Small Data today is similar to that of microcomputers in the late 1970s or the Internet in the 1990s. When microcomputers first arrived, they seemed puny in comparison to the “big computing” and “big software” then around, and there was nothing they could do that existing systems couldn’t.

They were, however, revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for decades – but it was at that point that it became available at mass scale to the average developer and, eventually, the average citizen. In both cases “big” kept on advancing too, whether with supercomputers or high-end connectivity – but the revolution came from “small”.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will, one day find it equally odd to talk of Big Data.

Size doesn’t matter. What matters is having the data, of whatever size, that helps us solve a problem or address the question we have – and for many problems and questions, Small Data is enough. The data on my household energy use, the times of local buses, and government spending is all Small Data. When Hans Rosling shows us how to understand our world through population change or literacy, he’s doing it with Small Data.

Just as importantly, Small Data is the right way to scale. Small Data scales up not through the creation of massive centralized silos but by partitioning problems in a way that works across people and organizations, by creating and integrating Small Data “packages”.
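
As an illustration of what such a package might look like in practice, here is a minimal sketch in Python that writes a small dataset alongside a descriptor file. The layout is loosely modelled on the Data Package convention (a datapackage.json sitting next to the CSV resources it describes); the dataset, field names, and values are invented for illustration.

    import csv
    import json
    from pathlib import Path

    # A minimal Small Data "package": one folder, one descriptor, one CSV.
    package = Path("bus-times")
    package.mkdir(exist_ok=True)

    # Illustrative data: local bus departure times.
    rows = [
        {"route": "42", "stop": "High Street", "departure": "08:05"},
        {"route": "42", "stop": "High Street", "departure": "08:35"},
    ]
    with open(package / "bus-times.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["route", "stop", "departure"])
        writer.writeheader()
        writer.writerows(rows)

    # The descriptor makes the package self-describing, so others can
    # find, integrate, and build on it without asking what is inside.
    descriptor = {
        "name": "bus-times",
        "title": "Local bus departure times",
        "resources": [{"name": "bus-times", "path": "bus-times.csv"}],
    }
    (package / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

Packages like this can be created, shared, and combined independently, which is what lets Small Data scale across people and organizations rather than inside a single silo.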

This Small Data revolution is just beginning. The tools and infrastructure needed for effective collaboration and rapid scaling with Small Data are in their infancy, and the communities with the capacity and skills to use it are still in their early stages. Those who wish to play their part in the revolution have the opportunity to get in on the ground floor.

Dr Rufus Pollock is the Founder and Director of the Open Knowledge Foundation, an international non-profit that works to empower organizations and citizens by opening up information and providing the tools and skills to turn that information into knowledge, insight and change.