At O’Reilly publishing, Alistair Croll makes the argument that it’s a fight we can’t avoid any longer:
In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. [...]
With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
And this is a dangerous thing.
Go read the whole thing. Because in light of stories like this one, about Minnesota police collecting and storing data indiscriminately for all vehicles they pass on public roads, and this one about Tiburon, California which records the plates of every single car entering and leaving the city, makes you wonder just where it all will end.