Sam Hind

lecturer in digital media and culture at the University of Manchester, UK.

sam.hind@manchester.ac.uk

Big Data Problems

As I mentioned in a post a few months back, there are a few problems with mining Twitter for locational data, partly because the sample is small and less than representative. Related to this is an article on Wired today on big data and the ‘death’ of theory. Mark Graham, who is actually part of the floatingsheep collective, has this to say in it:

“I do get why people think that ‘big data’ will mean the end of theory, because you can now answer almost any conceivable question with large data sets and transactional data shadows, but irrespective of how big or complete our datasets are, they will always be selective and partial. We’re talking about a classic ‘if you have a hammer everything starts to look like a nail’ issue here.”

Or in other words, in reference to the original floatingsheep map I commented on, and from the same Wired article:

“not everyone tweets, and not everyone who tweets geotags their tweets. Even with the…contextual geotagging of tweets, that still leaves a sample of tweeters that isn’t absolutely everyone. It’s still a sample of ‘people with the capability and urge to tweet’.”

And so the issue of a small, unrepresentative sample remains. Not quite the takeover of big data just yet.
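
To make the point about selectivity a bit more concrete, here is a toy simulation. Everything in it is made up for illustration: the split between 'urban' and 'rural' people, the chance that each kind of person ends up as a geotagged tweeter, and the function name `simulate` are all my own assumptions, and none of it comes from Twitter or from the floatingsheep data. It only sketches how a selective sample can stay skewed however much data is collected.

```python
import random


def simulate(population_size, seed=0):
    """Return (sample size, share of 'urban' people in the sample).

    All rates below are hypothetical and chosen purely for illustration.
    """
    rng = random.Random(seed)
    urban_share = 0.5  # assumed: half of the population is urban
    # Assumed chance that a person ends up in the geotagged sample.
    p_in_sample = {True: 0.02, False: 0.005}

    sample_total = 0
    sample_urban = 0
    for _ in range(population_size):
        is_urban = rng.random() < urban_share
        if rng.random() < p_in_sample[is_urban]:
            sample_total += 1
            sample_urban += int(is_urban)

    share = sample_urban / sample_total if sample_total else float("nan")
    return sample_total, share


if __name__ == "__main__":
    # Collecting more data grows the sample but does not remove the skew.
    for size in (20_000, 200_000, 2_000_000):
        total, share = simulate(size)
        print(f"population={size:>9,} sample={total:>6,} urban share={share:.3f}")
```

Under these invented rates the urban share in the sample settles around 0.8 (0.01 / 0.0125), even though the assumed population is split evenly. A bigger dataset makes that figure more stable, not more representative.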

2013.01.25

Big data, floatingsheep, Mark Graham, The Semaphore Line