Big Data Problems

As I mentioned in a post a few months back, there are a few problems with mining Twitter for locational data. Partly, the problems come down to an unrepresentative sample. Related to this is an article on Wired today on big data and the ‘death’ of theory. Mark Graham, who is actually part of the floatingsheep collective, has this to say in it:

“I do get why people think that ‘big data’ will mean the end of theory, because you can now answer almost any conceivable question with large data sets and transactional data shadows, but irrespective of how big or complete our datasets are, they will always be selective and partial. We’re talking about a classic ‘if you have a hammer everything starts to look like a nail’ issue here.” 

Or in other words, in reference to the original floatingsheep map I commented on, and from the same Wired article:

“not everyone tweets, and not everyone who tweets geotags their tweets. Even with the…contextual geotagging of tweets, that still leaves a sample of tweeters that isn’t absolutely everyone. It’s still a sample of ‘people with the capability and urge to tweet’.”

And so the issue of a small, unrepresentative sample size remains. Not quite the takeover of big data just yet.

Mapping racist tweets

I’m a little late to the game here, but the ever-popular floatingsheep blog smashed their previous daily page view high with this post entitled ‘Mapping Racist Tweets in Response to President Obama’s Re-election’ 10 days ago. I don’t think I need to explain what it’s about; it’s pretty self-explanatory.

It drew a lot of comments, understandably, from people questioning the small sample size (395 ‘hate’ tweets in total), the search terms (‘monkey’ OR ‘nigger’ AND the text ‘Obama’ OR ‘reelected’ OR ‘won’), the exclusion of racist tweets towards Romney (rectified by floatingsheep, here), the geolocation of tweets (2-5% of all tweets), the possibility that particular search terms were used in a positive sense (for example, ‘nigger’), and the mapping of racist tweets as opposed to tweeters (the results could include multiple tweets from the same individual).

They’re all now included in a FAQ section, here, in response to the many questions floatingsheep received about their choice of method. For me, it says a lot about people’s ability to pick holes in ‘scientific’ method. Although the comments started to get a little wild, they did at least open the door for a response from the floatingsheep team, clarifying the methods they used. It really isn’t a representative sample of the American Twitter population, let alone the American population; let’s get that clear. 395 tweets is so far removed from a representative sample that drawing any conclusions from it (‘the south is racist’) is naive at best and dangerous at worst. I think floatingsheep know that. Still, it says a lot about the pitfalls of mapping tweet data, because there are just so many removals from the population at large. In this case:

NOT people in USA


people in USA tweeting


people in USA tweeting racist comments about Obama


people in USA tweeting racist comments about Obama with geo-location activated


people in USA tweeting particular racist comments about Obama with geo-location activated during a 7-day period at a specific time as searched for in a built database

That’s 4x removed from the group that, I would argue, most people think this data represents: the people of the USA. Although maybe it’s only 3x removed, because I’d like to think most people have the intelligence to see that it is, at most, feasibly representative of those on Twitter (it doesn’t take much to realise there are more young people than old people on Twitter, whatever that may mean). There’s a danger in not making this patently clear to people.
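
To get a rough feel for just one of those removals, here is a back-of-envelope sketch in Python. It takes the two figures mentioned above (395 geotagged ‘hate’ tweets, and geotagging on roughly 2-5% of all tweets) and scales the count back up. The function name is my own invention, and the big assumption is that racist tweets are geotagged at the same rate as tweets in general, which nobody has shown.

```python
def estimate_total_matching(geotagged_count, geotag_percent):
    """Scale a geotagged count up to an estimated total of matching tweets.

    ASSUMPTION: the tweets of interest are geotagged at the same rate as
    tweets overall. This is a guess, not something the source establishes.
    """
    if not 0 < geotag_percent <= 100:
        raise ValueError("geotag_percent must be in (0, 100]")
    return round(geotagged_count * 100 / geotag_percent)


# Figures quoted in the post: 395 geotagged tweets, 2-5% of tweets geotagged.
GEOTAGGED_HATE_TWEETS = 395
lower_estimate = estimate_total_matching(GEOTAGGED_HATE_TWEETS, 5)
upper_estimate = estimate_total_matching(GEOTAGGED_HATE_TWEETS, 2)

print(f"Estimated matching tweets, geotagged or not: "
      f"{lower_estimate} to {upper_estimate}")
```

Under that assumption the 395 mapped tweets would stand in for somewhere between 7,900 and 19,750 matching tweets, so the map only ever sees the 2-5% that happened to carry a location. And that is before any of the other removals in the list above are accounted for.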