Friday 25 April 2014

The Blind Spot for Big Data

The New York Times has been doing a series of pieces on the uses and limitations of Big Data. While I do not focus on Big Data specifically, I study some of the ways we collect it, so I am interested in the downstream implications once it is aggregated. How could small distortions at the scale I study become much larger?

Since I study conflict, Somini Sengupta's piece 'Spreadsheets and Global Mayhem' certainly caught my eye. The title of this opinion piece, about all the ways we are trying to mine data for conflict prevention, pits 'spreadsheets,' a feeble and not very advanced technology for organizing information, against 'global mayhem' (for me it evokes Microsoft Excel battling the Palestinian-Israeli conflict). The title conveys the incongruence of strategies centered on Big Data. Collecting information and aggregating it is not enough. The sheer weight of it feels powerful; surely the answers must be in there somewhere? But finding patterns, asking the right questions, and building really good models from complex information such as communications data (much of it translated)... that's a long way off. We don't really know what to do with what we have, and we don't really know what the answers from the models we build mean. That's where I think we are. Most marketing firms, with their sentiment analysis, would vehemently disagree. And the conflict-prediction projects Sengupta references, such as the GDELT Project and the University of Sydney's Atrocity Forecasting Project, certainly believe fortune telling is within our digital grasp.

Another piece, 'Eight (No, Nine!) Problems With Big Data' by Gary Marcus and Ernest Davis, addresses some of these issues, including translation. They remind the reader how often collected data has been 'washed' or 'homogenized' by translation tools such as the ubiquitous Google Translate; because of this, the original data may appear several times over in new forms. There is a growing industry of writing about the flaws of Big Data, and the debate has made many who work within the field weary or intensely frustrated, because it is fueled largely by popular misunderstandings of a very complex undertaking.
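To make the point concrete, here is a minimal sketch of how translated variants of the same report can slip past naive deduplication and get counted as separate events. The headlines are invented for illustration, standing in for machine translations of a single underlying story:

```python
from difflib import SequenceMatcher

# Hypothetical headlines: three machine-translated renderings of one report.
headlines = [
    "Protesters gather in the capital after election results",
    "Demonstrators gather in capital city following the election result",
    "Protesters gathered in the capital after the election results",
]

# Naive exact-match deduplication sees three distinct "events".
print(len(set(headlines)))  # 3

# A fuzzy comparison shows the records are near-duplicates of each other.
for i in range(len(headlines)):
    for j in range(i + 1, len(headlines)):
        ratio = SequenceMatcher(None, headlines[i], headlines[j]).ratio()
        print(i, j, round(ratio, 2))
```

One translated story becomes three data points; aggregate that over millions of records and the counts a model trains on are quietly inflated.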

From my perspective, there remains a giant blind spot: what I call the invisible variable of culture. Most acutely, this involves the languages now coming online, the languages spoken in regions experiencing a tech boom. Individuals in these areas must either participate online and over mobile communication technology in a European language or muddle through a transliteration of their own local language, which will not be part of this Big Data mining. My research looks at the distortions in the narratives they produce in both instances. The distortion over computer-mediated communication such as SMS or smartphone apps, which compartmentalize narrative, is a problem of how we organize what we want to say before we say it. This pre-language process varies by culture and structures how we connect information such as sensory perception. At the moment, our technology primarily reflects one culture's notion of how to connect information and how to organize it conceptually. This has implications both for how information technology collects data and for how questions about that data are posed and understood.

What if other cultures have a fantastically different concept of organizing information?  How do you know the data you've collected means what you think it means?

Consider a mathematical analogy: your arithmetic is base 10, but another group might count in base 12 or base 2. When you collect their numbers and analyze them in base 10, the figures make sense to you, yet they no longer mean what they meant originally.
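Here's a minimal sketch of that analogy in Python. The digit string is arbitrary; the point is only that identical symbols decode differently depending on the base you assume:

```python
# The digit string "24" as written by someone counting in base 12.
digits = "24"

as_base_10 = int(digits, 10)  # what a base-10 analyst records: 24
as_base_12 = int(digits, 12)  # what the writer meant: 2*12 + 4 = 28

print(as_base_10, as_base_12)  # 24 28 -- same symbols, different meanings
```

The analyst's spreadsheet looks perfectly tidy, and every value in it is wrong.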

We haven't cracked the code yet of how to incorporate a variable like culture into software applications. It's more than translation. It's not as simple as word replacement. It's deeper than that. It's context. It's at the level of concepts and categories, the way we see things before we use language. That's not to say we can't unravel these things with algorithms, but those algorithms are often based, even unconsciously, on our own understanding of communication. And there is massively insufficient research on most languages out there. Of the roughly 6,800 human languages, Evans and Levinson (2009) estimate that:
Less than 10% of these languages have decent descriptions (full grammars and dictionaries). Consequently, nearly all generalizations about what is possible in human languages are based on a maximal 500 language sample (in practice, usually much smaller – Greenberg’s famous universals of language were based on 30), and almost every new language description still guarantees substantial surprises.
And the languages within tech-boom regions such as Africa and Southeast Asia are certainly part of this knowledge void. We aren't prepared to collect this data yet. The data we do collect are basically shoehorned into a format meant for English and for Western concepts (like our notions of cause and effect, or even time). Data from these language groups, including the usage patterns behind the flu or pregnancy predictor algorithms we've read about, won't be any good without further cultural adaptation, as the sketch below suggests. And when it comes to crunching the data, we have a lot to learn about asking context-specific questions and understanding the answers from a non-Western framework. (My own research results have shown me it's the difference between thinking you've identified a victim or a villain.)
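As a toy illustration of the shoehorning problem, imagine an English-keyword 'flu tracker' run over posts that mix in Swahili. The posts are invented; the Swahili words for fever and flu are real, but nothing here reflects any actual predictor system:

```python
# English-only keyword matching, the kind of shortcut a quick-and-dirty
# flu predictor might rely on.
FLU_KEYWORDS = {"flu", "fever", "cough"}

posts = [
    "feeling awful, think i have the flu",
    "nina homa kali leo",          # Swahili: "I have a bad fever today"
    "mafua yamenishika wiki hii",  # Swahili: "the flu has caught me this week"
]

# Count posts containing at least one English flu keyword.
hits = [p for p in posts if FLU_KEYWORDS & set(p.split())]
print(len(hits), "of", len(posts))  # 1 of 3 -- two relevant posts are invisible
```

The tracker doesn't report an error; it simply never sees two-thirds of the signal, and whatever trend it detects reflects only the English-speaking slice of the population.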

While not yet widely understood, these cultural differences in the Big Data story are a dazzling challenge to consider.


The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day. (Source: http://gdeltproject.org)