Monday 20 October 2014

Think Outside the Lab


This brings me full circle back to the initial post with which I launched this blog. My communicative purpose, and challenge, was to work within an unlikely Venn diagram of three disciplines that never seem to collaborate: computer science (developers, information scientists, and engineers), linguistics (including cognitive linguistics, but really all language-oriented studies), and political science (policy-oriented, and often practitioners such as human rights activists or crisis response managers). Pairs of these disciplines can be found teaming up, but an effort combining the insights of all three is, sadly, very rare indeed.

A recent MIT project by Berzak, Reichart, and Katz pursued the hypothesis that structural features of a speaker's first language transfer into written English as a Second Language (ESL) and can be used to predict the speaker's first language. I believe their work does not go far enough in two respects: first, in its sampling; and second, in considering how parsing communication in this manner might be applied to software design.

Their paper addresses the sampling problem as one of resources. They acknowledge that there are over 7,000 languages, but a written corpus (their data pool) exists for only a relative few. My critique, however, is that they treat all languages as members of the same sample set for their experiment. Katz explains in an interview that he was drawn to investigate and algorithmically describe the 'mistakes' made by Russian speakers in English. These mistakes are called linguistic transfer because an element of the first language is transferred into the second. (Reverse transfer can happen as well, when a new language affects the first.) Linguistic transfer comes in several forms: phonological/orthographic (mistakes of sound or spelling), lexical/semantic ('false friends'), morphological/syntactic (grammar mistakes), sociological/discursive (such as appropriateness or formality), and conceptual (categories, inferences, event elements, and concepts generally). Had Katz's group differentiated between types of mistakes, they might have improved the rate of prediction success in their results. It is also unclear whether their model could incorporate the more complex types of transfer, such as discursive or conceptual. One reason they may not have seen the need to differentiate by type was their limited sample.
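To make the point concrete, here is a minimal sketch of what differentiating mistakes by transfer type might look like computationally. Everything in it (the language profiles, the error labels, the overlap rule) is hypothetical and invented for illustration; it is not the MIT group's model, only one way typed errors could feed a first-language predictor.

```python
from collections import Counter

# Hypothetical training profiles: counts of (transfer_type, error_marker)
# pairs observed in ESL writing, keyed by the writer's first language.
# These pairs are illustrative only, not real corpus findings.
PROFILES = {
    "Russian": Counter([
        ("morphosyntactic", "missing_article"),
        ("morphosyntactic", "missing_article"),
        ("lexical", "false_friend"),
    ]),
    "Spanish": Counter([
        ("phonological", "spelling_by_sound"),
        ("lexical", "false_friend"),
        ("lexical", "false_friend"),
    ]),
}

def predict_l1(typed_errors, profiles=PROFILES):
    """Guess the first language whose error profile best overlaps
    the typed errors detected in a new ESL text."""
    observed = Counter(typed_errors)
    def overlap(profile):
        # How many of the observed typed errors this profile accounts for.
        return sum(min(count, profile[feature])
                   for feature, count in observed.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```

Because each error carries its transfer type, a conceptual-level mistake and a spelling mistake count as different evidence, which is exactly the differentiation the paragraph above argues for.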

Most linguistics studies (or research asserting multilingual or multicultural value) that purport to incorporate a broad range of languages in fact draw from a small number of closely related languages without particularly profound differences in the conceptual organization of information. That means that if there were instances of conceptual transfer, they would be rare, or at least difficult to detect. (Most studies look at Indo-European languages, plus perhaps Russian, Hebrew, Korean, or Japanese, to appear to have real diversity.) Among the nearly 7,000 languages, only 100 or so have a literature, and it is this group of languages that is most frequently studied. These languages, therefore, have a strong history of and preference for writing (called chirographic), and this mode of communication has shaped many cognitive processes within the populations that speak them.

The rest of the 7,000 are predominantly oral, and oral languages are very rarely represented in the sample sets of any study. Orality is not to be confused with illiteracy; it is a preference in communication, and most speakers of predominantly oral languages also speak and operate in chirographic languages. The impact on cognitive processes such as categorization, problem solving, ordering for memory, imagination, and memory recall is connected to a need to rely on sound and associated mnemonics for organizing information. If you cannot write something down, your strategy changes for remembering it, for working through a problem, and for any number of other cognitive processes. Linguistics studies that fail to include a member of this set of predominantly oral languages make an egregious sampling error, one that leads to false conclusions about universal or easily modeled qualities of communication. Orality is a profound variable in terms of its effect on cognitive processes.
That is why investigating and describing communication at a conceptual level, drawing on languages much more distant from the typical English baseline, would yield some surprising results.

The second problem with the MIT study is one of anticipating a use for the findings.  Quoting from the press release about their work:
"These [linguistic] features that our system is learning are of course, on one hand, of nice theoretical interest for linguists,” says Boris Katz, a principal research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory and one of the leaders of the new work. “But on the other, they’re beginning to be used more and more often in applications. Everybody’s very interested in building computational tools for world languages, but in order to build them, you need these features. So we may be able to do much more than just learn linguistic features. … These features could be extremely valuable for creating better parsers, better speech-recognizers, better natural-language translators, and so forth." (L. Hardesty for MIT news office 2014)

Yes, so true. However, it is not merely theoretical, nor is it simply the folly of linguists to pursue communication variation at a conceptual level. Using conceptual frames (Minsky, 1974) has already proven effective: Chengyang et al. (2009) improved the search capability of a map tool by shifting it to operate from conceptual frames rather than conventional English search terms.
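For readers unfamiliar with the idea, a Minsky-style frame can be sketched as a structure of named slots with defaults that a query fills in, rather than a bag of keywords. The slot names below are invented for illustration and are not taken from Chengyang et al.'s system:

```python
from dataclasses import dataclass, field

@dataclass
class CrimeIncidentFrame:
    """A hypothetical conceptual frame for a crime-related map query."""
    event_type: str = "unknown"        # e.g. "robbery"
    location: str = "unspecified"      # place name or coordinates
    time_of_day: str = "unspecified"   # e.g. "night"
    participants: list = field(default_factory=list)

    def matches(self, query_slots: dict) -> bool:
        """True if every slot the query fills agrees with this frame."""
        return all(getattr(self, slot) == value
                   for slot, value in query_slots.items())

# A stored incident that conceptual queries are matched against.
incident = CrimeIncidentFrame(event_type="robbery", location="downtown")
```

A query such as `{"event_type": "robbery"}` matches this incident while `{"location": "uptown"}` does not; the point is that matching happens on concepts (slots), not on the surface wording of the English search string.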

Katz and his lab at MIT are credited with the work that led to Siri, and this new study could be applied to machine language tools so that patterns of mistakes become predictable, and thus correctable. It could also be added to text-scanning tools to detect the first language of non-native English authors on the web, thus adding to the mass surveillance toolkit.

I think a lot more could be done, but has not been, with this methodology in terms of describing an oral language's conceptual frames and then using them to calibrate a more responsive information and communication application. I used a methodology very similar to the MIT researchers' in my experiment on reverse linguistic transfer last year (which I have been charting on this blog). I compared sets of bilingual narratives and looked for patterns of 'mistakes,' but I was interested in what these mistakes could tell us about the communication needs of the users (mobile technology users in rapidly growing markets in Africa, South East Asia, and South America). My hypothesis was that the structures of their first language were being distorted, converted into 'mistakes,' in order to fit a prescribed (foreign) conceptual structure of the software application. What I found was much more complex than counting instances of mistakes.

What I observed, and quantified, was that when comparing each participant's first-language oral narrative to their first-language narrative delivered via mobile report (either as an SMS or as a smartphone app question series), three quarters of the participants expressed dramatically different narratives in the mobile report format. That means that translating interfaces is not sufficient to provide communication access. There are underlying conceptual aspects of communication that have yet to be addressed and that are inherently cultural (currently mono-cultural). Due to the complex nature of concepts such as justice, personhood, time, or place, identifying and isolating instances of transfer was very challenging. A summary of the results is forthcoming in two papers as well as my doctoral research, but the main conclusion I reached was that conceptual-level parsing of communication, informed by insights from oral languages, should be integrated into the design of communication and information management software. Including this variable in indigenous software design will increase the ability of users in rapidly growing markets to participate with, and leverage, information and communication technology in a manner that meets their needs.
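The kind of narrative divergence described above can be illustrated, very roughly, by comparing the sets of concept codes two versions of a narrative express. The concept labels and the 0.5 threshold below are hypothetical placeholders, not my actual coding scheme, which is detailed in the forthcoming papers:

```python
def concept_overlap(codes_a, codes_b):
    """Jaccard similarity between the concept codes of two narrative versions."""
    a, b = set(codes_a), set(codes_b)
    if not (a or b):
        return 1.0  # two empty narratives are trivially identical
    return len(a & b) / len(a | b)

# Hypothetical coded narratives: oral telling vs. SMS report of the same event.
oral_codes = {"actor", "place", "communal_time", "obligation", "kin_relation"}
sms_codes = {"actor", "place", "clock_time"}

similarity = concept_overlap(oral_codes, sms_codes)  # 2 shared of 6 total
dramatically_different = similarity < 0.5  # illustrative cutoff only
```

Even this toy measure shows the shape of the finding: the mobile format keeps the shared anchors (actor, place) but drops or replaces the concepts that do not fit its prescribed structure.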

This topic will be continued with highlights from forthcoming publications.

References:
Chengyang, Z., Yan, H., Rada, M. and Hector, C., 2009. A Natural Language Interface for Crime-Related Spatial Queries. In: Proceedings of IEEE Intelligence and Security Informatics, Dallas, TX, 2009.

Jarvis, S. and Crossley, S. 2012. Approaching language transfer through text classification: Explorations in the detection-based approach. Multilingual Matters, volume 64.

Jarvis, S. and Pavlenko, A. 2007. Crosslinguistic Influence in Language and Cognition. London; New York: Routledge.