Luke D. Gessler

Thoughts from ICLDC6 and ComputEL-3

Today's the end of a packed week I've spent at the University of Hawaiʻi at Mānoa presenting at ComputEL-3 and attending ICLDC6. The papers were fantastic and often vivid (particularly memorable for its imagery was a presentation given by a marine biologist and his collaborators on ethnobiological work carried out deep in rural Papua New Guinea). What follows is a collection of mostly unrelated reactions to some of what I saw at the two conferences.

1 Decolonizing language technology

There were quite a few presentations touching on the theme of how colonialist ideologies surface in language technology. I remember the workshop given by Wesley Leonard, Megan Lukaniec, and Adrienne Tsikewa, and the paper given by Atticus Harrigan, Jordan Lachler, and Antti Arppe, though there may have been others as well.

If I had to summarize the thesis here, I would maybe paraphrase what I remember Wesley Leonard saying: there is linguistics, broadly construed, and there is Linguistics, whose theories and methods conform to colonialist ideas about epistemology and authenticity.

This can come out in obvious ways, but there were many fine examples of subtler cases. A representative example is the problem Atticus Harrigan and his collaborators presented on: how to address the needs of the Plains Cree-speaking community in their dictionary.

If I remember correctly: writing Plains Cree is difficult for two reasons. The first is that there are dialects of Plains Cree that differ in how words are pronounced. The second is that even within a single dialect of Plains Cree, there is considerable variation in how speakers spell words, owing (IIRC) in part to difficulties with the standard orthography. The word kîkway (IPA: [kiːgwaj]) is spelled as such in the standard orthography, but a very common orthographic variant is kigway: the diacritic for vowel length has been lost, and the k that actually corresponds to [g] phonetically has been spelled as a g. Understandably so: that is how it sounds!

At this point, many linguists would be tempted to say something like "Well, the normative orthography is what it is, so users will either have to spell it right, or reform their orthography." And the point is that this lack of interest in even trying to accommodate interdialectal variation, combined with the normative elevation of the spelling conventions of a particular dialect, is, if I have understood correctly, colonialist.

In the case of the Plains Cree dictionary, the solution, and the topic of Harrigan et al.'s work, was to use a finite state transducer to accept the most common kinds of orthographic variation.
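To make the mechanism concrete, here's a minimal sketch of the general idea. This is not Harrigan et al.'s actual system, and the relaxation rules are simplifications I've chosen for illustration: treat common variant patterns (dropped length diacritics, voiced stop letters written for voiceless ones) as rewrite rules, and match queries against headwords in the relaxed space.

```python
# A minimal sketch of orthographic relaxation for dictionary lookup.
# This is NOT Harrigan et al.'s FST; it just illustrates the idea of
# mapping common spelling variants onto standard headwords by
# normalizing both sides into a shared "relaxed" form.

# Illustrative relaxation rules for Plains Cree spelling variation:
# 1. drop vowel-length diacritics (kîkway -> kikway)
# 2. merge voiced stop letters into their voiceless counterparts,
#    since e.g. [g] is written <k> in the standard orthography
LONG_VOWELS = str.maketrans("âêîô", "aeio")
VOICED_STOPS = str.maketrans("bdg", "ptk")

def relax(form: str) -> str:
    """Map a spelling onto its relaxed (variation-insensitive) form."""
    return form.lower().translate(LONG_VOWELS).translate(VOICED_STOPS)

def build_index(headwords):
    """Index standard headwords by their relaxed forms."""
    index = {}
    for hw in headwords:
        index.setdefault(relax(hw), []).append(hw)
    return index

def lookup(index, query: str):
    """Return the standard headwords a (possibly nonstandard) query matches."""
    return index.get(relax(query), [])

index = build_index(["kîkway"])
print(lookup(index, "kigway"))  # ['kîkway']: diacritic loss and k/g both tolerated
```

A real system would presumably use a proper finite state toolkit (foma and HFST are the usual suspects), which can express context-sensitive rules and compose them with a morphological analyzer, but the relax-and-match idea is the same.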

2 Automating documentation: transcription

The NLP community, for reasons I'm still trying to understand, has gotten very interested in automating transcription. Transcription, which you can think of as a much more general kind of "subtitling" of audio or video, is extraordinarily laborious to do manually, and in language documentation it is often the bottleneck in the pipeline that takes linguistic data from collection to deposit in an archive. Graham Neubig et al., Daan van Esch et al., and Christopher Cox et al. all presented on this and seem to be the ones to watch for progress in this task.

I think it's worth noting that even when progress has been made on this task, there will still need to be application software that allows even the less technically experienced end users to actually benefit from advances in machine learning.
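For illustration, here's roughly what the automation half might look like, using one modern off-the-shelf toolkit (Hugging Face's transformers) as a stand-in; the model name and directory layout are my own assumptions, not anything from the conference. Note that the output is a draft for a human transcriber to correct, not a finished transcript.

```python
# A minimal sketch of ASR-assisted transcription: run a pretrained
# speech recognition model over a directory of field recordings and
# write draft transcripts for a human transcriber to correct.
from pathlib import Path

from transformers import pipeline  # pip install transformers

# Any pretrained ASR model could be slotted in here; a model trained
# or fine-tuned on the target language would of course do far better
# than this English one.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

for wav in sorted(Path("recordings").glob("*.wav")):
    draft = asr(str(wav))["text"]
    out = wav.with_suffix(".draft.txt")
    out.write_text(draft + "\n", encoding="utf-8")
    print(f"{wav.name}: wrote draft transcript to {out.name}")
```

Even with a script like this, the draft still has to find its way into the tools transcribers actually work in, such as ELAN, which is exactly the application-software gap I mean.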

3 Reusable and communal software: why are there so many dictionaries?

There were no fewer than ten presentations on the topic of making a dictionary for a community. These communities have made incredible progress, but many of them would have been helped considerably on their way by free and open source, general-purpose dictionary software. Instead, most of the time, these dictionaries were developed with bespoke software written by a contract software developer. Why has such a common language development task still not seen a free and open source, general-purpose software project succeed?

For most application software in the area of language documentation and revitalization, the lack of a general-purpose solution that has gained majority use is easier to explain, since requirements can vary quite a lot across communities and contexts. To be sure, dictionaries can also sometimes differ greatly in their requirements, but for most languages, it seems that most of the requirements for dictionary software would be shared.
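As a back-of-the-envelope sketch of what I mean (my own speculation, not anything from a presentation), here's the core data model that most dictionary projects seem to share regardless of language. If this much really is common, a general-purpose tool ought to be able to cover it.

```python
# A speculative sketch of a language-agnostic dictionary entry schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Sense:
    gloss: str                      # definition in the contact language
    examples: list[str] = field(default_factory=list)

@dataclass
class Entry:
    headword: str                   # citation form in the standard orthography
    pos: str                        # part of speech
    senses: list[Sense]
    variants: list[str] = field(default_factory=list)  # alternate spellings
    audio: Optional[str] = None     # path/URL to a pronunciation recording

# Reusing the Plains Cree example from above (gloss approximate):
entry = Entry(
    headword="kîkway",
    pos="pronoun",
    senses=[Sense(gloss="what; something")],
    variants=["kigway"],
)
print(entry.headword, "-", entry.senses[0].gloss)
```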

I'm still puzzled. Some potential explanations:

  1. Dictionaries, contrary to my assumptions, actually do vary so much across implementations that it is difficult to make a general-purpose dictionary app that would be useful.
  2. The developers who undertake dictionary making projects are too busy or not interested in generalizing the software they make.
  3. The communities who pay developers for dictionary making projects are unable/unwilling to pay developers for the extra work.
  4. Open-source dictionary software actually does exist, but communities don't know about it.
  5. Open-source dictionary software actually does exist, and communities know about it, but they have chosen not to use it (because of poor quality, etc.).

Whichever of these it is, it's worth noting that at least one dictionary project has made it a goal to produce a general, reusable piece of software: the one underway by Rebecca Everson, Wolf Honore, and Scott Grimm.