Tuesday, August 1, 2017

ELAN: making tier(s) out of search results

Hedvig in her office in Canberra figuring this out
and writing this guide.
Here is another guide for how to do something practical in ELAN. Previously, we relayed Eri Kashima's guide for sensible auto-segmentation with PRAAT and ELAN (time saver!). (For all posts about fieldwork on this blog, see this tag.)

This time: how to take your search results and make the matching annotations into new separate tier(s). This is useful if, for example, you want to cycle through only the annotations that match a certain search query in transcription mode. This post has a longer guide, and a short guide at the end.

For those who don't do a lot of transcription: ELAN (EUDICO Linguistic Annotator) is a program from TLA at MPI-Nijmegen. This program allows us to easily annotate audio and/or video files with lots of relevant data. We can use ELAN to count things, but we can also export as CSV-files for analysis later (Excel, R, Libreoffice etc). ELAN is free and great. If you ever need to do transcription, do it in ELAN. Do not create long text-documents with no linking to the audio, it is just ridiculous. Download ELAN here.

Version of ELAN: 4.8.1 (to my knowledge though this should work the same for other versions)

We're going to:
  • search in a clever way
  • export those results
  • import them as new tier(s) into the .eaf-file you're working on
  • thus creating a tier with a defined subset of other existing tiers, making work speedier on targeted parts of your corpus
You can click the images for larger versions.

Example case
I've got a transcribed file where I've noticed some different pronunciation of a certain word. I'd like to pick out only the annotations containing that word, make a new tier with only them, and write down some clever things about this word in that tier. I don't want to have to scroll through all annotations to get to only these.

I work on Samoan, and the word I'm looking at means "to tell/explain": fa'amatala. "Fa'amatala" is the dictionary entry for this word, but it varies in pronunciation in actual speech. I've asked my transcription assistant to mark down vowel length and presence and absence of glottal stops (as opposed to more orthographic transcription). She has done this pretty consistently (as far as I can tell, it's hard to hear glottal stops sometimes), and since I know what kind of variations to expect I can easily find the instances for this word. Due to t and k-style (lects in Samoan) and speed these are the variations we can expect:
  • fa'amatala
  • fa:matala
  • famatala
  • fa'amakala
  • fa:makala
  • famakala
Besides the obvious difference in pronunciation, I've noticed something unusual going on in the realisation of t/k, sort of like an affricate. So, I'd like to listen to all instances of this word with all these spellings and make notes of that.

Here are the steps. At the end is a short guide for when you've started to get the hang of this but need basic guidance.

Step 1) clever searching
In ELAN we can search for simple words, but we can also do something a bit more clever: we can search using regular expressions. Now, you don't need to have a complicated query or know all regex magic to make use of this. In this case, we're simply going to use the 'OR'-function. 'OR' in regular expressions is expressed by the vertical line/pipe character: "|" .

So, I'm searching for "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala" in the tier marked "transcription". No need for bracketing, asterisks or anything like that in this case. If you want to do more complicated things with regular expressions, I highly recommend this guide and cheat sheet for regular expressions in ELAN by Ulrike Mosel*.
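If you'd like to try the query outside ELAN first, here is a minimal sketch in Python. It assumes that the "|" alternation behaves the same way in Python's re module as in ELAN's search for a simple query like this one. The compact pattern is my own rewrite (not something ELAN requires), and the last sample line is a made-up non-match.

```python
import re

# The six spellings from the post, joined with the regex OR operator "|".
PATTERN = re.compile(r"fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala")

# A more compact pattern intended to match the same six spellings
# (an illustrative rewrite, not part of the original query).
COMPACT = re.compile(r"fa(?:'a|:)?ma[tk]ala")

SPELLINGS = [
    "fa'amatala", "fa:matala", "famatala",
    "fa'amakala", "fa:makala", "famakala",
]


def matching_annotations(annotations):
    """Return only the annotations that contain one of the spellings."""
    return [a for a in annotations if PATTERN.search(a)]


sample = [
    "fa:makala loa le!",
    "mafai ona e fa'amatala i a'u",
    "talofa lava",  # made-up line that should not match
]
hits = matching_annotations(sample)
print(hits)
```

The long version with "|" is easier to read and to edit later, which is why it is the one used in this guide.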

Search query results

Here are our search results:
  • uma fa'amatala i a'u i le tala o le video 
  • fa:makala loa le!
  • fa:makala?
  • fa:makala ka:maloa lale e 
  • ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo
  • mafai ona e fa'amatala i a'u 
  • fa'amatala?
  • i e mafai ona e fa:matala i le ese'esega o gagana sa:moa 
  • e mafai ona e fa'amatala i le tala le lenei 
  • i fasa:moa, fa'amolemole fa'amatala i le a
  • le kusi la ga ae kago famakala aka 
  • o: mai o le se famakala aku le mea 
  • fa:makala uma ?
  • e ke kago famakala le aka 
That looks good! Not all variations we thought might exist occurred (we didn't get "famatala"), but that's normal. (In fact, specifically not getting that form is expected. Shortening of vowel + the t-lect should not co-occur often, if we believe what Mayer, Ochs and others have said about Samoan variation.)

If you want to edit your search query, you don't need to start all over. Just click the search window again right there over your results, it'll be editable again. (This took me a while to realize.)

Step 2) exporting the search results
This is very easy: in the search window you have up, go to "Query>Export" and choose to export as tab-delimited text.
Export search query results
Exporting search results dialogue window
Name your file something sensible, and put it in a good place. Now let's have a look at said file outside of ELAN, shall we? The file will have the file-extension ".txt", but it is a tab-separated file (".tsv"). Open it in some spreadsheet program (excel, numbers, libreoffice, google sheets, whathaveyou) and it should look a little something like this:

Search results file opened in Excel, specifying tab as delimiter.
That looks kinda alright, doesn't it? There's no headings, but we can figure this out. There's some things in there that we didn't ask to have, for example the first column is the file location. That's not needed for what we're doing, and I'll show you how to handle that in the next step. Don't worry.

Step 3) creating tier(s) out of the search results
Now we go back to ELAN and we import this file as a tier. What will happen here is that an entirely new .eaf-file will be created; the tier will actually not be imported directly into whichever file you currently have open. This means that it doesn't matter which .eaf-file you currently have open when you import (or indeed if any is open). Counterintuitive, I know, but don't worry - I've figured it out. It's not that complicated, just stay with me.

File>Import> CSV/Tab-delimited Text file

Importing CSV/Tab-delimited Text file
Next up you will get a window asking you questions about the file you're trying to import. Remember how the file didn't have headings for the columns? How will we figure out what is what? Not to worry, it's like this:

1 col: ignore (uncheck)
2 col: Tier
3 col: Begin time
4 col: ignore (uncheck)
5 col: end time
6 col: ignore (uncheck)
7 col: Duration (not sure why this is needed but oh well)
8 col: ignore (uncheck)
9 col: Annotation

Import CSV/Tab-delimited Text file dialogue window.
I wish that ELAN had a way of automatically recognizing its own search output, but it doesn't and we know how to do this anyway so it's all good. No need to specify the other options, just leave them unchecked.
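If you want to peek at the export file with a script rather than a spreadsheet, the column layout above can be sketched like this in Python. The sample line is invented for illustration (the path, the time strings and the "x" placeholders are all assumptions; I haven't documented what the ignored columns actually contain), so treat this as a sketch of the layout and not a description of ELAN's exact output.

```python
import csv
import io

# 0-based positions of the columns kept in the import dialogue:
# 2nd = Tier, 3rd = Begin time, 5th = End time, 7th = Duration, 9th = Annotation.
TIER, BEGIN, END, DURATION, ANNOTATION = 1, 2, 4, 6, 8


def read_export(handle):
    """Read a headerless tab-delimited search export into a list of dicts."""
    rows = []
    # QUOTE_NONE keeps apostrophes and quote marks in annotations untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 9:
            continue  # skip blank or incomplete lines
        rows.append({
            "tier": fields[TIER],
            "begin": fields[BEGIN],
            "end": fields[END],
            "duration": fields[DURATION],
            "annotation": fields[ANNOTATION],
        })
    return rows


# Hypothetical example line; every value here is made up.
sample = (
    "/some/folder/recording.eaf\ttranscription\t00:00:01.000\tx\t"
    "00:00:02.500\tx\t00:00:01.500\tx\tfa:makala?\n"
)
rows = read_export(io.StringIO(sample))
print(rows[0]["tier"], rows[0]["annotation"])
```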
An actual ghost

Now you will have a new .eaf-file with the same name as the file with the search results. This file will contain only the tier(s) you had searched within and only the annotations matching the search query. There's no audio file and no other tiers. It's like a ghost tier, haunting the void of empty silence of this lonely .eaf-file.
A lonely ghost tier in an otherwise empty .eaf-file
Save this file and other files currently open in some clever place(s), quit ELAN and then restart ELAN. Sometimes there seems to be a problem for ELAN to accurately see files later on in this process unless you do this. I don't know why this is, but saving, closing and restarting seems to help, so let's just do that :)!
Chris O'Dowd as Roy Trenneman in IT-crowd
Step 4) importing the search results tier into the original file
Now here's where I slightly lied to you: we're not going to import the tier into your file. We're going to merge the search-results-tier-only-file with the other .eaf-file that has all the audio and other tiers, and the result is going to be a new .eaf-file. So you'll have three files by the end of this:
  • a) your original .eaf-file with audio and lotsa tiers
  • b) your .eaf-file with only the search results-tier and no audio etc (ghost-tier)
  • c) a new merged file consisting of the two above listed
Don't worry, I've got this.  I'm henceforth going to call these files (a), (b) and (c) as indicated above.

Open file (a). Select "Merge Transcriptions..."

File>Merge Transcriptions...

Select Merge transcriptions
Now, select file (a) as the current transcription (this is default anyway), file (b) as the second source and choose a name and location for the new file, file (c), in the "Destination" window. You can think of "Destination" as "Save as.." for file (c) - our new file.

Specifying what should be merged and how
Do not, I repeat, do not append. And no need to worry about linked media, because (b) doesn't have any audio or anything (remember, it's a ghost). Just leave all those boxes unchecked.

Let ELAN chug away with the merging, and then you're done!

Step 5) finished!
Tadaaa! We're done! That wasn't so bad, was it? And look at what we've created!

Here's my merged file - file (c). I've taken the search-results tier and renamed it ("famakala"). I also copied it and renamed that one ("famakala - comments"). That way, I have a tier for making comments about the transcription annotation that has the exact same annotation distributions, but different values.
Final merged file in annotation mode, with the search results tier renamed and copied.
Here's the same file in the transcription mode, configured to only show the two tiers targeting the search query:
Final merged file in transcription mode, showing only the search results tiers.
Now, some final notes:
  • You might want to rename file (c) and delete file (a) and (b), for your own sanity later when managing the files, if for nothing else
  • Don't know how to get to transcription mode? Go to "Options>Transcription Mode".
  • Your tiers aren't showing up properly in transcription mode? Check that the "linguistic types" of the tiers are what you think they are and that that's what you've configured to see in transcription mode. Transcription mode can only show you tiers of one linguistic type at once (unless you use columns, but that is complex). I also don't get it really, but then again I barely get "linguistic types" at all
  • Transcription mode getting clogged up with lots of irrelevant tiers? Go to "Configure..." on the left in the transcription mode window, select the right linguistic type and "Select tiers.." in the bottom left. Tick only the tiers you want to see at that moment
  • You can import several tiers at once by this method, you don't have to merge one search result at a time, see below
  • You might want to do something complicated related to speakers, see below
Several tiers at once
You can either search several tiers at once in the search mode and hence have several tiers in the search query output, or you could do several searches separately and then append the resulting tsv-files together afterwards in your spreadsheet-program. If there is a different value in the "Tier" column, ELAN will make several tiers when importing back as an .eaf-file. So, you can do several tiers at once.
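If you'd rather not do the appending by hand in a spreadsheet, here is a minimal Python sketch of the same idea. It assumes each export is headerless and has the tier name in the second column, as in the export described above. The function names and the two example lines are made up for illustration.

```python
def append_exports(texts):
    """Concatenate the contents of several tab-delimited exports, dropping blank lines."""
    lines = []
    for text in texts:
        lines.extend(line for line in text.splitlines() if line.strip())
    return "\n".join(lines) + "\n"


def tier_names(combined):
    """List the distinct values in the Tier column (the second column)."""
    return sorted({line.split("\t")[1] for line in combined.splitlines()})


# Two hypothetical exports, one line each; all values are invented.
export_a = "rec.eaf\ttranscription\t00:00:01.000\tx\t00:00:02.000\tx\t00:00:01.000\tx\tfa:makala?\n"
export_b = "rec.eaf\tcomments\t00:00:05.000\tx\t00:00:06.000\tx\t00:00:01.000\tx\tcheck this one\n"

combined = append_exports([export_a, export_b])
print(tier_names(combined))
```

Each distinct name that shows up in the Tier column should then become its own tier when the combined file is imported.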

Speaker tiers
Everyone organises their ELAN-files differently. I have a separate tier where I indicate who the speaker is in the annotation (see above screenshots). This is in contrast to how a lot of other people do it, with different tiers for different speakers. This means that I can search many speakers at the same time, or condition the search for "when X is indicated in speaker-ID-tier". 

If you're doing different tiers for different speakers, you might have to figure out something a bit different from me in order to search many speakers at the same time. It's not that difficult though, you just have to meddle a bit with the search query (or just search one speaker at a time). Contact me if you want help.

On a related note, if someone ever was to ask me to do separate speakers in different tiers, I can use the above process to separate out only annotations with a certain value in the speaker-tier and then import them back as tiers per speaker. I'd rather not, I like it this way. But, I like making sure that the way I set things up is possible to configure to please others as well. Flexibility is good, don't lock yourself into a too narrow set-up that doesn't allow you to change without losing data.

That granted, I need to do manual fidgety things for overlapping speech given this model. That's inconvenient, but I'm ok with it.

Short guide
Step 1) Clever searching
Step 2) export search results
  • Query>Export (Save as tab-delimited text file)
Step 3) create new tier
  • File>Import> CSV/Tab-delimited Text file
  • Specify columns (1 col: ignore, 2 col: Tier, 3 col: Begin time, 4 col: ignore, 5 col: end time, 6 col: ignore, 7 col: Duration, 8 col: ignore, 9 col: Annotation)
  • Save new .eaf-file. 
  • Quit and restart ELAN
Step 4) Creating merged file
  • Open original file with audio and other tiers
  • File>Merge transcriptions...
  • Select .eaf-file with search results as second source (do not append)
  • Save new merged file
  • Delete superfluous files
Step 5) done
  • rename and copy tiers if necessary

I'm sure there's other ways of doing this, but this is what has worked well for me. I'd like this to be easier in ELAN, but in the meantime this works so I'm gonna do it like this.

I find, in general, that I learn more about ELAN and other similar tools by just trying lots of different things and probing the system. Sure, there's manuals, but they often envisage a different usage than I'm after. For example, I'm not clear on what I actually gain by "linguistic types" in what I want to do. Nevermind, probing, searching and sharing seem to be the best way to go for tailored functions. Usually, what you can conceptually imagine as a useful thing exists somewhere (it's like rule 34 but for software). I didn't know how this worked until I thought to myself: "there must be a way of importing search results". And lo and behold, there is. Now here's something I've learned and that you now can do too! Good luck!
Good bye!
Richard Ayoade as Maurice Moss in IT-crowd
* No, I don't know why it is that two linguists who are working/worked on specifically Samoan are trying to teach other linguists to use regular expressions in ELAN. Must be something in the water.
Ulrike Mosel and Hedvig Skirgård (yours truly) in Canberra
Samoan water, Neiafu-Tai village

    • Sloetjes, H., & Wittenburg, P. (2008).
      Annotation by category – ELAN and ISO DCR.
      In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
    • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H. (2006).
      ELAN: a Professional Framework for Multimodality Research.
      In: Proceedings of LREC 2006, Fifth International Conference on Language Resources and Evaluation.
    • Brugman, H., Russel, A. (2004).
      Annotating Multimedia/ Multi-modal resources with ELAN.
      In: Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
    • Crasborn, O., Sloetjes, H. (2008).
      Enhanced ELAN functionality for sign language corpora.
      In: Proceedings of LREC 2008, Sixth International Conference on Language Resources and Evaluation.
    • Lausberg, H., & Sloetjes, H. (2009).
      Coding gestural behavior with the NEUROGES-ELAN system.
      Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.

Saturday, July 29, 2017

Speakers per language diagram & International Linguistics Olympiad memes

Hello readers of Humans Who Read Grammars,

As well as writing on this blog, I also work with the International Linguistics Olympiad (IOL*). The IOL is a contest for students of secondary school from all over the world where they get to compete in solving linguistic puzzles. Normally in order to explain what the contest is all about I send people to the page with old problem sets, but there's a hip IOL-meme page that's produced some very apt memes that may do a better job at explaining the contest to linguists. I'll paste them in below. (Remember how we started as a meme-based blog for typologists?)

I recently made a post on our blog over there about the dominance of European countries in the contest and language diversity. For that post, I derived a little data visualisation of speaker populations per language (based on the 19th edition of Ethnologue) with infogram. I thought y'all might like it as well, so I'm sharing it here too.

By the way, if you're a linguist who'd like to help keep the contest strong and encourage clever youngsters to get into linguistics, get in touch! There's a lot of countries where there is no contest, or where the contest could well do with some help in thinking of clever problems based on small languages, lecturing etc. Talk to us and we'll figure something out.

Here is a table from Ethnologue that tries to explain this as well, a bit niftier but perhaps less pretty.

Table from Ethnologue summarising the number of speakers per language.

* Yes, the International Linguistics Olympiad is abbreviated "IOL". It's a thing about neutrality, don't worry about it.

Wednesday, July 5, 2017

What languages are grammars of the world written in?

Humans have been writing grammars for a long time. The serious expansion into non-European languages is fairly recent though, and associated with colonialism and Christian missionary work. Because of this, it's interesting to see what language grammars are written in (meta-language) as well as what language they're about (target-language). In the map above, this is precisely what we see - what the meta-languages of Glottolog language descriptions are.

There's roughly 7,000 languages in the world alive today, and we have some kind of description of approximately 4,000 of them. If you want to find them, go and search Glottolog.

Harald Hammarström, one of the editors of Glottolog, recently shared with me some interesting data on these descriptions that I want to share with all of you. In Glottolog, descriptive references are tagged for which language they're in (meta-language) as well as which language they are about (target-language)*. The map above gives the distribution of meta-languages of the descriptions of 4,005 languages in Glottolog. For each language on the map above there is only one dot with only one color. The color is according to the meta-language of the Most Extensive Description for said language**.

In this map we can clearly see the domination of English as a world language, but we can also see the prevalence of French in former French colonies in Africa and naturally the national languages of the modern nation states like Brazil (Portuguese) and Indonesia (Indonesian).

If we look a bit closer at this data we can see exactly how many target-languages there are per meta-language in total, as well as how many documents in Glottolog there are per meta-language. For those documents where it's possible, Hammarström has also compiled a corpus of the actual content text per document and calculated how many types and tokens there are therein.
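For readers who haven't met the terms: tokens are all the running words in a text, and types are the distinct words. A minimal Python sketch of the count follows. The tokenisation here (lowercasing and splitting on non-word characters) is my own simplifying assumption and not necessarily how the Glottolog corpus was processed.

```python
import re


def types_and_tokens(text):
    """Return (number of types, number of tokens) for a text."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokenisation
    return len(set(tokens)), len(tokens)


n_types, n_tokens = types_and_tokens("The cat and the dog")
print(n_types, n_tokens)
```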

The table below summarizes this information for all references in Glottolog, i.e. not only the Most Extensive Description per language. There's a total of 96 meta-languages in Glottolog; the table summarizes the 9 most common.
Here is an interactive graphic showing the same data as the table above:

We hope you enjoyed that, be sure to explore Glottolog yourself if you haven't already!

* In bibTeX-entries for Glottolog references, meta-languages have the entry field "inlg" and target-languages have "lgcode".

** Most Extensive Description is first sorted by descriptive type (Grammar>Grammar Sketch> etc), then number of pages and lastly publication year.
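The three-level sort described in this footnote can be sketched in Python as a tuple sort key. Several things here are assumptions on my part: that more pages and a later publication year rank higher, that anything other than the two types named in the footnote ranks after them, and the record fields themselves. The example records are invented.

```python
# Rank of descriptive types; only the two named in the footnote are listed,
# the remaining types ("etc") are left out and share a fallback rank.
TYPE_RANK = {"grammar": 0, "grammar_sketch": 1}
OTHER_RANK = len(TYPE_RANK)


def most_extensive(descriptions):
    """Pick one description: best type first, then most pages, then latest year."""
    return min(
        descriptions,
        key=lambda d: (
            TYPE_RANK.get(d["type"], OTHER_RANK),
            -d["pages"],  # assumption: more pages wins
            -d["year"],   # assumption: later year wins
        ),
    )


# Hypothetical records for one language.
records = [
    {"title": "A", "type": "grammar_sketch", "pages": 300, "year": 2010},
    {"title": "B", "type": "grammar", "pages": 150, "year": 1985},
    {"title": "C", "type": "grammar", "pages": 420, "year": 1970},
]
best = most_extensive(records)
print(best["title"])
```

Note that type trumps length here: the 300-page sketch loses to the 150-page grammar, and among the grammars the longer one is picked.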

Tuesday, June 27, 2017

New Approaches to Ethno-Linguistic Maps

I’m excited to give a guest blog post here at Humans Who Read Grammars on new methods in language geography.  I’m a geographer by trade, and I am currently a PhD student at the University of Maryland.  I also work for an environmental nonprofit - Conservation International - doing data science on agriculture and environmental change in East Africa.  Before ending up where I am now, I lived for some time in West Africa and the Philippines.  During my time in both of those linguistically-rich areas, I became quite interested in language geographies and linguistics more generally.  Spurred on by curiosity and my disappointment in available resources, I’ve done some side projects mapping languages and language groups, which I’ll talk about here.

Problems with Current Language Maps

A map of tonal languages from WALS.  Fascinating at a global scale, but unsatisfying if you zoom in to smaller regions.
One major issue with most modern maps of languages is that they often consist of just a single point for each language - this is the approach that WALS and Glottolog take.  This works pretty well for global-scale analyses, but simple points are quite uninformative for region-scale studies of languages.  Points also have a hard time spatially describing languages that have disjoint distributions, like English, or languages that overlap spatially. See here for a more in-depth discussion of these issues from Humans Who Read Grammars.

One reason that most language geographers go for the one-point-per-language approach is that using a point is simple, while mapping languages across regions and areas is very difficult.  An expert must decide where exactly one language ends and another begins.  The problem with relying on experts, however, is that no expert has uniform experience across an entire region, and thus will have to rely on other accounts of which language is prevalent where.  This is how, for example, the Murdock Map of African ethno-linguistic groups was created.  As a continental scale map, it is rich and fascinating.  However, when looking more closely at a specific region, the map seems to have problems - how did Murdock know exactly the shape of each little wiggle identifying the boundary between two groups?  What about areas where two different groups overlap?  Other issues can arise when trying to distinguish distinct groups when often the on-the-ground reality is that a language may exist as a dialect continuum, something that subjectively drawing polygons does not readily account for.

These maps can have real import when they form the foundation of other analyses. Researchers have examined whether ethnic diversity in developing countries, and in Africa in particular, can hamper economic development and lead to conflict. Scientists disagree, although many analyses use the Murdock map. See some of this research here, here and here. Another study, recently published in Science, looked at Internet penetration in areas where politically excluded ethnic groups live. They found that groups without political power were often marginalized in terms of internet service provision. However, their data for West Africa, which came from the Ethnic Power Relations database, was quite rough: all of southern Mali was one ethnic group labeled "blacks" while the north was labeled as "Tuaregs" or "Arabs", while there was no data at all for Burkina Faso.  While their findings were important and they did the best that they could with available datasets, a less informed analysis from the same data could end up looking like linguistics done horribly wrong.  We need better ethno-linguistic maps simply to do good social science and address these critical questions.

New Methods and Datasets

I believe that, thanks to greater computational efficiency offered by modern computers and new datasets available from social media, it is increasingly possible to develop better maps of language distributions using geotagged text data rather than an expert’s opinion.  In this blog, I’ll cover two projects I’ve done to map languages - one using data from Twitter in the Philippines, and another using computationally-intensive algorithms to classify toponyms in West Africa.

I should note that for all its hype, big data can be pretty useless without real-world experience.  The Philippines and West Africa are two parts of the world where I have spent a good amount of time and have some on-the-ground familiarity with the languages.  Thus, I was able to use my local knowledge to inform how I conducted the analyses, as well as to evaluate their issues and shortcomings.

Case Study 1: Social Media From The Philippines

Many fascinating language maps from Twitter have been created at global scales - see here, and here.  However, to explore the distribution of understudied languages that don’t show up in maps of global languages, one must use more bespoke methods.  This is especially true of Austronesian languages like those found in the Philippines, which don’t have a lot of phonemic variability, and therefore aren’t easily classified using the methods that Google Translate uses.  These methods, which rely on slices of the sample text, often confuse Austronesian languages like Tagalog and Bahasa - just look at the maps I mentioned above. Thus, I had to use a word-list method, and created word lists from corpora offered by SEAlang, and by scraping from local-language Wikipedia articles.  The resulting maps show exactly where minority languages are used in comparison with English and Tagalog in the Philippines, and likely underestimate the prevalence of minority languages because the corpora used (Wikipedia and the Bible) are quite different from the Twitter data that was classified.
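To give a flavour of what a word-list method can look like, here is a minimal Python sketch. It is not the code used for the maps: the scoring rule (count how many tokens appear in each language's list and pick the highest count) is an assumption, and the tiny word lists are toy stand-ins for lists built from real corpora.

```python
import re

# Toy word lists for illustration only; real lists would come from corpora.
TOY_LISTS = {
    "english": {"the", "and", "is", "you"},
    "tagalog": {"ang", "hindi", "ako", "sa"},
}


def classify(text, word_lists):
    """Return the language whose word list covers the most tokens, or None if no hits."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in word_lists.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first language listed
    return best if scores[best] > 0 else None


print(classify("hindi ako pupunta sa school", TOY_LISTS))
print(classify("the bus is late and you know it", TOY_LISTS))
```

A mixed category like Taglish could be handled by looking at the per-language scores instead of only the winner, but how that was decided for the maps is not something this sketch tries to reproduce.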

Languages of Tweets in the Philippines.
The resulting map shows about 125,000 tweets in English, Tagalog, Taglish (using Tagalog and English in the same tweet), and the local languages Cebuano, Ilocano, Hiligaynon, Kapampangan, Bikol, and Waray.  This map offers more nuance than traditional language maps of the Philippines.  For example, most maps would show Ilocano over the entire northern part of Luzon, but this map shows that the use of Ilocano is much more robust on the northwest coast than in the rest of the north.  This analysis also allowed me to test a hypothesis that I frequently heard locals assert when in the Philippines - that English is more common in the south, because southerners would rather use English than Tagalog, which is seen as a northern language.  I found this to be the case, and I was only able to confirm it because I had such a large sample size.  Without newer datasets like those offered by social media, this hypothesis would be untestable.

To see a more in-depth description of this analysis, you can see my original blog post here.

Case Study 2: West African Toponyms

Another project I did used toponyms, or place names, from West Africa.  Toponym databases like geonames.org have relatively high spatial resolution - with a name for every populated place in an area.  And while a place name is not as long as a tweet or other linguistic dataset, toponyms do encode ethno-linguistic information.  It would be easy for someone familiar with Europe to distinguish whether a toponym is associated with the French or German linguistic group - a French name would likely begin with “Les” and end with “-elle”, while a German name could begin with “Der” and end with “-berg”.  Similar differences exist between toponyms from different ethnic groups all over the world, and are quite evident to locals.  What if you could train an algorithm to detect these differences, and then had it classify every single toponym throughout a region?  That is what I tried to do in this analysis.

I used toponyms for six countries in French West Africa. I decided to focus on French West Africa for several reasons. For one, I have worked there, and have some familiarity with the ethnic groups of the region and their distributions, and it is an area I am very curious about. For another thing, this is a relatively poorly documented part of the world as far as ethno-linguistic groups go, and it is an area with significant region-scale ethnic diversity. Finally, the countries I selected were colonized by one group, meaning that all of the toponyms were transliterated the same way and could be compared even across national borders. In all, I used 35,785 toponyms.

First, I got a list of every possible set of three letters (called a 3-gram) from the toponyms.   Then, I tested for spatial autocorrelation in the locations that contained each 3-gram using a Moran's I test, and selected only those 3-grams that had significant clustering.
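As a rough illustration of the first step, here is a minimal Python sketch that pulls the 3-grams out of a list of names and records which names contain each one. Lowercasing and keeping spaces or hyphens inside 3-grams are assumptions of this sketch. "Bamako" and "Segou" are places mentioned in this post, while "Yaokro" is a made-up name.

```python
from collections import defaultdict


def three_grams(name):
    """Return the set of 3-character substrings of a lowercased name."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def index_by_gram(names):
    """Map each 3-gram to the list of names that contain it."""
    index = defaultdict(list)
    for name in names:
        for gram in sorted(three_grams(name)):
            index[gram].append(name)
    return dict(index)


names = ["Bamako", "Segou", "Yaokro"]  # "Yaokro" is hypothetical
index = index_by_gram(names)
print(sorted(three_grams("Bamako")))
print(index["kro"])
```

With coordinates attached to each name, the lists in this index are what a spatial autocorrelation test would be run on, one 3-gram at a time.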

To give an illustration of why this was necessary, here are two examples of the spatial distribution of 3-grams. One 3-gram - "ama" - occurs roughly evenly throughout the regions in this study. The other 3-gram - "kro" - is very common in toponyms in south-east Côte d'Ivoire, and virtually nonexistent in other areas. Thus, "kro" has significant spatial autocorrelation whereas "ama" does not.

Here are all of the toponyms that contain the 3-gram "kro" 

And here are all of the toponyms that contain the 3-gram "ama" 

Thus, the 3-gram "ama" doesn't tell us much about which ethnic group a toponym belongs to, because that 3-gram is found evenly distributed throughout West Africa - it is just noise. The 3-gram "kro", on the other hand, carries information about which ethnic group a toponym belongs to, because it is clearly clustered in a group in Southeast Côte d'Ivoire.
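For the curious, the Moran's I statistic itself is I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights w_ij. Below is a minimal Python sketch with x_i = 1 if a place name contains the 3-gram and 0 otherwise. The binary distance-threshold weights, the flat (planar) coordinates and the six-point example are all assumptions for illustration, and the sketch only computes the statistic; it does not do the significance test.

```python
import math


def morans_i(values, coords, threshold):
    """Moran's I with w_ij = 1 when points i and j are within `threshold`, else 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if denom == 0 or w_sum == 0:
        return float("nan")  # undefined without variation or neighbours
    return (n / w_sum) * (num / denom)


# Six hypothetical places on a line, one unit apart.
coords = [(float(x), 0.0) for x in range(6)]
clustered = [1, 1, 1, 0, 0, 0]  # the 3-gram occurs only at one end
scattered = [1, 0, 1, 0, 1, 0]  # the 3-gram alternates along the line

print(morans_i(clustered, coords, threshold=1.0))
print(morans_i(scattered, coords, threshold=1.0))
```

The clustered pattern gives a positive value and the alternating one a negative value, which is the contrast between a "kro"-like and a non-clustered 3-gram in miniature.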

I then calculated the lexical distance between all of the toponyms based on the number of shared 3-grams that had significant spatial autocorrelation.  To add a spatial component, I also linked any two toponyms that were less than 25 kilometers apart. Thus, I had a graph where every toponym was a vertex, and undirected edges connected toponyms that had spatial or lexical affinity.  Finally, I used a fast greedy modularity-optimizing algorithm to detect communities, or clusters, in this graph.
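Here is a minimal Python sketch of how such a graph could be put together. The 25 km cut-off comes from the description above; everything else is an assumption for illustration: the rule that one shared clustered 3-gram is enough for a lexical edge (`min_shared`), the great-circle distance formula, and the four invented place names with invented coordinates. Community detection is not shown; a library routine for greedy modularity optimisation would take these edges as input.

```python
import math
from itertools import combinations


def three_grams(name):
    """Return the set of 3-character substrings of a lowercased name."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def build_edges(places, clustered_grams, max_km=25.0, min_shared=1):
    """Link two places if they share clustered 3-grams or lie within max_km."""
    edges = set()
    for (n1, c1), (n2, c2) in combinations(places.items(), 2):
        shared = three_grams(n1) & three_grams(n2) & clustered_grams
        if len(shared) >= min_shared or haversine_km(c1, c2) < max_km:
            edges.add(tuple(sorted((n1, n2))))
    return edges


# Hypothetical names and coordinates (lat, lon).
places = {
    "Alphakro": (6.0, -4.0),
    "Betakro": (7.0, -5.0),
    "Gamma": (12.0, -8.0),
    "Delta": (12.1, -8.0),
}
edges = build_edges(places, clustered_grams={"kro"})
print(sorted(edges))
```

In this toy example the two "-kro" names are linked lexically even though they are far apart, and "Gamma" and "Delta" are linked only because they are about 11 km from each other.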

The algorithm found seven distinct communities, which definitely correspond to ethnic groups and ethnic macro-groups in West Africa.

The red cluster includes Wolof, Serer, and Fulfulde place names, which makes sense, as all of these groups are Senegambian languages. This group of languages is the primary group in Senegal and Mauritania, which my classification picked up on. It also caught the large Fulfulde presence in central Guinea, throughout an area known as the Fouta-Djallon. This cluster also has a significant presence throughout the Sahel, stretching into Burkina Faso and dotted throughout the rest of West Africa, much like the migrant Fulfulde people.

The green cluster captures most of the area where Mandé languages are spoken, including most of Mali, where the Bambara are found, as well as Eastern Guinea and Northern Côte d'Ivoire, where Malinké is found. Interestingly, most of the toponyms in Western Mali fell into the Senegambian/Fulfulde cluster, and were not in the Mandé cluster, even though there are Mandé groups like the Soninké and Khassonké in Western Mali. Southern Guinea is densely green, representing the presence of Mandé groups there, like the Kuranko. Surprisingly, much of central and southern Côte d'Ivoire also fell into the green cluster, even though there are a couple of different groups there which are not in any way related to the Mandé groups that were most represented in the green cluster. This is also true of areas in Western Burkina Faso and Eastern Mali, where there are many languages unrelated to the broader Mandé group, such as Dogon, Bobo, Minianka, and Senufo/Syempire. However, I know that Dyula, a Mandé language closely related to Bambara, is spoken as a trade language in both of these areas (Côte d'Ivoire and Western Burkina Faso). It could be that Dyula has had a long enough presence in these areas to leave an imprint on the toponyms there.

The purple group pretty clearly captured two disjoint groups that are both in the broader Mandé group - the Susu, in far Western Guinea, and the Dan, in Western Côte d'Ivoire. These groups are normally classified as being on quite separate branches of the Mandé language family, with the Susu being Northern Mandé and Dan being Eastern Mandé. However, the fact that the algorithm put them in the same group, even though they were too far apart to have edges/connections based on spatial affinity, shows that Dan and Susu toponyms have several three-grams in common.

The yellow cluster seems to have caught two sub-groups within the broader green/Mandé cluster. Many of the yellow toponyms in central Mali are in what you could call the Bambara homeland, between Bamako and Segou. However, a second cluster stands out quite distinctly in southern Guinea. It's unclear to me what group this could represent and why it would have toponymic features distinct enough from its neighbors that the algorithm put it in a different cluster. Some maps say that a group called the Konyanka lives here and speaks a language closely related to Malinké.

The turquoise cluster quite clearly captures the Mossi people and their toponyms, as well as the Gurunsi, a related group (both Mossi and Gurunsi are classified as Gur languages).

The black cluster in southern Burkina Faso captured a group that most national ethno-linguistic maps call the Lobi, although this part of West Africa is known for its significant ethno-linguistic heterogeneity. Another group of villages in Eastern Burkina Faso also fell into the black cluster, although I could not identify any significant ethnic group living there.

Finally, the blue cluster captured both the Baoulé/Akan languages as well as the Senufo. It captured the Senufo especially in Côte d'Ivoire and somewhat in Burkina Faso, but not much in Mali, where I know the Senufo have a significant presence. This could represent a Bambarization of previously Senufo toponyms due to the fact that the government of Mali is predominantly Bambara, or it could pre-date the Malian state, as this area was part of Samori Toure's Wassoulou Empire, in which the Malinké language was strongly enforced. The classification of the Senufo languages has always been controversial, but this toponymic analysis suggests that Senufo toponyms are more closely related to Kwa toponyms to the south than to Gur toponyms to the northeast.


Some caveats with this work and its interpretation. For one, this only shows toponymic affinities. Those affinities usually correspond to ethnic distributions, but not always. There is a lot of migration in West Africa today, and place names don't usually change as quickly as the distributions of people. Thus, toponyms can sometimes encode historic ethnic distributions. For example, many toponyms in the United States come from Native American languages, and there are many toponym suffixes in England that reflect a historic Nordic presence. This and similar maps are therefore most informative when interpreted in combination with on-the-ground information and knowledge.

Another issue with classifying toponyms in West Africa in particular is that West African toponyms are transcribed using the Latin alphabet, which definitely does not capture all of the sounds that exist in West African languages. Different extensions of the Latin alphabet, as well as an indigenous alphabet, are often used to transcribe these languages, but these idiosyncratic writing systems are not used in the geonames dataset. Thus, the Fulfulde bilabial implosive (/ɓ/ in IPA) is written the same way as a pulmonic bilabial plosive - as a "b" - so this distinction is lost in our dataset, even though it adds a lot of information about what ethnic group a given toponym belongs to. However, some other sounds and sound combinations that are very indicative of specific languages are captured by the Latin alphabet - for example prenasalized consonants (/mb/) common in Senegambian languages, labial velars (/gb/ and /kp/) common in coastal languages, or the lack of a 'v' in Mandé languages. Issues also arise because different colonizers transcribed the same sounds differently, for example 'ny' and 'kwa' in English would be 'gn' and 'coua' in French. However, this didn't apply in this analysis, which only used Francophone countries, and I believe it could be dealt with if I tried to do a larger analysis.
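One way the colonial spelling differences might be handled in a larger analysis is to rewrite names into a common spelling before extracting 3-grams. The sketch below is only an illustration of that idea: it uses just the two equivalences mentioned above ('gn'/'ny' and 'coua'/'kwa'), the function and table names are hypothetical, and the example place names are invented.

```python
# French-style spellings mapped to English-style ones.
# Only the two examples from the text are included; a real table would need many more.
FRENCH_TO_COMMON = [("coua", "kwa"), ("gn", "ny")]


def normalise(name):
    """Lower-case a toponym and rewrite French-style digraphs to a common form."""
    s = name.lower()
    for french, common in FRENCH_TO_COMMON:
        s = s.replace(french, common)
    return s


# Invented example names: the same hypothetical place spelled two ways.
print(normalise("Couagna"))
print(normalise("Kwanya"))
```

Blind string replacement like this will sometimes over-correct (for instance where "gn" spans two syllables), so any real mapping would need checking against known names.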


This is an exciting time to be at the intersection of geography and linguistics! New datasets and computational methods are giving researchers the ability to ask newer and better questions about who belongs to what group, and where. I hope new developments in this research can yield new linguistic results about phylogeny, migration, and the spread of linguistic phenomena. Outside of the field of linguistics, better language maps could have broad applications, from improving disaster response planning to helping to answer critical questions about the origins of ethnic conflict.

Thanks for reading! You can check out my personal website for more detailed descriptions of these two projects, as well as other side projects I've done.

Thursday, June 1, 2017

World map of language families from Glottolog

World map from Glottolog, each language is one dot and coloured by language family (or other top-genetic unit).
Language families are the main way we categorise and understand the language diversity of the world. A language family is a group of languages that have been analysed as having one ancestor, one great-great-great-and-yet-greater-grand-mother language. Indo-European is a language family, with the sub-groups of Romance, Germanic, Slavic etc.

Maps are great tools for visualising information, and we're pretty map-nerdy on this blog. Robert Forkel, one of the editors of Glottolog, kindly shared with me an interactive map of the world with languages plotted out and coloured by language family. The map is rendered in a web browser from an HTML file and a JSON file.

This map is not available on the Glottolog site, but will later be implemented in the command-line interface. You can see language families on the website by either selecting a country or a specific family. This tool is the only way to see all language families in all countries on Glottolog. 

I will let you know when this is implemented and you can play with it yourself. In the meantime, I thought I'd share this screenshot and talk a little bit about language families.


Some notes on language families, and in particular Glottolog language families and this map

When we look at the collected wisdom of linguistic scholars, we actually find a lot of disagreement. For example, Ethnologue counts 135 language families and Glottolog 239!* To read more about this, please go to this post on the "other" languages of Glottolog and Ethnologue, and how the two catalogues define these categories.

Due to lack of data and disagreements, we also have very different estimates for language family depth, i.e. how long ago the greatest-grand-mother language was spoken. Here are some examples:

Language family    Proposed date (years ago)
Afro-Asiatic       9,500-18,000
Algic              7,000
Austronesian       6,000-8,000
Dravidian          6,000
Indo-European      5,500

In this case, we're using the language families (and other top-genetic units) from Glottolog. Glottolog is a carefully curated catalogue of languages, and for each grouping there is always a reference provided to where in the academic literature we can find support for exactly how the tree is structured. This is very helpful. With this said, it's worth noting that Glottolog often tends to be more "splitting" (not lumping languages into very large families) than other similar resources, like Ethnologue. In general, Glottolog often represents a more conservative view of language history.

Glottolog also contains other kinds of groupings besides what we commonly think of as "families", for example: unattested, sign languages, isolates, pidgins, artificial etc. More on this here.

Please remember when you look at these maps that:

  • stacking of dots is not trivial: Nigeria, for example, looks more full of Atlantic-Congo languages than it is (see images below). Zoom in for denser areas
  • the colours on this map were not picked manually, but assigned automatically
  • Creoles are in the family of their lexifier
  • there are other groupings besides traditional language families in the dataset
  • these are dots, not polygons
  • this will be implemented as a command line tool, so you should get your git and python on in order to make these yourself.
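To illustrate the point about automatically assigned colours: one simple approach is to spread hues evenly over however many families there are. This is only a sketch of the general idea and not how the Glottolog map picks its colours; the function name `assign_colours` is made up for this example.

```python
import colorsys


def assign_colours(families):
    """Map each distinct family name to a hex colour with evenly spaced hues."""
    unique = sorted(set(families))
    colours = {}
    for i, family in enumerate(unique):
        # Step around the colour wheel; saturation and brightness are fixed.
        r, g, b = colorsys.hsv_to_rgb(i / len(unique), 0.8, 0.9)
        colours[family] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colours


# One entry per language dot; repeated families get the same colour.
palette = assign_colours(["Indo-European", "Austronesian", "Dravidian", "Austronesian"])
for family, colour in palette.items():
    print(family, colour)
```

With a couple of hundred families, evenly spaced hues end up very close to each other, which is one reason automatically assigned colours can be hard to tell apart on a map like this.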

Nigeria in the world map at the top of the post
Nigeria zoomed in
Here are some more zoomed in areas for your enjoyment
The island of New Guinea
Mainland South East Asia
Top South America

Language Family Tournament

On a sillier note, the Facebook page Etymology Memes for Reconstructed Phonemes recently ran a tournament where followers could vote for which was their favourite language family from a set of 24. Since this is related to the content of this blog post, I'll share those results as well!
A tournament on Facebook where followers of the page
"Etymology Memes for Reconstructed Phonemes" could vote for which was their favourite language family.
The winner of said contest, Basque
Other ways of categorising languages besides language families
There are other ways of categorising languages than into language families, most notably into geographic areas. It seems that languages that are in contact influence each other. Furthermore, it is not necessarily true that all parts of a language (sound system, vocabulary, grammar, syntax, etc) have one and only one shared ancestry - there could be multiple underlying trees for different parts of language. It may be that the counting system was borrowed from neighbour x and some phonemes imported from neighbour y. Another reason for multiple trees is dialect chains breaking up and coming together again, which is hard to detect given enough time.

Besides these approaches, we can also categorise languages into types (suffixing, tonal, CVCV, VSO, isolating etc). This is what typologists do. Knowing the distribution of various traits in the world's languages, we can not only investigate language history, but also ask questions such as:

  • are certain traits correlated with each other?
  • are there trade-offs between traits, for example to minimize complexity?
  • are there cognitive constraints on combination of traits?
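As a rough illustration of the first question, here is how the association between two yes/no traits could be measured with the phi coefficient from a 2x2 table. The trait values below are invented toy data, not real typological observations, and the function name is just for this example.

```python
from math import sqrt


def phi_coefficient(pairs):
    """Phi coefficient for two binary traits.

    pairs: iterable of (has_trait_a, has_trait_b) booleans, one pair per language.
    Returns a value between -1 and 1, or 0.0 if a trait never varies.
    """
    n11 = n10 = n01 = n00 = 0
    for a, b in pairs:
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Invented toy data: each tuple is one hypothetical language.
always_together = [(True, True)] * 4 + [(False, False)] * 4
unrelated = [(True, True), (True, False), (False, True), (False, False)]

print(phi_coefficient(always_together))
print(phi_coefficient(unrelated))
```

A raw correlation like this treats every language as an independent data point. Languages in the same family or the same area are not independent, so in practice typologists have to control for family and contact before reading much into such numbers.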

Ok, that's it for now. Hope you enjoyed this!



* In order to make a fair comparison, I've excluded some special cases that the two catalogues deal with in very different ways or that we have very little data on. For Ethnologue, I've excluded: constructed languages (1), creoles (88), deaf sign languages (137), language isolates, mixed languages (21), pidgins (13), and unclassified languages (51). For Glottolog I've excluded pidgins (79), isolates (198), mixed languages (23), artificial (9), speech registers (6), “unattested” (61), “unclassifiable” (117) and sign languages (166). Creoles in Glottolog are classified under their lexifier family, making them hard to count, but they don’t increase the number of families. There are 37 languages with "creole" or "kriol" in their name in Glottolog, but I didn't subtract these since they belonged to families that also contain non-contact languages.