Tuesday, September 19, 2017

Public service announcement: list of databases and more


Public service announcement: there are websites that keep well-curated lists of things that are useful to linguistics researchers and students, including the following:
It would appear that some don't know about these lists, so now you know/are reminded :).

Lists are good, and instead of reinventing them you can look through these and add to them. For more hopefully useful stuff like this, go here.

Monday, August 28, 2017

Ethnologue more restricted


In April this year, Ethnologue changed access restrictions to their website again. Now, non-paying users from high-income countries can only access 1 page per month before they are banned; previously it was 7. In light of this change, we will go through some basics regarding the paywall again (old post here) and where you can go instead. Finally, I list some questions in case any SIL International/Ethnologue staff see this post.

Basics on the pay-wall
We haven't received much detailed information on this change, but if it's the same as last time it means that users with IP addresses in countries that are classified by the World Bank as "high-income" will be restricted. Cloudflare would appear to be the service provider managing this for Ethnologue. Previously, we've learned that only 5% of users look at more than 7 pages per month. We don't know how many go to more than 1 page (probably a lot more though!).

SIL International also maintains the ISO 639-3 codes for language names (one of 6 language ISO-codes). Those pages are NOT affected by this restriction. Ethnologue and SIL International are not the same thing; SIL International produces more things than just Ethnologue.

Old editions of Ethnologue have different restrictions than the current edition.

Ethnologue is mainly funded by Wycliffe Global Alliance (an explicitly Christian organisation), and not by any state or academic institution. This information is based on what I understand from financial statements, so I may be mistaken. Clarification from Ethnologue/SIL International staff is highly appreciated here. Please note that there are many other ISO industry standards that are pay-access only, so the fact that SIL International provides 639-3 openly is fortunate.

It appears to us, the users, that SIL International have made these decisions to remedy a financial situation. It is not clear at this time if SIL International is seeking other ways of bringing in funds, like more traditional grants from research councils.

Where to go instead?
Much of the information that Ethnologue provides is actually available elsewhere. Here is a table displaying some of the places you could go to instead of Ethnologue.


Family trees: MultiTree, Glottolog
Codes: Glottolog, ISO 639-3 repository
Alternative names: Glottolog
Endangerment level: UNESCO Atlas of the World's Languages in Danger
Maps of language areas (polygon data): Langscape
Population stats: (Old Ethnologue editions), CIA World Factbook, Wikipedia

The information that Ethnologue provides that is the hardest to replace is population stats. The pages that I most regret not being able to access are the overall summary stat pages; they're nice for showing the size of language families and the power law of speaker populations.

Here are some more details on some of the resources listed above.


MultiTree

MultiTree is by Linguist List and is a catalogue of lots of different language trees. You can search through the database for lots of different trees and compare them, very cool!

Glottolog
Glottolog is provided by the Max Planck Society, and edited by Harald Hammarström, Robert Forkel and Martin Haspelmath. Most of the detailed curation of the data is managed by Harald. Glottolog provides a lot of information, mainly language codes, trees, location (dots, not polygons) and references. Each tree in Glottolog has a clear reference to a published source, which is very handy. There is also clear information on how the classification is handled.

If you disagree with information you find on Glottolog, or want to add information, you can file a GitHub-issue or click the little alarm bell symbol on the relevant page.

UNESCO Atlas of Languages in Danger
This atlas is the complimentary online version of the 2010 print edition and is edited by Christopher Moseley. It contains information on 2,464 languages. This is the scale, and the number of languages at each level:
  • Vulnerable (592)
  • Definitely endangered (640)
  • Severely endangered (537)
  • Critically endangered (577)
  • Extinct (228)

Langscape
Langscape is a website by the Maryland Language Science Center. They provide games, lesson material for teachers and - most interestingly to us - maps. These interactive maps are actually based on the polygon set of SIL International, and they're not available for download freely. They are however accessible in the interactive web browser interface. One way you can see that these are Ethnologue polygons, is that the genealogical classification is the same. For example: Mande languages are marked as the same family as Bantu languages (not the case in Glottolog).

One of the games that Langscape has is an identification game, not that different from the Great Language Game that we wrote a paper about! We also made a new game, LingQuest, that you can play.

Alright, now you know. Best of luck with whatever research you have that is dependent on this kind of information.

Questions for Ethnologue/SIL International staff (should they be reading)

  1. will the ISO 639-3 codes ever be behind a pay-wall?
  2. are there other products of SIL International that may become like Ethnologue?
  3. is it still only in effect in high-income countries (according to the World Bank)?
  4. is it intentional that the Ethnoblog and the summary statistics pages are included under the pay-wall?
  5. how are Ethnologue and SIL International funded?
  6. have you considered other funding options?
  7. how do Ethnologue and SIL International see their own roles in modern academia and some academics' dependence on the data, despite these resources not being traditionally funded by academic institutions?
  8. why was the change made?
  9. was the change announced anywhere publicly?
  10. how many users access more than 1 page per month?
  11. how many users access more than 7 pages per month?
  12. how have the user stats changed the past 2 years?
  13. how many of your users are commercial and how many are academic, by estimation?
***
EDIT
Note that Ethnologue is not only used by the academic research community, but also by commercial and governmental institutions (for example in this scandal). In fact, considering the new restrictions on access and problems with the basic data (opaque decisions and sources), perhaps academics shouldn't really use Ethnologue much at all.

Tuesday, August 1, 2017

ELAN: making tier(s) out of search results

Hedvig in her office in Canberra figuring this out and writing this guide.
Here is another guide for how to do something practical in ELAN. Previously, we relayed Eri Kashima's guide for sensible auto-segmentation with PRAAT and ELAN (time saver!). (For all posts about fieldwork on this blog, see this tag.)

This time: how to take your search results and make the matching annotations into new separate tier(s). This is useful if you for example want to cycle through only the annotations that match a certain search query in transcription mode. This post has a longer guide, and a short guide at the end.

For those who don't do a lot of transcription: ELAN (EUDICO Linguistic Annotator) is a program from TLA at MPI-Nijmegen. This program allows us to easily annotate audio and/or video files with lots of relevant data. We can use ELAN to count things, but we can also export as CSV-files for analysis later (Excel, R, LibreOffice etc). ELAN is free and great. If you ever need to do transcription, do it in ELAN. Do not create long text-documents with no linking to the audio; it is just ridiculous. Download ELAN here.

Version of ELAN: 4.8.1 (to my knowledge though this should work the same for other versions)

We're going to:
  • search in a clever way
  • export those results
  • import them as new tier(s) into the .eaf-file you're working on
  • thus creating a tier with a defined subset of other existing tiers, making work speedier on targeted parts of your corpus
You can click the images for larger versions.

Example case
I've got a transcribed file where I've noticed some different pronunciation of a certain word. I'd like to pick out only the annotations containing that word, make a new tier with only them, and write down some clever things about this word in that tier. I don't want to have to scroll through all annotations to get to only these.

I work on Samoan, and the word I'm looking at means "to tell/explain": fa'amatala. "Fa'amatala" is the dictionary entry for this word, but it varies in pronunciation in actual speech. I've asked my transcription assistant to mark down vowel length and presence and absence of glottal stops (as opposed to more orthographic transcription). She has done this pretty consistently (as far as I can tell, it's hard to hear glottal stops sometimes), and since I know what kind of variations to expect I can easily find the instances for this word. Due to t and k-style (lects in Samoan) and speed these are the variations we can expect:
  • fa'amatala
  • fa:matala
  • famatala
  • fa'amakala
  • fa:makala
  • famakala
Besides the obvious difference in pronunciation, I've noticed something unusual going on in the realisation of t/k, sort of like an affricate. So, I'd like to listen to all instances of this word with all these spellings and make notes of that.

Here are the steps. At the end is a short guide for when you've started to get the hang of this but need basic guidance.

Step 1) clever searching
In ELAN we can search for simple words, but we can also do something a bit more clever: we can search using regular expressions. Now, you don't need to have a complicated query or know all regex magic to make use of this. In this case, we're simply going to use the 'OR'-function. 'OR' in regular expressions is expressed by the vertical line/pipe character: "|" .

So, I'm searching for "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala" in the tier marked "transcription". No need for bracketing, asterisks or anything like that in this case. If you want to do more complicated things with regular expressions, I highly recommend this guide and cheat sheet for regular expressions in ELAN by Ulrike Mosel*.
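To make the 'OR' idea concrete outside ELAN, here is a minimal sketch in Python. ELAN's search uses its own (Java-based) regular expression engine, so treating Python's `re` module as equivalent is an assumption that should hold for a simple alternation like this one. The shorter pattern is only an illustration of how the same six spellings could be written more compactly; it is not taken from the guide mentioned above.

```python
import re

# The six spellings from the post, joined with the regex OR operator "|".
VARIANTS = ["fa'amakala", "fa:makala", "famakala",
            "fa'amatala", "fa:matala", "famatala"]
LONG_PATTERN = re.compile("|".join(VARIANTS))

# An illustrative, more compact pattern that accepts the same six spellings:
# "fa", then optionally "'a" or ":", then "ma", then "t" or "k", then "ala".
SHORT_PATTERN = re.compile(r"fa(?:'a|:)?ma[tk]ala")

def matches(pattern, annotation):
    """Return True if the pattern occurs anywhere in the annotation."""
    return pattern.search(annotation) is not None

examples = ["fa:makala loa le!", "mafai ona e fa'amatala i a'u", "uma le tala"]
for text in examples:
    print(text, matches(LONG_PATTERN, text), matches(SHORT_PATTERN, text))
```

The long form is easier to read and to extend with a spelling you did not anticipate; the short form is mostly useful once the list of variants grows.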

Search query results

Here are our search results:
  • uma fa'amatala i a'u i le tala o le video 
  • fa:makala loa le!
  • fa:makala?
  • fa:makala ka:maloa lale e 
  • ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo
  • mafai ona e fa'amatala i a'u 
  • fa'amatala?
  • i e mafai ona e fa:matala i le ese'esega o gagana sa:moa 
  • e mafai ona e fa'amatala i le tala le lenei 
  • i fasa:moa, fa'amolemole fa'amatala i le a
  • le kusi la ga ae kago famakala aka 
  • o: mai o le se famakala aku le mea 
  • fa:makala uma ?
  • e ke kago famakala le aka 
That looks good! Not all variations we thought might exist occurred (we didn't get "famatala"), but that's normal. (In fact, specifically not getting that form is expected. Shortening of vowel + the t-lect should not co-occur often, if we believe what Mayer, Ochs and others have said about Samoan variation.)

If you want to edit your search query, you don't need to start all over. Just click the search window again right there over your results, it'll be editable again. (This took me a while to realize.)

Step 2) exporting the search results
This is very easy: in the search window you have up, go to "Query>Export" and choose to export as tab-delimited text.
Export search query results
Exporting search results dialogue window
Name your file something sensible, and put it in a good place. Now let's have a look at said file outside of ELAN, shall we? The file will have the file-extension ".txt", but it is a tab-separated file (".tsv"). Open it in some spreadsheet program (Excel, Numbers, LibreOffice, Google Sheets, whathaveyou) and it should look a little something like this:

Search results file opened in Excel, specifying tab as delimiter.
That looks kinda alright, doesn't it? There are no headings, but we can figure this out. There are some things in there that we didn't ask to have, for example the first column is the file location. That's not needed for what we're doing, and I'll show you how to handle that in the next step. Don't worry.

Step 3) creating tier(s) out of the search results
Now we go back to ELAN and we import this file as a tier. What will happen here is that an entirely new .eaf-file will be created; the tier will not actually be imported directly into whichever file you currently have open. This means that it doesn't matter which .eaf-file you currently have open when you import (or indeed if any is open). Counterintuitive, I know, but don't worry - I've figured it out. It's not that complicated, just stay with me.

File>Import> CSV/Tab-delimited Text file

Importing CSV/Tab-delimited Text file
Next up you will get a window asking you questions about the file you're trying to import. Remember how the file didn't have headings for the columns? How will we figure out what is what? Not to worry, it's like this:

1 col: ignore (uncheck)
2 col: Tier
3 col: Begin time
4 col: ignore (uncheck)
5 col: end time
6 col: ignore (uncheck)
7 col: Duration (not sure why this is needed but oh well)
8 col: ignore (uncheck)
9 col: Annotation
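If you would rather inspect the export with a script than in a spreadsheet, the column layout above can be sketched in Python as follows. This is only an illustration under assumptions: the sample row is made up, the "-" fields are placeholders for the ignored columns (whose actual content is not described here), and the time values are kept as plain strings because their format may differ between setups.

```python
import csv
import io

# Column positions (0-based) following the layout described above:
# col 2 = Tier, col 3 = Begin time, col 5 = End time, col 7 = Duration,
# col 9 = Annotation. Columns 1, 4, 6 and 8 are ignored.
COLUMNS = {"tier": 1, "begin": 2, "end": 4, "duration": 6, "annotation": 8}

def read_export(handle):
    """Yield one dict per row of a headerless, tab-delimited export."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 9:
            continue  # skip blank or malformed lines
        yield {name: row[index] for name, index in COLUMNS.items()}

# Made-up sample row; a real export would be opened with open(path).
sample = ("file.eaf\ttranscription\t00:00:01.000\t-\t00:00:02.500\t-"
          "\t00:00:01.500\t-\tfa:makala?\n")
rows = list(read_export(io.StringIO(sample)))
print(rows[0]["tier"], rows[0]["annotation"])
```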

Import CSV/Tab-delimited Text file dialogue window.
I wish that ELAN had a way of automatically recognizing its own search output, but it doesn't and we know how to do this anyway so it's all good. No need to specify the other options, just leave them unchecked.
An actual ghost

Now you will have a new .eaf-file with the same name as the file with the search results. This file will contain only the tier(s) you had searched within and only the annotations matching the search query. There's no audio file and no other tiers. It's like a ghost tier, haunting the void of empty silence of this lonely .eaf-file.
A lonely ghost tier in an otherwise empty .eaf-file
Save this file and other files currently open in some clever place(s), quit ELAN and then restart ELAN. Sometimes there seems to be a problem for ELAN to accurately see files later on in this process unless you do this. I don't know why this is, but saving, closing and restarting seems to help, so let's just do that :)!
Chris O'Dowd as Roy Trenneman in IT-crowd
Step 4) importing the search results tier into the original file
Now here's where I slightly lied to you: we're not going to import the tier into your file. We're going to merge the search-results-tier-only-file with the other .eaf-file that has all the audio and other tiers and the result is going to be a new .eaf-file. So you'll have three files by the end of this:
  • a) your original .eaf-file with audio and lotsa tiers
  • b) your .eaf-file with only the search results-tier and no audio etc (ghost-tier)
  • c) a new merged file consisting of the two above listed
Don't worry, I've got this.  I'm henceforth going to call these files (a), (b) and (c) as indicated above.

Open file (a). Select "Merge Transcriptions..."

File>Merge >Transcriptions...

Select Merge transcriptions
Now, select file (a) as the current transcription (this is default anyway), file (b) as the second source and choose a name and location for the new file, file (c), in the "Destination" window. You can think of "Destination" as "Save as.." for file (c) - our new file.

Specifying what should be merged and how
Do not, I repeat, do not append. And no need to worry about linked media, because (b) doesn't have any audio or anything (remember, it's a ghost). Just leave all those boxes unchecked.

Let ELAN chug away with the merging, and then you're done!

Step 5) finished!
Tadaaa! We're done! That wasn't so bad, was it? And look at what we've created!

Here's my merged file - file (c). I've taken the search-results tier and renamed it ("famakala"). I also copied it and renamed that one ("famakala - comments"). That way, I have a tier for making comments about the transcription annotation that has the exact same annotation distributions, but different values.
Final merged file in annotation mode, with the search results tier renamed and copied.
Here's the same file in the transcription mode, configured to only show the two tiers targeting the search query:
Final merged file in transcription mode, showing only the search results tiers.
Now, some final notes:
  • You might want to rename file (c) and delete files (a) and (b), for your own sanity later when managing the files, if for nothing else
  • Don't know how to get to transcription mode? Go to "Options>Transcription Mode".
  • Your tiers aren't showing up properly in transcription mode? Check that the "linguistic types" of the tiers are what you think they are and that that's what you've configured to see in transcription mode. Transcription mode can only show you tiers of one linguistic type at once (unless you use columns, but that's complex). I don't really get it either, but then again I barely get "linguistic types" at all
  • Transcription mode getting clogged up with lots of irrelevant tiers? Go to "Configure..." on the left in the transcription mode window, select the right linguistic type and "Select tiers.." in the bottom left. Tick only the tiers you want to see at that moment
  • You can import several tiers at once by this method, you don't have to merge one search result at a time, see below
  • You might want to do something complicated related to speakers, see below
Several tiers at once
You can either search several tiers at once in the search mode and hence have several tiers in the search query output, or you could do several searches separately and then append the resulting tsv-files together afterwards in your spreadsheet-program. If there is a different value in the "Tier" column, ELAN will make several tiers when importing back as an .eaf-file. So, you can do several tiers at once.
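Appending several exports can also be done with a short script instead of a spreadsheet. The sketch below is one possible way, under assumptions: the file names and sample rows are made up, the files are assumed to be UTF-8, and the Tier value is assumed to sit in the second column as in the layout from Step 3.

```python
import tempfile
from pathlib import Path

def combine_exports(paths, destination):
    """Append several tab-delimited search exports into one file.

    Returns the distinct values found in the second column (the Tier
    column in the layout described in Step 3).
    """
    tiers = set()
    with open(destination, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as source:
                for line in source:
                    line = line.rstrip("\r\n")
                    if not line.strip():
                        continue  # skip empty lines between files
                    fields = line.split("\t")
                    if len(fields) > 1:
                        tiers.add(fields[1])
                    out.write(line + "\n")
    return tiers

# Demonstration with two made-up exports written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "search1.txt"
    second = Path(folder) / "search2.txt"
    combined = Path(folder) / "combined.txt"
    first.write_text("a.eaf\ttranscription\t0\t-\t1\t-\t1\t-\tfa:makala?\n",
                     encoding="utf-8")
    second.write_text("a.eaf\tspeaker\t0\t-\t1\t-\t1\t-\tX\n",
                      encoding="utf-8")
    tiers = combine_exports([first, second], combined)
    combined_lines = combined.read_text(encoding="utf-8").splitlines()

print(sorted(tiers), len(combined_lines))
```

The returned set is a quick check of how many tiers to expect after importing the combined file back into ELAN.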

Speaker tiers
Everyone organises their ELAN-files differently. I have a separate tier where I indicate who the speaker is in the annotation (see above screenshots). This is in contrast to how a lot of other people do it, with different tiers for different speakers. This means that I can search many speakers at the same time, or condition the search for "when X is indicated in speaker-ID-tier". 

If you're doing different tiers for different speakers, you might have to figure out something a bit different from me in order to search many speakers at the same time. It's not that difficult though, you just have to meddle a bit with the search query (or just search one speaker at a time). Contact me if you want help.

On a related note, if someone ever was to ask me to do separate speakers in different tiers, I can use the above process to separate out only annotations with a certain value in the speaker-tier and then import them back as tiers per speaker. I'd rather not, I like it this way. But, I like making sure that the way I set things up is possible to configure to please others as well. Flexibility is good, don't lock yourself into a too narrow set-up that doesn't allow you to change without losing data.

That granted, I need to do manual fidgety things for overlapping speech given this model. That's inconvenient, but I'm ok with it.

Short guide
Step 1) Clever searching
Step 2) export search results
  • Query>Export (Save as tab-delimited text file)
Step 3) create new tier
  • File>Import> CSV/Tab-delimited Text file
  • Specify columns (1 col: ignore, 2 col: Tier, 3 col: Begin time, 4 col: ignore, 5 col: end time, 6 col: ignore, 7 col: Duration, 8 col: ignore , 9 col: Annotation)
  • Save new .eaf-file. 
  • Quit and restart ELAN
Step 4) Creating merged file
  • Open original file with audio and other tiers
  • File>Merge transcriptions...
  • Select .eaf-file with search results as second source (do not append)
  • Save new merged file
  • Delete superfluous files
Step 5) done
  • rename and copy tiers if necessary
Questions/comments

I'm sure there's other ways of doing this, but this is what has worked well for me. I'd like this to be easier in ELAN, but in the meantime this works so I'm gonna do it like this.

I find, in general, that I learn more about ELAN and other similar tools by just trying lots of different things and probing the system. Sure, there's manuals, but they often envisage a different usage than I'm after. For example, I'm not clear on what I actually gain by "linguistic types" in what I want to do. Nevermind, probing, searching and sharing seem to be the best way to go for tailored functions. Usually, what you can conceptually imagine as a useful thing exists somewhere (it's like rule 34 but for software). I didn't know how this worked until I thought to myself: "there must be a way of importing search results". And lo and behold, there is. Now here's something I've learned and that you now can do too! Good luck!
Good bye!
Richard Ayoade as Maurice Moss in IT-crowd
* No, I don't know why it is that two linguists who are working/worked on specifically Samoan are trying to teach other linguists to use regular expressions in ELAN. Must be something in the water.
Ulrike Mosel and Hedvig Skirgård (yours truly) in Canberra
Samoan water, Neiafu-Tai village

References
    • Sloetjes, H., & Wittenburg, P. (2008).
      Annotation by category – ELAN and ISO DCR.
      In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
    • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H. (2006).
      ELAN: a Professional Framework for Multimodality Research.
      In: Proceedings of LREC 2006, Fifth International Conference on Language Resources and Evaluation.
    • Brugman, H., Russel, A. (2004).
      Annotating Multimedia/ Multi-modal resources with ELAN.
      In: Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
    • Crasborn, O., Sloetjes, H. (2008).
      Enhanced ELAN functionality for sign language corpora.
      In: Proceedings of LREC 2008, Sixth International Conference on Language Resources and Evaluation.
    • Lausberg, H., & Sloetjes, H. (2009).
      Coding gestural behavior with the NEUROGES-ELAN system.
      Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.591.

Saturday, July 29, 2017

Speakers per language diagram & International Linguistics Olympiad memes

Hello readers of Humans Who Read Grammars,

As well as writing on this blog, I also work with the International Linguistics Olympiad (IOL*). The IOL is a contest for students of secondary school from all over the world where they get to compete in solving linguistic puzzles. Normally in order to explain what the contest is all about I send people to the page with old problem sets, but there's a hip IOL-meme page that's produced some very apt memes that may do a better job at explaining the contest to linguists. I'll paste them in below. (Remember how we started as a meme-based blog for typologists?)

I recently made a post on our blog over there about the dominance of European countries in the contest and language diversity. For that post, I derived a little data visualisation of speaker populations per language (based on the 19th edition of Ethnologue) with infogram. I thought y'all might like it as well, so I'm sharing it here too.

By the way, if you're a linguist who'd like to help keep the contest strong and encourage clever youngsters to get into linguistics, get in touch! There are a lot of countries where there is no contest, or where the contest could well do with some help in thinking of clever problems based on small languages, lecturing etc. Talk to us and we'll figure something out.




Here is a table from Ethnologue that tries to explain this as well, a bit niftier but perhaps less pretty.


Table from Ethnologue summarising the number of speakers per language.




* Yes, the International Linguistics Olympiad is abbreviated "IOL". It's a thing about neutrality, don't worry about it.

Wednesday, July 5, 2017

What languages are grammars of the world written in?

Humans have been writing grammars for a long time. The serious expansion into non-European languages is fairly recent though, and associated with colonialism and Christian missionary work. Because of this, it's interesting to see what language grammars are written in (meta-language) as well as what language they're about (target-language). In the map above, this is precisely what we see - what the meta-languages of Glottolog language descriptions are.

There's roughly 7,000 languages in the world alive today, and we have some kind of description of approximately 4,000 of them. If you want to find them, go and search Glottolog.

Harald Hammarström, one of the editors of Glottolog, recently shared with me some interesting data on these descriptions that I want to share with all of you. In Glottolog, descriptive references are tagged for which language they're in (meta-language) as well as which language they are about (target-language)*. The map above gives the distribution of meta-languages of the descriptions of 4,005 languages in Glottolog. For each language on the map above there is only one dot with only one color. The color is according to the meta-language of the Most Extensive Description for said language**.

In this map we can clearly see the domination of English as a world language, but we can also see the prevalence of French in former French colonies in Africa and naturally the national languages of the modern nation states like Brazil (Portuguese) and Indonesia (Indonesian).

If we look a bit closer at this data we can see exactly how many target-languages there are per meta-language in total, as well as how many documents in Glottolog there are per meta-language. For those documents where it's possible, Hammarström has also compiled a corpus of the actual content text per document and calculated how many types and tokens there are therein.
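For readers who haven't met the terms: tokens are all the running words in a text, and types are the distinct words among them. Here is a minimal sketch in Python; the lowercasing and the simple word-character tokenisation are assumptions for illustration, not necessarily how the Glottolog corpus was processed.

```python
import re

def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for a text."""
    tokens = re.findall(r"\w+", text.lower())  # running words
    return len(set(tokens)), len(tokens)       # distinct words, all words

types, tokens = count_types_and_tokens("the cat saw the other cat")
print(types, tokens)  # 4 types (the, cat, saw, other) and 6 tokens
```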

The table below summarizes this information for all references in Glottolog, i.e. not only the Most Extensive Description per language. There's a total of 96 meta-languages in Glottolog; the table summarizes the 9 most common.
Here is an interactive graphic showing the same data as the table above:


We hope you enjoyed that, be sure to explore Glottolog yourself if you haven't already!


* In bibTeX-entries for Glottolog references, meta-languages have the entry field "inlg" and target-languages have "lgcode". 

** Most Extensive Description is first sorted by descriptive type (Grammar>Grammar Sketch> etc), then number of pages and lastly publication year.
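The ordering in the second footnote amounts to a three-part sort key. Here is a rough sketch in Python with made-up records. Several things are assumptions: only the two ranks named in the footnote are encoded (every other type is placed below them), more pages is taken to win, and the more recent year is taken to win ties, which the footnote does not spell out.

```python
from dataclasses import dataclass

# Only the first two ranks are named in the footnote; any other
# descriptive type falls below them in this sketch.
TYPE_RANK = {"grammar": 0, "grammar_sketch": 1}

@dataclass
class Description:
    title: str    # hypothetical label, for illustration only
    doctype: str
    pages: int
    year: int

def med_key(description):
    """Sort key: descriptive type first, then pages, then year."""
    rank = TYPE_RANK.get(description.doctype, len(TYPE_RANK))
    # Negated so that more pages and (by assumption) later years come first.
    return (rank, -description.pages, -description.year)

def most_extensive(descriptions):
    """Pick the description that sorts first under med_key."""
    return min(descriptions, key=med_key)

candidates = [
    Description("A", "grammar_sketch", 300, 2010),
    Description("B", "grammar", 120, 1985),
    Description("C", "grammar", 450, 1971),
    Description("D", "wordlist", 900, 2015),
]
print(most_extensive(candidates).title)
```

Note that the long word list (D) loses to every grammar, because descriptive type is compared before page count.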

Tuesday, June 27, 2017

New Approaches to Ethno-Linguistic Maps

I’m excited to give a guest blog post here at Humans Who Read Grammars on new methods in language geography.  I’m a geographer by trade, and I am currently a PhD student at the University of Maryland.  I also work for an environmental nonprofit - Conservation International - doing data science on agriculture and environmental change in East Africa.  Before ending up where I am now, I lived for some time in West Africa and the Philippines.  During my time in both of those linguistically-rich areas, I became quite interested in language geographies and linguistics more generally.  Spurred on by curiosity and my disappointment in available resources, I’ve done some side projects mapping languages and language groups, which I’ll talk about here.

Problems with Current Language Maps

A map of tonal languages from WALS.  Fascinating at a global scale, but unsatisfying if you zoom in to smaller regions.
One major issue with most modern maps of languages is that they often consist of just a single point for each language - this is the approach that WALS and Glottolog take.  This works pretty well for global-scale analyses, but simple points are quite uninformative for regional-scale studies of languages.  Points also have a hard time spatially describing languages that have disjoint distributions, like English, or languages that overlap spatially. See here for a more in-depth discussion of these issues from Humans Who Read Grammars.


One reason that most language geographers go for the one-point-per-language approach is that using a simple point is simple, while mapping languages across regions and areas is very difficult.  An expert must decide where exactly one language ends and another begins.  The problem with relying on experts, however, is that no expert has uniform experience across an entire region, and thus will have to rely on other accounts of which language is prevalent where.  This is how, for example, the Murdock Map of African ethno-linguistic groups was created.  As a continental scale map, it is rich and fascinating.  However, when looking more closely at a specific region, the map seems to have problems - how did Murdock know exactly the shape of each little wiggle identifying the boundary between two groups?  What about areas where two different groups overlap?  Other issues can arise when trying to distinguish distinct groups when often the on-the-ground reality is that a language may exist as a dialect continuum, something that subjectively drawing polygons does not readily account for.


These maps can have real import when they form the foundation of other analyses. Researchers have examined whether ethnic diversity in developing countries, and in Africa in particular, can hamper economic development and lead to conflict. Scientists disagree, although many analyses use the Murdock map. See some of this research here, here and here. Another study, recently published in Science, looked at Internet penetration in areas where politically excluded ethnic groups live. They found that groups without political power were often marginalized in terms of internet service provision. However, their data for West Africa, which came from the Ethnic Power Relations database, was quite rough: all of southern Mali was one ethnic group labeled "blacks" while the north was labeled as "Tuaregs" or "Arabs", while there was no data at all for Burkina Faso.  While their findings were important and they did the best that they could with available datasets, a less informed analysis from the same data could end up looking like linguistics done horribly wrong.  We need better ethno-linguistic maps simply to do good social science and address these critical questions.

New Methods and Datasets

I believe that, thanks to greater computational efficiency offered by modern computers and new datasets available from social media, it is increasingly possible to develop better maps of language distributions using geotagged text data rather than an expert’s opinion.  In this blog, I’ll cover two projects I’ve done to map languages - one using data from Twitter in the Philippines, and another using computationally-intensive algorithms to classify toponyms in West Africa.


I should note that for all its hype, big data can be pretty useless without real-world experience.  The Philippines and West Africa are two parts of the world where I have spent a good amount of time and have some on-the-ground familiarity with the languages.  Thus, I was able to use my local knowledge to inform how I conducted the analyses, as well as to evaluate their issues and shortcomings.

Case Study 1: Social Media From The Philippines

Many fascinating language maps from Twitter have been created at global scales - see here, and here.  However, to explore the distribution of understudied languages that don’t show up in maps of global languages, one must use more bespoke methods.  This is especially true of Austronesian languages like those found in the Philippines, which don’t have a lot of phonemic variability, and therefore aren’t easily classified using the methods that Google Translate uses.  These methods, which rely on slices of the sample text, often confuse Austronesian languages like Tagalog and Bahasa - just look at the maps I mentioned above. Thus, I had to use a word-list method, and created word lists from corpora offered by SEAlang, and by scraping from local-language Wikipedia articles.  The resulting maps show exactly where minority languages are used in comparison with English and Tagalog in the Philippines, and likely underestimate the prevalence of minority languages because the corpora used (Wikipedia and the Bible) are quite different from the Twitter data that was classified.


Languages of Tweets in the Philippines.
The resulting map shows about 125,000 tweets in English, Tagalog, Taglish (using Tagalog and English in the same tweet), and the local languages Cebuano, Ilocano, Hiligaynon, Kapampangan, Bikol, and Waray.  This map offers more nuance than traditional language maps of the Philippines.  For example, most maps would show Ilocano over the entire northern part of Luzon, but this map shows that the use of Ilocano is much more robust on the northwest coast than in the rest of the north.  This analysis also allowed me to test a hypothesis that I frequently heard locals assert when in the Philippines - that English is more common in the south, because southerners would rather use English than Tagalog, which is seen as a northern language.  I found this to be the case, and I was only able to confirm it because I had such a large sample size.  Without newer datasets like those offered by social media, this hypothesis would be untestable.


For a more in-depth description of this analysis, see my original blog post here.


Case Study 2: West African Toponyms

Another project I did used toponyms, or place names, from West Africa.  Toponym databases like geonames.org have relatively high spatial resolution - with a name for every populated place in an area.  And while a place name is not as long as a tweet or other linguistic dataset, toponyms do encode ethno-linguistic information.  It would be easy for someone familiar with Europe to distinguish whether a toponym is associated with the French or German linguistic group - a French name might begin with “Les” and end with “-elle”, while a German name could begin with “Der” and end with “-berg”.  Similar differences exist between toponyms from different ethnic groups all over the world, and are quite evident to locals.  What if you could train an algorithm to detect these differences, and then had it classify every single toponym throughout a region?  That is what I tried to do in this analysis.


I used toponyms for six countries in French West Africa. I decided to focus on French West Africa for several reasons. For one, I have worked there, have some familiarity with the ethnic groups of the region and their distributions, and it is an area I am very curious about. For another, this is a relatively poorly documented part of the world as far as ethno-linguistic groups go, and it is an area with significant region-scale ethnic diversity. Finally, the countries I selected were colonized by a single power, meaning that all of the toponyms were transliterated the same way and could be compared even across national borders. In all, I used 35,785 toponyms.


First, I got a list of every sequence of three consecutive letters (called a 3-gram) that occurs in the toponyms.  Then, I tested for spatial autocorrelation in the locations that contained each 3-gram using a Moran's I test, and selected only those 3-grams that had significant clustering.


To give an illustration of why this was necessary, here are two examples of the spatial distribution of 3-grams. One 3-gram - "ama" - occurs roughly evenly throughout the regions in this study. The other 3-gram - "kro" - is very common in toponyms in south-east Côte d'Ivoire, and virtually nonexistent in other areas. Thus, "kro" has significant spatial autocorrelation whereas "ama" does not.


Here are all of the toponyms that contain the 3-gram "kro" 

And here are all of the toponyms that contain the 3-gram "ama" 


Thus, the 3-gram "ama" doesn't tell us much about which ethnic group a toponym belongs to, because that 3-gram is found evenly distributed throughout West Africa - it is just noise. The 3-gram "kro", on the other hand, carries information about which ethnic group a toponym belongs to, because it is clearly clustered in a group in Southeast Côte d'Ivoire.
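The 3-gram filtering step can be sketched roughly as follows.  This is a toy illustration rather than my actual code: the grid of made-up place names, the binary distance weights, the `cutoff` value and the one-sided permutation test are all assumptions chosen to keep the example self-contained.  The formula is the standard global Moran's I, I = (n / W) * Σ w_ij (x_i - x̄)(x_j - x̄) / Σ (x_i - x̄)², where x_i is 1 if toponym i contains the 3-gram and 0 otherwise.

```python
import math
import random

def trigrams(name):
    """Return the set of 3-letter substrings (3-grams) in a place name."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def distance_weights(points, cutoff):
    """Binary weights: 1 if two distinct points lie within `cutoff`, else 0."""
    n = len(points)
    return [[1.0 if i != j and math.dist(points[i], points[j]) <= cutoff else 0.0
             for j in range(n)] for i in range(n)]

def morans_i(values, weights):
    """Global Moran's I for one variable under a given weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(map(sum, weights))
    if denom == 0 or w_sum == 0:
        return 0.0
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    return (n / w_sum) * (num / denom)

def clustering_p_value(values, weights, n_perm=999, seed=0):
    """One-sided permutation test: how often is a shuffled I >= the observed I?"""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical 4x4 grid of invented names: "kro" sits in one corner,
# while "ama" alternates across the whole grid like a checkerboard.
places = []
for x in range(4):
    for y in range(4):
        prefix = "Kro" if x < 2 and y < 2 else "Ba"
        suffix = "ama" if (x + y) % 2 == 0 else "ndi"
        places.append((prefix + suffix, (x, y)))

weights = distance_weights([xy for _, xy in places], cutoff=1.0)
results = {}
for gram in ("kro", "ama"):
    present = [1.0 if gram in trigrams(name) else 0.0 for name, _ in places]
    results[gram] = clustering_p_value(present, weights)
    print(gram, results[gram])
```

On this toy grid "kro" gets a Moran's I of about 0.56 and a small p-value, so it would be kept, while the evenly spread "ama" gets an I of -1 and would be discarded.  With tens of thousands of real toponyms, a dense weight matrix like this one would be too slow, and a spatial statistics library would be the more practical choice.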


I then calculated the lexical distance between all of the toponyms based on the number of shared 3-grams that had significant spatial autocorrelation.  To add a spatial component, I also linked any two toponyms that were less than 25 kilometers apart. Thus, I had a graph where every toponym was a vertex, and undirected edges connected toponyms that had spatial or lexical affinity.  Finally, I used a fast greedy modularity-optimizing algorithm to detect communities, or clusters, in this graph.
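Here is a rough sketch of the graph-building and clustering steps on six made-up toponyms.  Several things are assumptions for illustration only: the place names and coordinates are hypothetical, the `min_shared` threshold of one shared 3-gram is a stand-in for the lexical distance I actually used, and `greedy_communities` is a deliberately naive version of greedy modularity merging, not the optimized fast greedy implementation found in graph libraries.

```python
import math
from collections import Counter

def trigrams(name):
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def build_edges(places, informative, max_km=25.0, min_shared=1):
    """Link two toponyms if they share informative 3-grams or are close."""
    grams = [trigrams(name) & informative for name, _ in places]
    edges = []
    for i in range(len(places)):
        for j in range(i + 1, len(places)):
            lexical = len(grams[i] & grams[j]) >= min_shared
            spatial = haversine_km(places[i][1], places[j][1]) < max_km
            if lexical or spatial:
                edges.append((i, j))
    return edges

def greedy_communities(n_nodes, edges):
    """Repeatedly merge the pair of communities with the best modularity gain."""
    m = len(edges)
    label = list(range(n_nodes))          # community id of each node
    degree = Counter()                    # total degree of each community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    while m > 0:
        between = Counter()               # edge counts between communities
        for u, v in edges:
            a, b = sorted((label[u], label[v]))
            if a != b:
                between[(a, b)] += 1
        best_gain, best_pair = 0.0, None
        for (a, b), links in between.items():
            gain = links / m - degree[a] * degree[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:             # no merge improves modularity
            break
        a, b = best_pair
        label = [a if c == b else c for c in label]
        degree[a] += degree.pop(b)
    groups = {}
    for node, c in enumerate(label):
        groups.setdefault(c, []).append(node)
    return sorted(groups.values())

# Hypothetical toponyms with (lat, lon); "kro" and "ndi" stand in for the
# 3-grams that passed the spatial autocorrelation filter.
places = [("Krobo", (5.30, -4.00)), ("Akrokro", (5.35, -4.05)),
          ("Yaokro", (6.80, -5.30)), ("Ndiaye", (14.70, -17.40)),
          ("Ndiol", (14.75, -17.35)), ("Sanar", (14.72, -17.42))]
edges = build_edges(places, informative={"kro", "ndi"})
communities = greedy_communities(len(places), edges)
print(edges)
print(communities)
```

In this toy example "Yaokro" joins the first cluster purely through the shared "kro" 3-gram, even though it is far from the other two, while "Sanar" joins the second cluster purely through proximity.  That is the point of combining lexical and spatial edges in one graph.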


Results
The algorithm found seven distinct communities, which largely correspond to ethnic groups and ethnic macro-groups in West Africa.




The red cluster includes Wolof, Serer, and Fulfulde place names, which makes sense, as all of these groups speak Senegambian languages. This group of languages is the primary group in Senegal and Mauritania, which my classification picked up on. It also caught the large Fulfulde presence in central Guinea, throughout an area known as the Fouta-Djallon. This cluster also has a significant presence throughout the Sahel, stretching into Burkina Faso and dotted throughout the rest of West Africa, much like the migrant Fulfulde people.


The green cluster captures most of the area where Mandé languages are spoken, including most of Mali, where the Bambara are found, as well as Eastern Guinea and Northern Côte d'Ivoire, where Malinké is found. Interestingly, most of the toponyms in Western Mali fell into the Senegambian/Fulfulde cluster, and were not in the Mandé cluster, even though there are Mandé groups like the Soninké and Khassonké in Western Mali. Southern Guinea is densely green, representing the presence of Mandé groups there, like the Kuranko. Surprisingly, much of central and southern Côte d'Ivoire also fell into the green cluster, even though there are a couple of different groups there which are not in any way related to the Mandé groups that were most represented in the green cluster. This is also true of areas in Western Burkina Faso and Eastern Mali, where there are many languages unrelated to the broader Mandé group, such as Dogon, Bobo, Minianka, and Senufo/Syempire. However, I know that Dyula, a Mandé language closely related to Bambara, is spoken as a trade language in both of these areas (Côte d'Ivoire and Western Burkina Faso). It could be that Dyula has had a long enough presence in these areas to leave an imprint on the toponyms there.


The purple group pretty clearly captured two disjoint groups that are both in the broader Mandé group - the Susu, in far Western Guinea, and the Dan, in Western Côte d'Ivoire. These groups are normally classified as being on quite separate branches of the Mandé language family, with the Susu being Northern Mandé and Dan being Eastern Mandé. However, the fact that the algorithm put them in the same group, even though they were too far apart to have edges/connections based on spatial affinity, shows that Dan and Susu toponyms have several 3-grams in common.


The yellow cluster seems to have caught two sub-groups within the broader green/Mandé cluster. Many of the yellow toponyms in central Mali are in what you could call the Bambara homeland, between Bamako and Segou. However, a second cluster stands out quite distinctly in southern Guinea. It's unclear to me what group this could represent and why it would have toponymic features distinct enough from its neighbors that the algorithm put it in a different cluster. Some maps say that a group called the Konyanka lives here and speaks a language closely related to Malinké.


The turquoise cluster quite clearly captures the Mossi people and their toponyms, as well as the Gurunsi, a related group (both Mossi and Gurunsi are classified as Gur languages).


The black cluster in southern Burkina Faso captured a group that most national ethno-linguistic maps call the Lobi, although this part of West Africa is known for its significant ethno-linguistic heterogeneity. Another group of villages in Eastern Burkina Faso also fell into the black cluster, although I could not find any significant ethnic group found there.


Finally, the blue cluster captured both the Baoulé/Akan languages as well as the Senufo. It captured the Senufo especially in Côte d'Ivoire and somewhat in Burkina Faso, but not much in Mali, where I know the Senufo have a significant presence. This could represent a Bambarization of previously Senufo toponyms because the government of Mali is predominantly Bambara, or it could pre-date the Malian state, as this area was part of Samori Toure's Wassoulou Empire, in which the Malinké language was strongly enforced. The classification of the Senufo languages has always been controversial, but this toponymic analysis suggests that Senufo toponyms are more closely related to Kwa toponyms to the south than to Gur toponyms to the northeast.

Caveats

There are some caveats with this work and its interpretation. For one, this only shows toponymic affinities. Those affinities usually correspond to ethnic distributions, but not always. There is a lot of migration in West Africa today, and place names don't usually change as quickly as the distributions of people. Thus, toponyms can sometimes encode historic ethnic distributions. For example, many toponyms in the United States come from Native American languages, and there are many toponym suffixes in England that reflect a historic Nordic presence. For this reason, this and similar maps are most informative when interpreted in combination with on-the-ground information and knowledge.


Another issue with classifying toponyms in West Africa in particular is that West African toponyms are transcribed using the Latin alphabet, which definitely does not capture all of the sounds that exist in West African languages. Different extensions of the Latin alphabet, as well as an indigenous alphabet, are often used to transcribe these languages. However, these idiosyncratic methods of writing languages are not used in the geonames dataset. Thus, the Fulfulde bilabial implosive (/ɓ/ in IPA) is written the same way as a pulmonic bilabial plosive - as a "b" - so this distinction is lost in our dataset, even though it adds a lot of information about what ethnic group a given toponym belongs to. However, some other sounds and sound combinations that are very indicative of specific languages are captured using the Latin alphabet - for example prenasalized consonants (/mb/) common in Senegambian languages, labial velars (/gb/ and /kp/) common in coastal languages, or the lack of a 'v' in Mandé languages. Issues also arise because different colonizers transcribed sounds differently. For example, 'ny' and 'kwa' in English would be 'gn' and 'coua' in French. However, this didn't apply in this analysis, which only used Francophone countries, and I believe it could be dealt with if I tried to do a larger analysis.
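One way the transliteration issue might be handled in a larger analysis is to rewrite English-style spellings into French-style ones before extracting 3-grams.  The sketch below is only an assumption about how that could look: the mapping contains just the two correspondences mentioned above, the function name `to_french_style` is made up, and a real mapping would need many more rules and careful checking against local knowledge.

```python
# Hypothetical mapping from English-style to French-style spellings, using
# only the two correspondences mentioned in the text. Longer patterns go
# first so that they are replaced before shorter ones.
EN_TO_FR = [("kwa", "coua"), ("ny", "gn")]

def to_french_style(name):
    """Rewrite an English-style transliteration using French conventions."""
    s = name.lower()
    for english, french in EN_TO_FR:
        s = s.replace(english, french)
    return s

# "Nyakwa" is an invented example name, not a real toponym from the data.
print(to_french_style("Nyakwa"))
```

After a normalization like this, toponyms from Anglophone and Francophone countries would at least share the same spelling for these sounds, so their 3-grams could be compared.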

Conclusion


This is an exciting time to be at the intersection of geography and linguistics!  New datasets and computational methods are giving researchers the ability to ask newer and better questions about who belongs to what group, and where.  I hope new developments in this research can yield new linguistic results about phylogeny, migration, and the spread of linguistic phenomena.  Outside of the field of linguistics, better language maps could have broad applications, from improving disaster response planning to helping to answer critical questions about the origins of ethnic conflict.

Thanks for reading! You can check out my personal website for more detailed descriptions of these two projects, as well as other side projects I've done.