Wednesday, July 5, 2017

What languages are grammars of the world written in?

Humans have been writing grammars for a long time. The serious expansion into non-European languages is fairly recent though, and associated with colonialism and Christian missionary work. Because of this, it's interesting to see what language grammars are written in (meta-language) as well as what language they're about (target-language). In the map above, this is precisely what we see - what the meta-languages of Glottolog language descriptions are.

There are roughly 7,000 languages in the world alive today, and we have some kind of description of approximately 4,000 of them. If you want to find them, go and search Glottolog.

Harald Hammarström, one of the editors of Glottolog, recently shared with me some interesting data on these descriptions that I want to share with all of you. In Glottolog, descriptive references are tagged for which language they're in (meta-language) as well as which language they are about (target-language)*. The map above gives the distribution of meta-languages of the descriptions of 4,005 languages in Glottolog. For each language on the map above there is only one dot with only one color. The color is according to the meta-language of the Most Extensive Description for said language**.

In this map we can clearly see the domination of English as a world language, but we can also see the prevalence of French in former French colonies in Africa and, naturally, the national languages of modern nation states like Brazil (Portuguese) and Indonesia (Indonesian).

If we look a bit closer at this data we can see exactly how many target-languages there are per meta-language in total, as well as how many documents in Glottolog there are per meta-language. For those documents where it's possible, Hammarström has also compiled a corpus of the actual content text per document and calculated how many types and tokens there are therein.
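For readers unfamiliar with the terms: tokens are the running words of a text, and types are the distinct word forms among them. Here is a minimal sketch of such a count, assuming a naive tokenizer that lowercases the text and treats every run of letters as a word. The tokenizer and the sample sentence are illustrative assumptions, not the method actually used on the Glottolog corpus.

```python
import re
from collections import Counter

def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for a text.

    Assumption: a token is a run of letters, optionally joined by
    apostrophes, after lowercasing. Real corpora need more care.
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(tokens)
    return len(counts), len(tokens)

sample = "The verb follows the subject and the object follows the verb"
n_types, n_tokens = count_types_and_tokens(sample)
print(n_types, n_tokens)
```

The ratio of types to tokens depends heavily on text length, so raw counts are only comparable across meta-languages when corpus sizes are kept in mind.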

The table below summarizes this information for all references in Glottolog, i.e. not only the Most Extensive Description per language. There's a total of 96 meta-languages in Glottolog; the table summarizes the 9 most common.
Here is an interactive graphic showing the same data as the table above:

We hope you enjoyed that, be sure to explore Glottolog yourself if you haven't already!

* In BibTeX entries for Glottolog references, meta-languages have the entry field "inlg" and target-languages have "lgcode".

** Most Extensive Description is first sorted by descriptive type (Grammar>Grammar Sketch> etc), then number of pages and lastly publication year.
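The three-key ordering in the second footnote can be sketched as a single sort key. This is only an illustration. The dictionary fields, the "wordlist" type standing in for the "etc", and the direction of the page and year comparisons (more pages first, later year first) are all assumptions, since the footnote does not spell them out.

```python
# Rank of each descriptive type; lower is better. Only the first two
# ranks come from the footnote; "wordlist" is an assumed stand-in for "etc".
TYPE_RANK = {"grammar": 0, "grammar_sketch": 1, "wordlist": 2}

def most_extensive_description(references):
    """Pick one reference using the footnote's three sort keys.

    Assumptions: more pages beats fewer, and a later year beats an
    earlier one; the footnote does not state either direction.
    """
    return min(
        references,
        key=lambda ref: (TYPE_RANK[ref["type"]], -ref["pages"], -ref["year"]),
    )

# Hypothetical references for one target-language.
refs = [
    {"title": "A", "type": "grammar_sketch", "pages": 90, "year": 2010},
    {"title": "B", "type": "grammar", "pages": 300, "year": 1975},
    {"title": "C", "type": "grammar", "pages": 450, "year": 1962},
]
best = most_extensive_description(refs)
print(best["title"])
```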

Tuesday, June 27, 2017

New Approaches to Ethno-Linguistic Maps

I’m excited to give a guest blog post here at Humans Who Read Grammars on new methods in language geography.  I’m a geographer by trade, and I am currently a PhD student at the University of Maryland.  I also work for an environmental nonprofit - Conservation International - doing data science on agriculture and environmental change in East Africa.  Before ending up where I am now, I lived for some time in West Africa and the Philippines.  During my time in both of those linguistically-rich areas, I became quite interested in language geographies and linguistics more generally.  Spurred on by curiosity and my disappointment in available resources, I’ve done some side projects mapping languages and language groups, which I’ll talk about here.

Problems with Current Language Maps

A map of tonal languages from WALS.  Fascinating at a global scale, but unsatisfying if you zoom in to smaller regions.
One major issue with most modern maps of languages is that they often consist of just a single point for each language - this is the approach that WALS and Glottolog take.  This works pretty well for global-scale analyses, but simple points are quite uninformative for regional-scale studies of languages.  Points also have a hard time spatially describing languages that have disjoint distributions, like English, or languages that overlap spatially. See here for a more in-depth discussion of these issues from Humans Who Read Grammars.

One reason that most language geographers go for the one-point-per-language approach is that using a point is simple, while mapping languages across regions and areas is very difficult.  An expert must decide where exactly one language ends and another begins.  The problem with relying on experts, however, is that no expert has uniform experience across an entire region, and thus will have to rely on other accounts of which language is prevalent where.  This is how, for example, the Murdock Map of African ethno-linguistic groups was created.  As a continental scale map, it is rich and fascinating.  However, looking more closely at a specific region, the map seems to have problems - how did Murdock know exactly the shape of each little wiggle identifying the boundary between two groups?  What about areas where two different groups overlap?  Other issues can arise when trying to distinguish distinct groups when often the on-the-ground reality is that a language may exist as a dialect continuum, something that subjectively drawing polygons does not readily account for.

These maps can have real import when they form the foundation of other analyses. Researchers have examined whether ethnic diversity in developing countries, and in Africa in particular, can hamper economic development and lead to conflict. Scientists disagree, although many analyses use the Murdock map. See some of this research here, here and here. Another study, recently published in Science, looked at Internet penetration in areas where politically excluded ethnic groups live. They found that groups without political power were often marginalized in terms of internet service provision. However, their data for West Africa, which came from the Ethnic Power Relations database, was quite rough: all of southern Mali was one ethnic group labeled "blacks" while the north was labeled as "Tuaregs" or "Arabs", while there was no data at all for Burkina Faso.  While their findings were important and they did the best that they could with available datasets, a less informed analysis from the same data could end up looking like linguistics done horribly wrong.  We need better ethno-linguistic maps simply to do good social science and address these critical questions.

New Methods and Datasets

I believe that, thanks to greater computational efficiency offered by modern computers and new datasets available from social media, it is increasingly possible to develop better maps of language distributions using geotagged text data rather than an expert’s opinion.  In this blog, I’ll cover two projects I’ve done to map languages - one using data from Twitter in the Philippines, and another using computationally-intensive algorithms to classify toponyms in West Africa.

I should note that for all its hype, big data can be pretty useless without real-world experience.  The Philippines and West Africa are two parts of the world where I have spent a good amount of time and have some on-the-ground familiarity with the languages.  Thus, I was able to use my local knowledge to inform how I conducted the analyses, as well as to evaluate their issues and shortcomings.

Case Study 1: Social Media From The Philippines

Many fascinating language maps from Twitter have been created at global scales - see here, and here.  However, to explore the distribution of understudied languages that don’t show up in maps of global languages, one must use more bespoke methods.  This is especially true of Austronesian languages like those found in the Philippines, which don’t have a lot of phonemic variability, and therefore aren’t easily classified using the methods that Google Translate uses.  These methods, which rely on slices of the sample text, often confuse Austronesian languages like Tagalog and Bahasa - just look at the maps I mentioned above. Thus, I had to use a word-list method, and created word lists from corpora offered by SEAlang, and by scraping from local-language Wikipedia articles.  The resulting maps show exactly where minority languages are used in comparison with English and Tagalog in the Philippines, and likely underestimate the prevalence of minority languages because the corpora used (Wikipedia and the Bible) are quite different from the Twitter data that was classified.
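To illustrate the general idea of a word-list method, here is a minimal sketch: each language gets a set of characteristic words, and a tweet is assigned to the language whose list matches the most tokens. The tiny word lists, the tie-breaking rule and the function name are illustrative assumptions. Real lists built from corpora would hold thousands of words per language, and the actual classifier may have worked differently.

```python
import re

# Tiny illustrative word lists (assumptions); real lists would be built
# from corpora and be far larger.
WORD_LISTS = {
    "English": {"the", "and", "is", "you", "not", "very"},
    "Tagalog": {"hindi", "salamat", "ito", "naman", "kasi"},
    "Cebuano": {"dili", "kaayo", "unsa", "karon", "ug"},
}

def classify(text):
    """Return the language whose word list matches the most tokens.

    Returns None when nothing matches or when the top two languages tie.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in WORD_LISTS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_lang, best), (_, runner_up) = ranked[0], ranked[1]
    if best == 0 or best == runner_up:
        return None
    return best_lang

print(classify("Salamat naman"))
print(classify("Nindot kaayo, dili ba"))
```

A design like this only sees words that appear in its lists, which is one reason a mismatch between the source corpora and tweet vocabulary would lead to minority languages being undercounted.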

Languages of Tweets in the Philippines.
The resulting map shows about 125,000 tweets in English, Tagalog, Taglish (using Tagalog and English in the same tweet), and the local languages Cebuano, Ilocano, Hiligaynon, Kapampangan, Bikol, and Waray.  This map offers more nuance than traditional language maps of the Philippines.  For example, most maps would show Ilocano over the entire northern part of Luzon, but this map shows that the use of Ilocano is much more robust on the northwest coast than in the rest of the north.  This analysis also allowed me to test a hypothesis that I frequently heard locals assert when in the Philippines - that English is more common in the south, because southerners would rather use English than Tagalog, which is seen as a northern language.  I found this to be the case, and I was only able to confirm it because I had such a large sample size.  Without newer datasets like those offered by social media, this hypothesis would be untestable.

To see a more in-depth description of this analysis, you can see my original blog post here.

Case Study 2: West African Toponyms

Another project I did used toponyms, or place names, from West Africa.  Toponym databases have relatively high spatial resolution - with a name for every populated place in an area.  And while a place name is not as long as a tweet or other linguistic dataset, toponyms do encode ethno-linguistic information.  It would be easy for someone familiar with Europe to distinguish whether a toponym is associated with the French or German linguistic group - a French name would likely begin with “Les” and end with “-elle”, while a German name could begin with “Der” and end with “-berg”.  Similar differences exist between toponyms from different ethnic groups all over the world, and are quite evident to locals.  What if you could train an algorithm to detect these differences, and then had it classify every single toponym throughout a region?  That is what I tried to do in this analysis.

I used toponyms for six countries in French West Africa. I decided to focus on French West Africa for several reasons. For one, I have worked there, and have some familiarity with the ethnic groups of the region and their distributions, and it is an area I am very curious about. For another thing, this is a relatively poorly documented part of the world as far as ethno-linguistic groups go, and it is an area with significant region-scale ethnic diversity. Finally, the countries I selected were colonized by one group, meaning that all of the toponyms were transliterated the same way and could be compared even across national borders. In all, I used 35,785 toponyms.

First, I got a list of every possible set of three letters (called a 3-gram) from the toponyms.   Then, I tested for spatial autocorrelation in the locations that contained each 3-gram using a Moran's I test, and selected only those 3-grams that had significant clustering.
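The first step can be sketched in a few lines. This is a minimal illustration, assuming names are lowercased and non-letter characters are dropped before slicing, so that 3-grams may span a space or hyphen. That preprocessing, the function names and the sample place names are assumptions, not details given above.

```python
from collections import defaultdict

def three_grams(name):
    """Return the set of 3-letter substrings of a place name.

    Assumption: the name is lowercased and non-letters are removed first.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return {letters[i:i + 3] for i in range(len(letters) - 2)}

def index_by_three_gram(toponyms):
    """Map each 3-gram to the set of toponyms that contain it."""
    index = defaultdict(set)
    for name in toponyms:
        for gram in three_grams(name):
            index[gram].add(name)
    return index

# Illustrative names only.
index = index_by_three_gram(["Bamako", "Yaokro", "Akakro"])
print(sorted(index["kro"]))
```

With such an index, the locations of every toponym containing a given 3-gram can be looked up and passed on to a spatial autocorrelation test.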

To give an illustration of why this was necessary, here are two examples of the spatial distribution of 3-grams. One 3-gram - "ama" - occurs roughly evenly throughout the regions in this study. The other 3-gram - "kro" - is very common in toponyms in south-east Côte d'Ivoire, and virtually nonexistent in other areas. Thus, "kro" has significant spatial autocorrelation whereas "ama" does not.

Here are all of the toponyms that contain the 3-gram "kro" 

And here are all of the toponyms that contain the 3-gram "ama" 

Thus, the 3-gram "ama" doesn't tell us much about which ethnic group a toponym belongs to, because that 3-gram is found evenly distributed throughout West Africa - it is just noise. The 3-gram "kro", on the other hand, carries information about which ethnic group a toponym belongs to, because it is clearly clustered in a group in Southeast Côte d'Ivoire.
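The difference between a clustered and a non-clustered 3-gram can be put into numbers with Moran's I. Below is a minimal sketch of the statistic itself, assuming planar coordinates and binary weights (two places are neighbours when they are closer than a chosen threshold). Those choices and the toy data are assumptions, and the significance test that a full analysis needs is left out.

```python
import math

def morans_i(points, values, threshold):
    """Moran's I with binary weights.

    w_ij = 1 when points i and j are distinct and closer than
    `threshold`, else 0. `values` holds 1 where a toponym contains
    the 3-gram and 0 where it does not. Raises ZeroDivisionError if
    no pair is within the threshold or all values are identical.
    """
    n = len(points)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) < threshold:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    variance_sum = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_sum)

# Six toy places on a line; only adjacent places count as neighbours.
places = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
clustered = [1, 1, 1, 0, 0, 0]    # the 3-gram occurs at one end only
alternating = [1, 0, 1, 0, 1, 0]  # the 3-gram occurs at every other place
print(morans_i(places, clustered, 1.5), morans_i(places, alternating, 1.5))
```

Positive values indicate clustering and negative values indicate dispersion. A 3-gram scattered at random gives values near the null expectation of -1/(n-1), which is close to zero for large datasets.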

I then calculated the lexical distance between all of the toponyms based on the number of shared 3-grams that had significant spatial autocorrelation.  To add a spatial component, I also linked any two toponyms that were less than 25 kilometers apart. Thus, I had a graph where every toponym was a vertex, and undirected edges connected toponyms that had spatial or lexical affinity.  Finally, I used a fast greedy modularity-optimizing algorithm to detect communities, or clusters, in this graph.
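The graph-building step can be sketched as follows. This is a minimal illustration under stated assumptions: the lexical cut-off `min_shared` is invented here because no threshold is given above, distances use the haversine formula, and the place names and coordinates are rough illustrative values. Community detection itself is not shown. A library routine such as networkx's `greedy_modularity_communities` is one available implementation of greedy modularity optimization that could be run on the resulting edges.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def build_edges(places, informative, max_km=25.0, min_shared=1):
    """Return undirected edges between toponyms with spatial or lexical affinity.

    places: dict mapping name -> (lat, lon)
    informative: set of 3-grams kept after the autocorrelation filter
    min_shared: assumed cut-off on shared informative 3-grams
    """
    def grams(name):
        s = name.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)} & informative

    edges = set()
    for a, b in combinations(sorted(places), 2):
        near = haversine_km(places[a], places[b]) < max_km
        similar = len(grams(a) & grams(b)) >= min_shared
        if near or similar:
            edges.add((a, b))
    return edges

# Illustrative places with rough, made-up coordinates.
places = {
    "Akakro": (6.0, -4.0),
    "Yaokro": (6.1, -4.0),
    "Kokokro": (7.5, -5.5),
    "Bamako": (12.6, -8.0),
    "Kati": (12.7, -8.0),
}
edges = build_edges(places, informative={"kro"})
print(sorted(edges))
```

In this toy example the three "kro" names are linked by shared 3-grams even where they are far apart, while Bamako and Kati are linked only because they are close together.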

The algorithm found seven distinct communities, which definitely correspond to ethnic groups and ethnic macro-groups in West Africa.

The red cluster includes Wolof, Serer, and Fulfulde place names, which makes sense, as all of these are Senegambian languages. This group of languages is the primary group in Senegal and Mauritania, which my classification picked up on. It also caught the large Fulfulde presence in central Guinea, throughout an area known as the Fouta-Djallon. This cluster also has a significant presence throughout the Sahel, stretching into Burkina Faso and dotted throughout the rest of West Africa, much like the migrant Fulfulde people.

The green cluster captures most of the area where Mandé languages are spoken, including most of Mali, where the Bambara are found, as well as Eastern Guinea and Northern Côte d'Ivoire, where Malinké is found. Interestingly, most of the toponyms in Western Mali fell into the Senegambian/Fulfulde cluster, and were not in the Mandé cluster, even though there are Mandé groups like the Soninké and Khassonké in Western Mali. Southern Guinea is densely green, representing the presence of Mandé groups there, like the Kuranko. Surprisingly, much of central and southern Côte d'Ivoire also fell into the green cluster, even though there are a couple of different groups there which are not in any way related to the Mandé groups that were most represented in the green cluster. This is also true of areas in Western Burkina Faso and Eastern Mali, where there are many languages unrelated to the broader Mandé group, such as Dogon, Bobo, Minianka, and Senufo/Syempire. However, I know that Dyula, a Mandé language closely related to Bambara, is spoken as a trade language in both of these areas (Côte d'Ivoire and Western Burkina Faso). It could be that Dyula has had a long enough presence in these areas to leave an imprint on the toponyms there.

The purple group pretty clearly captured two disjoint groups that are both in the broader Mandé group - the Susu, in far Western Guinea, and the Dan, in Western Côte d'Ivoire. These groups are normally classified as being on quite separate branches of the Mandé language family, with the Susu being Northern Mandé and Dan being Eastern Mandé. However, the fact that the algorithm put them in the same group, even though they were too far apart to have edges/connections based on spatial affinity, shows that Dan and Susu toponyms have several 3-grams in common.

The yellow cluster seems to have caught two sub-groups within the broader green/Mandé cluster. Many of the yellow toponyms in central Mali are in what you could call the Bambara homeland, between Bamako and Segou. However, a second cluster stands out quite distinctly in southern Guinea. It's unclear to me what group this could represent and why it would have toponymic features distinct enough from its neighbors that the algorithm put it in a different cluster. Some maps say that a group called the Konyanka lives here and speaks a language closely related to Malinké.

The turquoise cluster quite clearly captures the Mossi people and their toponyms, as well as the Gurunsi, a related group (both Mossi and Gurunsi are classified as Gur languages).

The black cluster in southern Burkina Faso captured a group that most national ethno-linguistic maps call the Lobi, although this part of West Africa is known for its significant ethno-linguistic heterogeneity. Another group of villages in Eastern Burkina Faso also fell into the black cluster, although I could not find any significant ethnic group there.

Finally, the blue cluster captured both the Baoulé/Akan languages as well as the Senufo. It captured the Senufo especially in Côte d'Ivoire and somewhat in Burkina Faso, but not much in Mali, where I know the Senufo have a significant presence. This could represent a Bambarization of previously Senufo toponyms due to the fact that the government of Mali is predominantly Bambara, or it could pre-date the Malian state, as this area was part of Samori Toure's Wassoulou Empire, in which the Malinké language was strongly enforced. The classification of the Senufo languages has always been controversial, but this toponymic analysis suggests that they are more related to Kwa toponyms to the south rather than to Gur toponyms to the northeast.


Some caveats apply to this work and its interpretation. For one, this only shows toponymic affinities. Those affinities usually correspond to ethnic distributions, but not always. There is a lot of migration in West Africa today, and place names don't usually change as quickly as the distributions of people. Thus, toponyms can sometimes encode historic ethnic distributions: for example, many toponyms in the United States come from Native American languages, and there are many toponym suffixes in England that reflect a historic Nordic presence. This and similar maps are therefore most informative when interpreted in combination with on-the-ground information and knowledge.

Another issue with classifying toponyms in West Africa in particular is that West African toponyms are transcribed using the Latin alphabet, which definitely does not capture all of the sounds that exist in West African languages. Different extensions of the Latin alphabet, as well as an indigenous alphabet, are often used to transcribe these languages; however, these idiosyncratic methods of writing languages are not used in the GeoNames dataset. Thus, the Fulfulde bilabial implosive (/ɓ/ in IPA) is written the same way as a pulmonic bilabial plosive - as a "b" - so this distinction is lost in our dataset, even though it adds a lot of information about what ethnic group a given toponym belongs to. However, some other sounds and sound combinations which are very indicative of specific languages are captured using a Latin alphabet - for example prenasalized consonants (/mb/) common in Senegambian languages, labial velars (/gb/ and /kp/) common in coastal languages, or the lack of a 'v' in Mandé languages. Issues also arise with how different colonizers transcribe sounds differently: for example, 'ny' and 'kwa' in English would be 'gn' and 'coua' in French. However, this didn't apply in this analysis, which only used Francophone countries, and I believe it could be dealt with if I tried to do a larger analysis.


This is an exciting time to be at the intersection of geography and linguistics!  New datasets and computational methods are giving researchers the ability to ask newer and better questions about who belongs to what group, and where.  I hope new developments in this research can yield new linguistic results about phylogeny, migration, and the spread of linguistic phenomena.  Outside of the field of linguistics, better language maps could have broad applications, from improving disaster response planning to helping to answer critical questions about the origins of ethnic conflict.

Thanks for reading! You can check out my personal website for more detailed descriptions of these two projects, as well as other side projects I've done.

Thursday, June 1, 2017

World map of language families from Glottolog

World map from Glottolog, each language is one dot and coloured by language family (or other top-genetic unit).
Language families are the main way we categorise and understand the language diversity of the world. A language family is a group of languages that have been analysed as having one ancestor,  one great-great-great-and-yet-greater-grand-mother language. Indo-European is a language family, with the sub-groups of Romance, Germanic, Slavic etc.

Maps are great tools for visualising information, and we're pretty map-nerdy on this blog. Robert Forkel, one of the editors of Glottolog, kindly shared with me an interactive map of the world with languages plotted out and coloured by language family. This map is interactive, rendered in a web browser with an HTML and a JSON file.

This map is not available on the Glottolog site, but will later be implemented in the command-line interface. You can see language families on the website by either selecting a country or a specific family. This tool is the only way to see all language families in all countries on Glottolog. 

I will let you know when this is implemented and you can play with it yourself. In the meantime, I thought I'd share this screenshot and talk a little bit about language families.


Some notes on language families, and in particular Glottolog language families and this map

When we look at the collected wisdom of linguistic scholars, we actually find a lot of disagreement. For example, Ethnologue counts 135 language families and Glottolog 239!* To read more about this, please go to this post on the "other" languages of Glottolog and Ethnologue, and how the two catalogues define these categories.

Due to lack of data and disagreements, we also have very different estimates for language family depth, i.e. how long ago the greatest-grand-mother language was spoken. Here are some examples:

Language family    Proposed date
Afro-Asiatic       9,500-18,000
Algic              7,000
Austronesian       6,000-8,000
Dravidian          6,000
Indo-European      5,500

In this case, we're using the language families (and other top-genetic units) from Glottolog. Glottolog is a carefully curated catalogue of languages, and for each grouping there is always a reference provided to where in the academic literature we can find support for exactly how the tree is structured. This is very helpful. With this said, it's worth noting that Glottolog often tends to be more "splitting" (not lumping languages into very large families) than other similar resources, like Ethnologue. In general, Glottolog often represents a more conservative view of language history.

Glottolog also contains other kinds of groupings besides what we commonly think of as "families", for example: unattested, sign languages, isolates, pidgins, artificial etc. More on this here.

Please remember when you look at these maps that:

  • stacking of dots is not trivial: Nigeria, for example, looks more full of Atlantic-Congo languages than it is (see images below). Zoom in for denser areas
  • the colours on this map were not picked manually, but assigned automatically
  • Creoles are in the family of their lexifier
  • there are other groupings besides traditional language families in the dataset
  • these are dots, not polygons
  • this will be implemented as a command line tool, so you should get your git and python on in order to make these yourself.

Nigeria in the world map at the top of the post
Nigeria zoomed in
Here are some more zoomed in areas for your enjoyment
The island of New Guinea
Mainland South East Asia
Top South America

Language Family Tournament

On a sillier note, the Facebook page Etymology Memes for Reconstructed Phonemes recently ran a tournament where followers could vote for which was their favourite language family from a set of 24. Since this is related to the content of this blog post, I'll share those results as well!
A tournament on Facebook where followers of the page
"Etymology Memes for Reconstructed Phonemes" could vote for which was their favourite language family.
The winner of said contest, Basque
Other ways of categorising languages besides language families
There are other ways of categorising languages than into language families, most notably into geographic areas. It seems that languages that are in contact influence each other. Furthermore, it is not necessarily true that all parts of a language (sound system, vocabulary, grammar, syntax, etc) have one and only one shared ancestry - there could be multiple underlying trees for different parts of language. It may be that the counting system was borrowed from neighbour x and some phonemes imported from neighbour y. Another reason for multiple trees is dialect chains breaking up and coming together again, which is hard to detect given enough time.

Besides these approaches, we can also categorise languages into types (suffixing, tonal, CVCV, VSO, isolating etc). This is what typologists do. Knowing the distribution of various traits in the world's languages, we can not only investigate language history, but also ask questions such as:

  • are certain traits correlated with each other?
  • are there trade-offs between traits, for example to minimize complexity?
  • are there cognitive constraints on combination of traits?

Ok, that's it for now. Hope you enjoyed this!



* In order to make a fair comparison, I've excluded some special cases that the two catalogues deal with in very different ways or that we have very little data on. For Ethnologue, I've excluded: constructed languages (1), creoles (88), deaf sign languages (137), language isolates, mixed languages (21), pidgins (13), and unclassified languages (51). For Glottolog I've excluded pidgins (79), isolates (198), mixed languages (23), artificial (9), speech registers (6), “unattested” (61), “unclassifiable” (117) and sign languages (166). Creoles in Glottolog are classified under their lexifier family, making them hard to count, but they don’t increase the number of families. There are 37 languages with "creole" or "kriol" in their name in Glottolog, but I didn't subtract these since they belonged to families that also contain non-contact languages.

Monday, May 22, 2017

China's dialect quiz shows: some ideas for language games – and language game shows – that you can make

People love languages; there are many game shows on TV which are language-based. Most language quiz shows test contestants on their knowledge of the broadcasting language itself. For instance, English-language channels test contestants on their knowledge of the English language and English literature:

Wheel of Fortune; e.g. CBS News

Have you seen quiz shows on TV where they quiz contestants on languages other than the broadcasting language (or some major international language)? Wouldn't it be cool if there were quiz shows on TV where they test contestants on languages that are rarely or never broadcast?

Here I will talk about the "dialect quiz shows" in China, which I think can be emulated in many other parts of the world. These shows are entertaining for the general audience (if done right), and can help raise the interest in regional languages.

(In this blogpost, there are many pics/gifs of people struggling and failing. They are not suggesting that these quiz shows are impossibly hard; many contestants do very well. It is just that seeing the contestants struggling is... funny, and that their wrong answers often turn out to be more educational than the correct answers provided by contestants who answer them effortlessly.)

"Dialect quiz show"

The "dialect quiz show" is a genre of TV game show that has become somewhat popular in Mainland China in the last few (?) years. Many provinces have their own dialect quiz shows. "Dialect" is a commonly used – but not very good – translation of 方言 fāngyán, which can refer to any regional speech variety that is not the standard. Genealogically related speech varieties which would be considered separate languages in the West are often considered 方言 fāngyán of the same language from a Chinese point of view. The diversity amongst the Sinitic languages ("Chinese dialects") is comparable with the diversity amongst the Romance languages, but in China the Sinitic languages are all regarded as 方言 fāngyán of the same Chinese language. I don't know whether this dialect-quiz-show trend has spread to the (very few) TV-channels in Mainland China that broadcast in non-Chinese languages (e.g.); here I am talking about the dialect quiz shows in Sinitic languages. (From this point onwards, except for the term dialect quiz show, I will just use the word lect to mean any language variety, and avoid the entire language-vs.-dialect debate.)

There are two main types of dialect quiz shows: comprehension-oriented, and production-oriented. The comprehension-oriented quiz shows aim at exposing people to a wide variety of lects within a geographical range (especially lects that people are less likely to encounter), while production-oriented quiz shows aim at increasing the knowledge of one particular lect. There are also other dialect game shows that differ from these two prototypes.

The dialect quiz shows have raised the interest in regional lects, which are all losing speakers to Standard Mandarin, in many cases alarmingly rapidly. For me, the dialect quiz shows are interesting for a number of different reasons:

  • Some of them are entertaining, in a TV-show-kinda way, even if the lects that they quiz on are ones that I am less interested in, and have zero intelligibility of (I have to stick to the Chinese subtitles the entire time);
  • They are interesting to the linguist me; some of the quiz shows are formatted in a way (e.g. more explanations and discussions from the judges) that makes it easier for me to learn and think about the sound changes and semantic changes within the Sinitic language family. Often you also get to learn about the cultures of different places, and sometimes you get to see picturesque video clips;
  • For the comprehension-oriented shows, the contestants are quizzed on a wide range of lects; how well or not well the contestants cope with a particular question gives a proxy for the level of intelligibility of that lect to other Sinitic speakers. (Questions from some lects invoke significantly more 😨 s than others. For instance, I learnt that basically all people in Húnán fear questions from Lóudǐ. Someone should just code the rate of correct/wrong answers for each lect);
The 😬 😧 😑 😱 faces of the team from 汕尾 Swabue/Shànwěi, as they listen to a question on Yuè of 斗门 Dǒumén; 谁语争锋 All about the Dialects; YouTube

  • For the production-oriented shows, sadly, often they show the level of attrition of a particular lect (this is a linguist-grunt, not an old-person-grunt);
A young contestant trying to retrieve in her head how to say 麻雀 máquè 'sparrow' in her native Hokchiu/Fúzhōu (in vain; the answer is 隻隻 [tsieʔ21 ʒieʔ24]); 福州话我最霸 Hók-ciŭ-uâ Nguāi Cói-bá; YouTube
  • I think about the social/linguistic/showbiz factors that make these quiz shows successful in China, and where else in the world such shows would work. These shows – if done well – are great in promoting the interest in regional lects, and increasing the knowledge of them. 
Before we continue, you might want to read this introduction to the modern Sinitic languages. The Language Atlas of China classifies the Sinitic lects into ten "dialect groups": Pínghuà, Yuè, Hakka, Mǐn, Wú, Huī, Gàn, Xiāng, Jìn, and Mandarin. There are also Sinitic lects that are left unclassified (not shown in the map below). There are many lects within each group that are not mutually intelligible. A "dialect group" is not necessarily a valid genealogical node.

Ten Sinitic "dialect groups" à l'Atlas des langues de la Chine; Wikipedia

Unlike India, China tries not to draw provincial boundaries along linguistic lines. The following are the four provinces that I will talk about to some extent in this blogpost: (1) Húnán, (2) Guǎngdōng, (3) Fújiàn, and (4) Shànghǎi Municipality.

Comprehension-oriented dialect quiz shows

In a comprehension-oriented quiz show, the contestants listen to or watch a monologue/ dialogue/ performance in a lect from somewhere within the province, and then they have to answer a question about the monologue/ dialogue/ performance. The contestants give their answer in writing, on an electronic writing board in front of them. Some tasks require the contestants to act out their answer. Usually the contestants are also given the opportunity to further explain their answers verbally.

(Obviously, the linguistic background of the contestants varies; an easy question for one contestant can be a difficult one for another contestant. It is most fun – and educational – when the contestants half-understand something, and get caught in false-friend traps set by the question designers.)

These shows are great, as you get to hear a wide range of lects that you would otherwise never encounter. 
Here I will talk about three shows that I am more familiar with. 

  • 方言听写大会 Dialect Dictation Competition, part of the variety show 越策越开心 More Talk More Happy, from Húnán. They probably started the trend of comprehension-oriented dialect quiz shows. First season in 2013/14, second season in 2015;
  • 多彩中国话 Splendid Chinese Language. In 2016 the Hunannese expanded their dialect quiz show into this new show that covers six provinces: four provinces in the middle Yangtze (Húnán, Húběi to the north, Jiāngxī to the east, Ānhuī to the northeast) and two provinces in the middle Yellow further north (Hénán and Héběi);
  • 谁语争锋 All about the Dialects (Sèuihyúh Jāngfūng), from Guǎngdōng. Unlike nearly all the other quiz shows discussed in this blogpost (and 95%+ of all broadcasting time in Mainland China), which have Standard Mandarin as the baseline broadcasting language, 谁语争锋 All about the Dialects is broadcast in Cantonese. (Obviously, the participants have to at least understand Standard Cantonese; many contestants and judges speak in Mandarin instead.) First season in 2014, second season in 2015, third season in 2016.
Other similar shows include 乡音对对碰 Xiāngyīn Duìduìpèng from Shāndōng, 江苏方言听写大赛 Jiāngsū Fāngyán Tīngxiě Dàsài from Jiāngsū, and 方言达人 Fāngyán Dárén from Sìchuān. (But I have not really watched these.)

The contestants and rules

The first seasons of both 方言听写大会 Dialect Dictation Competition and 谁语争锋 All about the Dialects have contestants from the general public, who apply as teams of five contestants. During the competition, a question is read out, and one contestant from each team answers the question on their own (i.e. they cannot consult their team members). If their answer is correct, they remain for the next question. If their answer is incorrect, they are eliminated from the rest of the game. The winning team is the team that still has contestants remaining at the end. (There are special rules to prevent two or more teams from all being eliminated at once, e.g. a question does not count if all teams get that question wrong, and there are special games when only two teams are left.) The winning team of an episode advances to the semi-final.

From left to right: broadcaster of Standard Cantonese, contestant, actor, and MC. In this last question of the episode, the broadcaster tells a story in Standard Cantonese, and the two contestants have to reenact the story: (pretend) hitting the actor at three different body parts, and in three different manners. This contestant, the last member of his team, unfortunately understood one body-part term less than the other contestant; his team – consisting of international students – was the runner-up out of five teams (the other teams consist of locals). 谁语争锋 All about the Dialects; YouTube

In the second season of 方言听写大会 Dialect Dictation Competition, and in 多彩中国话 Splendid Chinese Language, the contestants are invited-individuals. In 多彩中国话 Splendid Chinese Language, there is one contestant from each of the six provinces. The contestants play at the same time. The contestant that scores the most points is the winner of that episode. 

The second season of 谁语争锋 All about the Dialects is a battle amongst the prefectures of Guǎngdōng province. In each episode, there are two teams, each headed by a broadcaster, now playing as a contestant. (Obviously, there are no questions from their prefectures during that episode.) The team leader pre-selects three other people from their prefecture to be in their team. (Because having higher linguistic diversity in a team increases the likelihood of winning, the team-leaders often interpret this "from their prefecture" requirement very loosely.) The contestants get to play the entire episode, and the team that scores the most points advances to the semi-final.

Brother Orange, as a member of the Méizhōu team. Here he tells the MC in Cantonese that he also speaks a little bit of Teochew [besides his native Hakka]. 谁语争锋 All about the Dialects; YouTube.  
The third season of 谁语争锋 All about the Dialects has invited-individuals or couples as contestants. Only one individual/couple plays at a time. They get at most one question wrong before they are eliminated. After their first wrong answer, they press the red or the blue button, which are randomly assigned to mean 'eliminated' or 'not eliminated'. Contestants who survive after the sixth round advance to the final. (Only a few individuals/couples managed.)

Judges: the broadcasters and the linguists/philologists

In each show there is a panel of broadcasters, one from each prefecture of the province (e.g. news presenters from regional TV; these people have broadcasting degrees). For instance, 方言听写大会 Dialect Dictation Competition has a panel of fourteen broadcasters, one from each of the 14 prefectures in Húnán Province. 谁语争锋 All about the Dialects has a panel of twenty broadcasters, one from each of the 21 prefectures in Guǎngdōng Province except Shenzhen. (In all three seasons, there is no broadcaster from Shenzhen; I guess they couldn't find any broadcasters who are actual locals of Shenzhen.) 多彩中国话 Splendid Chinese Language, covering six provinces, has a panel of 87 broadcasters.

The panel of broadcasters in 多彩中国话 Splendid Chinese Language

A prefecture is "randomly" selected, and a question or a few questions on lects from that prefecture are asked. About half of the time, the broadcaster delivers the question themselves from their lectern. In other instances the question is delivered through a short video clip on the big screen, or someone performing a song or a play in the centre of the stage. In case the broadcaster has to deliver the question, the broadcaster speaks first in whatever lect that question is about (that lect is not necessarily their native lect, or the dominant lect of that prefecture). All the questions I have seen are on Sinitic languages, except for a couple of questions on Iu Mien that were asked in 谁语争锋 All about the Dialects. (The broadcaster from 韶关 Sháoguān is a Mien person, but she is usually tasked with asking questions on Hakka of Sháoguān.)

Above: the broadcaster from 韶关 Sháoguān, in Mien attire, asking a question on Hakka of Sháoguān. Below: the broadcaster from 江门 Kongmoon/Jiāngmén, mouthing in Cantonese wa! gam sām gé! wwaaaa~! 'Wah! So difficult! Waaaaah~!'. 谁语争锋 All about the Dialects; YouTube 
In 谁语争锋 All about the Dialects, sometimes there are questions on Standard Cantonese, the broadcasting language itself. In this case, the questions are usually about some older or obscure terms/idioms.

After the contestants give (and explain) their answers, the answer is revealed, and the broadcaster acts as the judge.

Other than the broadcasters, there are also one to three linguists/philologists acting as the final judge(s) when there are disputes. The linguists/philologists often also give key background facts and other linguistic fun facts related to the questions, e.g. the etymology of the words involved, how something is expressed in various lects, unusual/noteworthy sound correspondences, related gags. 

A contestant running over to hug a philologist-judge. (Thereafter, the philologist let their slightly incorrect answer pass.) 谁语争锋 All about the Dialects; YouTube

Glammed-up linguistics professors (judges for the show), here spreading a language-preservation message. 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition; YouTube

Types of questions/tasks

In the question rounds, each round begins with the contestant watching/ listening to a monologue/ dialogue/ performance in a lect that the contestant may or may not understand. The monologue/ dialogue/ performance is either performed by a broadcaster or some other performers, or presented in a video clip. The contestant is then, e.g.: 
  • asked about the meaning of a particular word or phrase;
  • asked a question whose answer has to be inferred. Some examples: a) after hearing descriptions of a person, the contestant has to infer the emotion of that person (e.g. the contestant has to correctly comprehend whether that person is loved/ has won a game of mahjong/ was sacked/ had food poisoning etc.); b) after hearing descriptions of a person, the contestant has to infer the level of wealth of the person described (false friends galore, trying to figure out what that person has or has not); c) identify the speech act involved, i.e. what they want to get done with that utterance (e.g. this speech by the announcer from Héngyáng, which turns out to be an actual marriage proposal, after the answer was revealed); d) an arithmetic question in a random lect (good luck deciphering the numerals in another lect, e.g., this easier question for the contestants, and this harder question the broadcaster has for his colleagues, plus discussions on the lateral fricative in Wúchuān Yuè); e) the broadcaster uttering verses/ lyrics translated from another lect, and the contestant has to name the author or the name of the poem/ song;
Haunting performance of a cheerful song, translated from Mandarin to Teochew, delivered by the broadcaster from 汕头 Swatow/Shàntóu. The contestants have to give the name of the original version of the song. 谁语争锋 All about the Dialects; YouTube (cf. this original Mandarin version of the song)

    • asked how many homophones there were in the sentence they have just heard and/or what each of the homophones mean;
    • asked to list items that have been mentioned in the monologue/ dialogue/ performance, in a random lect. (A variation of this is a multiple choice question asking what has not been mentioned).
    Girls performing their 茶蛇爬 la13 la13 la13 'tea snake crawl' song in Xiāng of Yīyáng. After singing and acting out 茶 la13, 蛇 la13, and 爬 la13 many times during the song, the contestants were asked to name two out of four types of tea mentioned in the song. 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition; YouTube

    Contestants may be given open questions or multiple-choice questions.

    Other than questions, there are also other types of tasks, e.g.:

    • each contestant has a range of clothing items in front of them; they are then given instructions in a random lect to pick which clothing item, and wear them in what manner;
    Two contestants from Fatshan/Fóshān, from opposing teams, displaying what they have worn after being given the same five instructions from the broadcaster from Zhūhǎi. The discrepancy in their understanding of the instructions is obvious. 谁语争锋 All about the Dialects; YouTube
    • each contestant or each group of contestants has a range of food items in front of them; they are then given instructions in a random lect to pick which food item to eat. They have to eat that item as quickly as possible. The first contestant/group that finishes signals that they have finished (a judge then verifies that they have in fact eaten up the item), and the contestant(s) have to say the name of the food in the broadcasting language (and hope that they have picked the correct food item; example). (If you want to organise this game, make sure that you check with the contestants beforehand what food items they can and cannot eat. And please don't make contestants eat, e.g., an entire chicken. This game is risky.);
    • the contestants are told a short story in a random lect, and the contestants have to reenact the story (example);
    • one contestant from each team has their hearing and sight blocked; their team mates are given some descriptions by the broadcasters, each description in a different lect, and they have to draw what they think is described (e.g. 'that man wears a giant turd-shaped hat', in a random lect). After the drawing session is finished, the first contestants are brought back, and they have to rely on the picture that their team mates have drawn to answer questions from the broadcasters;
    From the same description in Sìyì Yuè, one Hakka team drew a roast duck, while the other Hakka team drew a bracelet; one team has fallen into a false-friend trap. 谁语争锋 All about the Dialects; YouTube
    • Chinese whispers. There are (at least) two variants: a) contestants of the same team are lined up in a row of Chinese-whispers-booths. The broadcaster utters a sentence in a random lect (most likely a lect that none of them speak), and the contestants have to pass this phonetic string along. The last contestant has to pronounce the sentence they heard, and then translate that back to the broadcasting language (example); b) only the first and the last players are contestants; the players in the middle are broadcasters or other guests. The first contestant is shown a word, and then they have to describe the meaning of that word in their own lect, without uttering the word itself, to the person in the next booth. This process is then repeated, with each participant speaking a different lect. In the last booth is the second contestant trying to guess the word that was shown to the first contestant. (This latter variant is the most hilarious, seeing what contents get misinterpreted and invented along the way.)
    Chinese people playing Chinese-whispers (variation b). Here the celebrity guest is speaking in his non-native Cantonese to the broadcaster in the next booth. (Afterwards he switches back to his native Northeastern Mandarin.) 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition; YouTube

    • Similar to above: have a row of broadcasters (possibly including guests, or even other contestants), each speaking their own lect, describing a word/phrase that the contestants cannot see, without uttering the word/phrase itself. The contestant has to guess the word/phrase.

    The contestants failing to understand the broadcaster speaking Gàn of Jí'ān, and choosing to use up their last chance to pass (they have three chances to pass). 多彩中国话 Splendid Chinese Language; YouTube

    Entertainment value
    I find 谁语争锋 All about the Dialects and the second season of 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition more enjoyable: there is the right amount of discussion of the linguistic points by all the participants, the questions are often funny, and there is a larger variety of questions/tasks in each episode. There are also many traditional and modern performances.

    In the first season of 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition, the contestants are not made to explicate their answers, and there were relatively few explanations from the judges, especially with questions that they consider to be the easier ones. (They forgot that even their easiest questions are not easy for viewers not from their province). 

    I find 多彩中国话 Splendid Chinese Language too broad in scope: they have two questions from each of the six provinces, and these twelve questions already take up most of the time in an episode, leaving very little time for other types of game tasks. Also, because this show covers such a wide area, there is much less scope for the MCs to speak in lects other than Standard Mandarin.

    There is a lot of scope for hilarity with the questions, especially with the video clips. Sometimes the contestants are asked about words or phrases that sound funny/wrong/rude in the broadcasting language. I remember seeing a broadcaster teaching the MC some sound correspondence rules; the broadcaster was testing the MC in applying those sound correspondence rules, and – after a few rounds – the MC fell for the trap and ended up saying a strong expletive in the broadcasting language (which was funny for the audience, and censoring it would be even more offensive than not censoring it).

    In the second season of 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition, most contestants give a little performance at the beginning of the show. Some contestants give non-linguistic performances (e.g. traditional dance). On the other hand, many contestants sing or act in their own lect.

    (Unlike other contestants who sing in their mother tongue:) this teenage contestant sings in a lect that puzzles the broadcasters. 越策越开心 方言听写大会 More Talk More Happy Dialect Dictation Competition; YouTube. (It turns out: the contestant is singing in reconstructed Early Middle Chinese.)

    谁语争锋 All about the Dialects has a theme song that (is too long, but) has each prefecture singing about four lines each in their respective lect(s), cleverly highlighting lexical and phonetic differences. (Even if you know nothing about these lects, you can still tell, e.g., which areas have lateral fricatives, retroflexes, nasalised vowels, rounded front vowels, unrounded back vowels.) The music video is also very picturesque.

    Production-oriented dialect quiz shows
    In these shows, the contestants are tested on their knowledge of one particular lect. Most of the tasks require the contestants to produce that lect "correctly", and sometimes also "entertainingly".

    As you can guess, these shows are aimed at people who have deep interest in a particular lect. The following are some examples of production-oriented quiz shows that I have watched a bit of:

    • 福州话我最霸 Hók-ciŭ-uâ Nguāi Cói-bá (I can only find snippets of this online). This seems to be a show in the early 2010s. This show tests contestants on Hokchiu, and the broadcasting language is Hokchiu.
    • 福建方言文化大赛 Fujian Dialect and Culture Competition. This show has four streams for four different lects in Fújiàn (Hokkien) province: Southern Mǐn, Hokchiu (Eastern Mǐn), Hinghwa (Púxiān Mǐn), and Chángtīng Hakka. In later episodes, the best players in each stream form teams of four contestants (one from each stream) to play in a combined quiz show. The baseline broadcasting language is Standard Mandarin.
    • 阿拉乓乓响 Aqla Pángpangxiang (e.g. YouTube), a contest on Shanghainese. Despite being a production-oriented quiz show on one single lect, the baseline broadcasting language is Standard Mandarin. For instance, the MC announces that contestants will get a yellow card (penalty) each time they speak Mandarin, in Mandarin, and she continues to speak in Mandarin most of the time. 

    Similar to the comprehension-oriented quiz shows, the production-oriented quiz shows also have broadcasters and linguists/philologists as judges. However, unlike comprehension-oriented quiz shows where broadcasters from many different places are needed, production-oriented quiz shows require only a few judges.

    Common game tasks include:

    • Reading out words or phrases on the screen. For instance, this Uyghur contestant reading out words/phrases in Shanghainese. Native speakers do not necessarily do very well either (in this episode, all contestants are native speakers, except the first Uyghur contestant and the last Mandarin contestant), given that many speakers – especially younger speakers – are trained by the education system to associate literacy with only Mandarin. In most places people switch to Mandarin or a locally prestigious lect when they have to say something more formal or less common; they do not necessarily think how these rarer words are meant to be pronounced in their own lect;  
    • Interpreting words and sentences from Mandarin to their lect, with as few Mandarin-influences as possible/reasonable. For instance, this contestant interpreting from Mandarin into his native Hokkien (Southern Mǐn). (Of course, low competence in interpreting between two languages does not imply low competence in the two languages);
    • Interpreting an entire video clip. For instance, the same contestant interpreting from Mandarin into his native Hokkien (Southern Mǐn);
    • Quiz on language and oral/written literature;
    • Quiz on local social/cultural phenomena; 
    • Have one contestant describe a word/phrase shown on the screen to another contestant who cannot see the screen, and the latter has to guess that word/phrase.

    Other dialect game shows

    1) There are dialect talent shows, where performances must involve speaking/singing in local lects (these shows can also include some game tasks like the ones discussed above).

    2) In Fújiàn there is 方言欢语游福建 Dialect Fun through Fujian, which is a The-Amazing-Race-type game show, except that 方言欢语 Dialect Fun is not a contest: it is more a touristic show, and more about how the players cope with the tasks, most of which involve them learning the local lect. The two players, playing as a team, are taken to a city/town/village. They are given a clue to find the next checkpoint guardian. (There is no physical checkpoint to find; the checkpoint is just an unmarked person.) The clue is usually a word or phrase in the local lect, the meaning of which has to be figured out along the way. When they find the checkpoint guardian, they have to perform some task before they are given the next clue. Most of the checkpoint tasks are linguistic, e.g. learning and giving a small speech in the local lect, learning and performing a song in the local lect, solving a riddle in the local lect (the players will have to figure out what the riddle means first), learning a phrase in the local lect and using it as a pass-phrase to obtain some item from someone. Some tasks are not primarily linguistic, e.g. learning to do some traditional activity like tea ceremony or wood carving. During these tasks that involve learning traditional activities, the players are taught words associated with the activities, and they might be tested on those words as part of the task.

    The relationship between the tasks is often not linear. For instance, one task might require them to collect various items (e.g. map pieces), which the players have to obtain from other checkpoints (which they have to find through clues, and each checkpoint has a task of its own). A song learnt at one checkpoint might be required at more than one checkpoint.

    Disappointed players, being told to go back to the preceding checkpoint and relearn the lyrics of a song in the local lect (Northern Mǐn of Wǔyíshān), before the actual task of this checkpoint is given. (They forgot the lyrics while trying to find this checkpoint; they mumbled through most of the lyrics. Their music and choreography were reasonable though.) 方言欢语游福建 Dialect Fun through Fujian; 16:40

    What makes them work, and where else they might work

    The competition amongst Mainland Chinese TV channels is cut-throat; there are more than 3000 TV channels in Mainland China. All provincial TV stations have at least one channel that can be viewed throughout the country, and some even have international versions of their channels that you can pick up on foreign cable/satellite packages. (And despite YouTube being banned in Mainland China, China Central Television [CCTV] and some Chinese provincial TV networks have official YouTube channels.)

    One way to stand out amongst the many provincial channels is to highlight local culture in a way that is accessible to audiences all over the country. This is one of the reasons why the dialect quiz shows all have Standard Mandarin as the baseline broadcasting language, even for production-oriented quiz shows that focus on just one lect. (Besides that, there are also severe government restrictions on broadcasting in languages other than Standard Mandarin. As for 谁语争锋 All about the Dialects broadcasting in Cantonese, and that there is a Cantonese satellite TV channel in Mainland China: the Cantonese literally said f you and continued to speak Cantonese.)

    The inspiration behind the comprehension-oriented quiz shows was perhaps the first season (2013) of 中国汉字听写大会 Chinese Characters Dictation Competition on CCTV-1, which is like the Chinese version of the American spelling bee: contestants listen to a word or phrase in Standard Mandarin, and they have to write it correctly in Chinese characters. In Mainland China, the most successful TV network after CCTV is Hunan TV, the showbiz trend-setter in Mainland China. Hunan TV, in late 2013, adopted the format of the Chinese Characters Dictation Competition, and had the contestants listen to the local lects in Húnán Province in its 方言听写大会 Dialect Dictation Competition. Many other provinces followed suit in 2014, and made them more entertaining. However, in 2016, despite being immensely popular, 多彩中国话 Splendid Chinese Language was abruptly cancelled after nine (?) episodes (Because China). We will see whether there are more dialect quiz shows in 2017 and beyond.

    These shows are successful for a number of reasons:

    • people clearly feel the endangerment of their regional lects, and they feel uneasy about their traditions not being passed on to younger generations;
    • the entertainment value of the shows: people like jokes (e.g. words in another lect that sound "funny/wrong/rude" in the broadcasting language), and songs/dances/plays in lects that they care about;
    • audiences are listening to lects that they can relate to. So, although the intelligibility of the lects featured in the shows ranges from 100% to 0% (e.g. Cantonese speakers listening to Teochew is like, I guess, English speakers listening to Faeroese):
      A Cantonese contestant head-slapping after listening to a nursery rhyme in Teochew sung by three broadcasters (and the question hasn't even been asked yet). 谁语争锋 All about the Dialects; YouTube
      at least they are listening to lects from the same language family, and many would understand explanations given by the linguists afterwards, especially on the cognacy between the words in question and words in other lects that the audience know. (Some contestants are good at working out sound correspondence rules from these explanations, and applying them for later questions.) 谁语争锋 All about the Dialects occasionally had questions from Iu Mien, which is not even Sinitic; if it is a video-clip question, there is some scope for guessing from the visuals. (However – for the "easier" questions at least – the TV producers are known to quite often create misleading visuals/contexts...)

    Can you guess from this gif the meaning of Ya Yi Bu Bie Bia in Iu Mien? 谁语争锋 All about the Dialects; YouTube (hint: Swadesh-type basic vocabulary)

    I can see comprehension-oriented quiz shows being popular in many other parts of the world; the key is testing contestants mostly on lects that are genealogically related (same family) to the lects that the audience understand or are interested in. Many European nation states have the right amount of lect variation for such shows to be interesting. There can also be pan-family/pan-regional comprehension-oriented quiz shows. Some examples are: 

    • the Celtic-language TV channels joining forces to create a Pan-Celtic quiz show;
    • a show covering the lects in the Dutch-speaking-sphere;
    • Auckland (ping: Māori Television) can easily have a Pan-Polynesian comprehension-oriented quiz show. (Perhaps also include Fijian and Rotuman, or even further.) A Māori production-oriented quiz show would also be nice. I guess Honolulu (ping: ʻŌiwi TV) would have no problem pulling in a Polynesian+Micronesian show;
    • I think South Africa has a good amount of linguistic diversity for a comprehension quiz show.
    Obviously, very few endangered languages have a TV station of their own. Some have broadcasting time with mainstream TV channels; some share a TV station with many other indigenous languages (e.g. Taiwan Indigenous Television). Even if it is not an hour-long quiz show, a 5 or 10 minute quizlet would still be fun. Many endangered languages have radio broadcasting time, and radio language quiz shows are also entirely feasible. (Some of the TV shows mentioned above also have radio versions.) With the internet, (provided you have electricity/internet/good enough equipment to work with) it is easy to broadcast in languages that in the past had no or very little chance of being heard by the wider audience. There are many different ways to have language quizzes for endangered languages in the media.

    May the lesser-broadcasted languages flourish in the media, and be loved by everyone.

    (Tip: for classrooms/parties, in places with good internet/wifi/mobile network, try Kahoot! for multiple-choice quizzes (Google Play; iTunes; Windows 10; quiz maker on Windows 10) (faq).)

    (Tangent thought: wouldn't it be cool if there is a LingQuest-like game show? Or a glamorous – and less intense – IOL-like game show?)