East Asia Student

Random Stuff Related to East Asia


Chinese Ngrams on Google Labs

Google Labs recently announced an exciting new feature: Ngrams.

It uses data from Google Books (which contains billions of words in several languages) to produce graphs comparing the frequency of words over time.
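As a rough illustration of what "frequency of words over time" could mean in practice, here is a minimal sketch that assumes frequency is simply a term's count in a given year divided by the total number of tokens published that year. The function name `yearly_frequency` and the toy corpus are hypothetical, invented for this example; this is not Google's actual pipeline, and it assumes the text has already been split into words.

```python
from collections import Counter


def yearly_frequency(books, term):
    """Return {year: share of that year's tokens that equal `term`}.

    `books` is an iterable of (year, tokens) pairs, where `tokens` is a
    list of already-segmented words. This is an assumed, simplified model.
    """
    hits = Counter()    # occurrences of `term` per year
    totals = Counter()  # all tokens per year
    for year, tokens in books:
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == term)
    # Skip years with no tokens to avoid dividing by zero.
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Hypothetical toy corpus: (publication year, tokenised text).
toy_books = [
    (1949, ["西藏", "的", "山"]),
    (1950, ["新疆", "和", "西藏", "的", "路"]),
    (1950, ["西藏", "的", "河"]),
]

tibet = yearly_frequency(toy_books, "西藏")
xinjiang = yearly_frequency(toy_books, "新疆")
```

Dividing by the yearly total is what makes years comparable: a term can appear more often in absolute terms simply because more books were published, without becoming any more prominent.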

A well-written research paper and website have already been produced as a taster of what could become a huge new research opportunity for linguistic and social scientists.

Disappointingly, the corpus available for Ngrams includes only one East Asian language. This is listed as ‘Chinese (simplified)’, which is a little strange, as character variants don’t make different languages. However, the distinction is understandable: the two character sets overlap heavily across national and historical boundaries, so combining them would be a computational nightmare.

Chinese Ngrams

Xinjiang and Tibet

[caption id=“attachment_1380” align=“aligncenter” caption=“Frequency of ‘新疆’ (blue) and ‘西藏’ (red), 1900 to 2000”][/caption]

The graph comparing the frequency of ‘新疆’ (Xinjiang) with ‘西藏’ (Tibet) shows the two terms rising and falling quite consistently together, with Xinjiang pulling ahead in recent years. Interestingly, when the PRC took over Tibet in 1950, the frequency of the term in Chinese does not change dramatically.

[caption id=“attachment_1381” align=“aligncenter” caption=“Frequency of ‘Tibet’ (red) and ‘Xinjiang’ (blue), 1900 to 2000”][/caption]

Compare the Chinese Ngrams with the equivalent terms in English. English books have paid comparatively little attention to ‘Xinjiang’, although the data does not extend to 2008, when it gained notoriety for unrest among the Uyghur ethnic minority.

‘Tibet’, however, leaps up in frequency from 1950 onwards. Then, during the 1980s, use of the term declines slightly, perhaps due to increased focus on civil unrest in China’s core provinces and cities.

Taiwan and the Mainland

[caption id=“attachment_1384” align=“aligncenter” caption=“Frequency of ‘大陆’ (blue) and ‘台湾’ (red), 1900 to 2000”][/caption]

These terms are very difficult to compare, as 大陆 does not necessarily refer to Mainland China, whereas 台湾 is specifically Taiwan. Unfortunately, there doesn’t seem to be data for the more exclusive terms 中华人民共和国 (People’s Republic of China) and 中华民国 (Republic of China).


[caption id=“attachment_1385” align=“aligncenter” caption=“Frequency of ‘中国’, 1900 to 2000”][/caption]

This is an extremely interesting graph. Frequency of the Chinese word for ‘China’ (中国) trails off from the beginning of the century until around 1940, when it makes a long, steady rise back up.

Why this might be is likely very complicated, but it could well be to do with the chaotic few decades after the collapse of the Qing dynasty, including the Warlord Era. The country was frequently divided and re-formed, and this uncertainty may explain declining use of the word.

Another angle is that the patriotism and nationalism inspired and employed by the Communist Party could have led to increased use of the word.

English Ngrams on China and East Asia

East Asian nations

[caption id=“attachment_1387” align=“aligncenter” caption=“Frequency of ‘China’, ‘Japan’ and ‘Korea’, 1800-2000”][/caption]

This is another interesting graph, showing the frequency of the English names of three East Asian nations from 1800 to 2000. Again, events that one would expect to make a significant difference, such as the Opium Wars, do not seem to affect the frequency.

There is a large peak, though, for ‘Japan’ around the time of the atomic bombings, but there are also leaps for ‘China’ and ‘Korea’ at this time, perhaps suggesting that the Second World War in general contributed to the spike.

The Korean War seems not to have caused increased use of the word ‘Korea’ in books - it may not just be ‘the Forgotten War’, but the ‘Not Mentioned War’.


[caption id=“attachment_1389” align=“aligncenter” caption=“Frequency of ‘Orient’ (blue) and ‘East Asia’ (red), 1800 to 2000”][/caption]

One final graph, which probably best demonstrates the descriptive power of Ngrams. This one compares the English terms ‘Orient’ and ‘East Asia’ from 1800 to 2000. The huge rise of ‘Orient’ from around 1870 onwards, followed by its fairly rapid decline and loss of ground to ‘East Asia’ from 1940 onwards, is fascinating.

It’s also interesting that both terms experienced a decline in the 1940s, but ‘East Asia’ pulled out of this while ‘Orient’ did not. Why this is the case is open to speculation.