This is a list of Linux terminal commands I’ve come across that are often useful for studying languages.
In most cases they’ve been provided by helpful users at Ubuntu Forums.
Remove duplicate lines from a file
This command is pretty useful in general, but is particularly handy for quickly removing duplicate entries from vocabulary lists.
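sort file | uniq > output
In this command, file should be replaced with the input file, and output with the name of the new file which will have duplicate lines removed. One copy of each line is kept, and the output is sorted. (Note that uniq -u would do something different: it drops every line that has a duplicate, rather than keeping one copy of it.)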
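Sorting changes the order of the list, which may matter if, say, a vocabulary list is arranged by lesson. As a rough sketch, a common awk idiom prints each line only the first time it appears and leaves the order alone. The function name dedupe_keep_order is only an illustrative placeholder.

```shell
# Hypothetical helper: print each line of a file only the first time it
# appears, keeping the original order of the lines.
dedupe_keep_order() {
    # seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++
    # is true only for first occurrences, and awk prints those lines.
    awk '!seen[$0]++' "$1"
}

# Example usage (file and output are placeholders):
# dedupe_keep_order file > output
```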
Remove all ASCII characters from a file
Removing non-ASCII characters is quite a common requirement, but this command does the opposite. It’s useful if you’ve got a file with a mix of Latin and CJK characters, and you want to leave only the CJK ones (e.g. delete everything except Chinese characters in a file).
sed -e 's/[0-9a-zA-Z]//g' -e 's/[[:punct:]]//g' -e '/^$/d' file > output
Again, file is the input file and output is the file that will contain only CJK characters. Note that this isn’t perfect for isolating CJK characters – it will leave pretty much any unusual characters in the output file.
Get all lines with non-ASCII characters from a file
This command will go through a file and output any lines containing non-ASCII characters (such as Chinese, Japanese or Korean). This could be useful if you want to keep ASCII characters that appear on the same line as CJK ones, rather than completely removing everything except CJK (as the command above does).
grep -P '[^\x00-\x7F]' file > output
Again, replace file with the input file, and output with the desired output file name.
Remove all lines containing X from a file
This command is useful for stripping out lines from a file that contain a specific term. I use it most often for removing lines marked with the word ‘simplified’ in vocab lists; it works well when you have tagged lists like that.
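sed -e "/text/d" file > output
Replace text with the term to match, file with the input file, and output with the desired output file name.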
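One caveat is that sed treats the search term as a regular expression, so characters such as . or [ have special meanings. If the term should be matched literally, a possible alternative is grep with -v (invert the match) and -F (fixed string). The function name here is just a placeholder.

```shell
# Hypothetical helper: print every line of a file that does NOT contain
# the given text, treating the text literally rather than as a regex.
remove_lines_containing() {
    # $1 = literal text to look for, $2 = input file
    # -v inverts the match, -F disables regular expression handling,
    # and -- stops a term starting with "-" being read as an option.
    grep -vF -- "$1" "$2"
}

# Example usage (file and output are placeholders):
# remove_lines_containing 'simplified' file > output
```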
Convert a .MO file to a readable .PO one (plain text)
A lot of software translations use MO and PO files to store a translated interface. The MO is the machine-readable one and is fairly useless if you want to get at the translations yourself. Use this command to convert an MO file into a plain text PO file:
msgunfmt file -o output
You’ll often want to get at the data in multiple .mo files at once, and this is easy to do by putting *.mo as the file in the command above. The command will still output the text into a single file.
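If separate output files are preferable to one combined file, a small loop can convert each .mo file to a .po file of the same name. This is only a sketch: it assumes the gettext tools are installed, and mo_to_po_each is an invented name.

```shell
# Hypothetical helper: convert every .mo file in a directory into a .po
# file with the same base name, next to the original.
mo_to_po_each() {
    for mo in "$1"/*.mo; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$mo" ] || continue
        # ${mo%.mo} strips the .mo suffix so .po can be appended.
        msgunfmt -o "${mo%.mo}.po" "$mo"
    done
}

# Example usage, converting the .mo files in the current directory:
# mo_to_po_each .
```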
Find the location of translation (.mo) files
To get hold of .mo files, you can either download language packs (e.g. language-pack-zh) and go through them with an archive browser, or you can search for the .mo files already in use in your installation. You can use the following command to do this:
dpkg -L language-pack
Replace language-pack with the name of the installed language pack you’re interested in (e.g. language-pack-zh). The command lists every file installed by that package, which includes the locations of its .mo files.
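To narrow things down to only the .mo files, one option is to filter the dpkg output, and another is to search a directory directly with find. The sketch below shows both. /usr/share/locale is the conventional location for gettext catalogs, but whether a given language pack installs there is an assumption, and find_mo_files is a made-up name.

```shell
# Hypothetical helper: list all .mo files underneath a directory.
find_mo_files() {
    find "$1" -type f -name '*.mo'
}

# Example usage (the directory is an assumption, adjust as needed):
# find_mo_files /usr/share/locale

# Alternatively, keep only the .mo paths from the dpkg listing:
# dpkg -L language-pack | grep '\.mo$'
```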
If you notice a mistake, or have another useful command to add, please share it in the comments.



I think I mentioned before that I was delighted to find you are not only a fellow Chinese learner but also a fellow Ubuntu user. I enjoy all your posts, including those about Chinese and Linux. Keep up the great blog.
You can add iconv to convert the encoding of these pesky GB text files.
And for something a little bit more advanced, you may be interested in a command to count unique characters in a text file:
sed 's/\(.\)/\1\n/g' text.txt | sort | uniq -c | sort -rn
Download mp3 file from Google TTS:
On a Mac or in Linux in a Terminal window:
curl -A Mozilla -o "/Path/To/Audios/Filename.mp3" "http://translate.google.com/translate_tts?ie=utf-8&tl=zh&q=快点儿吧,再有一个小时就要考试了。"
Just change the language code (zh for Chinese above) to match the language of the text after q=.
Language codes: http://bit.ly/you7rx (not all work)
Other example (in German (de)):
curl -A Mozilla -o "/Path/To/Audios/Filename.mp3" "http://translate.google.com/translate_tts?ie=utf-8&tl=de&q=Ich+bin+müde."
Spaces within the text must be converted to %20 or +.
More info: http://bit.ly/A8XMHh
On Windows you need curl: http://curl.haxx.se/download.html