East Asia Student

Random Stuff Related to East Asia


Useful terminal commands for language students

This is a list of Linux terminal commands that I’ve come across that are often useful for studying languages.

In most cases they’ve been provided by helpful users at Ubuntu Forums.

Remove duplicate lines from a file

This command is pretty useful in general, but is particularly handy for quickly removing duplicate entries from vocabulary lists.

awk '!x[$0]++' file > output

In this command, file should be replaced with the input file, and output with the name of the new file which will have duplicate lines removed.
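
As a quick illustration, here is a minimal sketch that runs the command on a made-up vocabulary list. The file names vocab.txt and vocab_unique.txt and the list contents are assumptions for the example only.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Hypothetical vocabulary list in which one entry appears twice.
printf '%s\n' '你好 hello' '谢谢 thanks' '你好 hello' '再见 goodbye' > "$workdir/vocab.txt"

# x[$0]++ counts how often each whole line has been seen so far;
# the leading ! makes awk print a line only the first time it appears.
awk '!x[$0]++' "$workdir/vocab.txt" > "$workdir/vocab_unique.txt"

cat "$workdir/vocab_unique.txt"
```

The first copy of each line is kept and the original order is preserved, so the list doesn’t need to be sorted first.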

Remove all ASCII characters from a file

Removing non-ASCII characters is quite a common requirement, but this command does the opposite. It’s useful if you’ve got a file with a mix of Latin and CJK characters, and you want to leave only the CJK ones (e.g. delete everything except Chinese characters in a file).

sed -e 's/[0-9a-zA-Z]//g' -e 's/[[:punct:]]//g' -e '/^$/d' file > output

Again, file is the input file and output is the file that will contain only CJK characters. Note that this isn’t perfect for isolating CJK characters - it leaves spaces in place, and pretty much any unusual characters will remain in the output file.
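
To see what the command keeps and what it drops, here is a small sketch on an invented mixed file. The file names and contents are assumptions for illustration.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Hypothetical input: a mixed CJK/Latin line, a Latin-only line
# containing a space, and a Latin-only line with no spaces at all.
printf '%s\n' '你好 hello' 'good morning!' 'abc123' > "$workdir/mixed.txt"

# 1st expression strips ASCII letters and digits, 2nd strips punctuation,
# 3rd deletes any line that is completely empty afterwards.
sed -e 's/[0-9a-zA-Z]//g' -e 's/[[:punct:]]//g' -e '/^$/d' "$workdir/mixed.txt" > "$workdir/cjk_only.txt"

cat "$workdir/cjk_only.txt"
```

The abc123 line disappears entirely, but the second line survives as a single space because spaces are not removed. That is one way the result can be less clean than you might expect.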

Get all lines with non-ASCII characters from a file

This command will go through a file and output any lines containing non-ASCII characters (such as Chinese, Japanese or Korean). This could be useful if you want to keep ASCII characters that appear on the same line as CJK ones, rather than completely removing everything except CJK (as the command above does).

grep -P "[\x80-\xFF]" file > output

Again, replace file with the input file, and output with the desired output file name.
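
Here is a hedged sketch of the same idea on an invented word list. It uses a slightly different pattern, [^\x00-\x7F] (anything outside the ASCII range), on the assumption that it gives the same result whether grep reads the input as raw bytes or as UTF-8 characters. That pattern and the file names are assumptions made for the example, and -P needs a grep built with Perl-regex support (GNU grep usually is).

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Hypothetical word list: two ASCII-only lines and two lines with CJK.
printf '%s\n' 'apple' '苹果 apple' 'pear' '梨' > "$workdir/words.txt"

# Print every line containing at least one character outside the ASCII range.
# Assumption: [^\x00-\x7F] is a variant pattern chosen for this sketch.
grep -P '[^\x00-\x7F]' "$workdir/words.txt" > "$workdir/non_ascii_lines.txt"

cat "$workdir/non_ascii_lines.txt"
```

Whole lines are kept, so the ASCII word on the same line as the CJK text stays in the output.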

Remove all lines containing X from a file

This command is useful for stripping out lines from a file that contain a specific term. I use it most often for removing lines marked with the word ‘simplified’ in vocab lists; it works well when you have tagged lists like that.

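sed -e "/text/d" file > output

Replace text with the term you want to match; file and output are the input and output files, as before.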
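
For example, a minimal sketch on an invented tagged list (the file names, entries and tag layout are assumptions for illustration):

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Hypothetical vocab list where each entry is tagged with its script.
printf '%s\n' '爱 simplified' '愛 traditional' '国 simplified' '國 traditional' > "$workdir/tagged.txt"

# Delete every line that contains "simplified" anywhere in it.
sed -e "/simplified/d" "$workdir/tagged.txt" > "$workdir/traditional_only.txt"

cat "$workdir/traditional_only.txt"
```

The match is a plain substring, so a line containing a longer word such as ‘oversimplified’ would be deleted as well.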

Convert a .MO file to a readable .PO one (plain text)

A lot of software translations use MO and PO files to store a translated interface. The MO is the machine-readable one and is fairly useless if you want to get at the translations yourself. Use this command to convert an MO file into a plain text PO file:

msgunfmt file -o output

You’ll often want to get at the data in multiple .mo files at once, and this is easy to do by putting *.mo as the file in the command above. The command will still output the text into a single file.

Find the location of translation (.mo) files

To get hold of .mo files, you can either download language packs (e.g. language-pack-zh) and go through them with an archive browser, or you can search for the .mo files already in use in your installation. You can use the following command to do this:

dpkg -L language-pack

Replace language-pack with the name of the installed language pack you’re interested in, and the command will list every file that package has installed, including the locations of its .mo files.

If you notice a mistake, or have another useful command to add, please share it in the comments.