Installing MeCab on Fedora 15 and Ubuntu 11.04 and later

MeCab is a very useful open source part-of-speech and morphological analyzer for Japanese text, written by Taku Kudo. Here is an example of the part-of-speech information (noun, verb, etc.) and other details that it provides:

% mecab -N2
今日もしないとね。
今日 名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
も 助詞,係助詞,*,*,*,*,も,モ,モ
し 動詞,自立,*,*,サ変・スル,未然形,する,シ,シ
ない 助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
と 助詞,接続助詞,*,*,*,*,と,ト,ト
ね 助詞,終助詞,*,*,*,*,ね,ネ,ネ
。 記号,句点,*,*,*,*,。,。,。
EOS
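For readers who want to consume this output programmatically, here is a minimal Python sketch that splits one output line into named fields. It assumes the IPADIC-style layout shown above (surface form, then nine comma-separated features); the English field names and the helper parse_mecab_line are my own labels for illustration, not part of MeCab.

```python
from typing import NamedTuple, Optional


class Morpheme(NamedTuple):
    # Field names are illustrative labels, not MeCab identifiers.
    surface: str
    pos: str
    pos_detail: str
    conjugation_type: str
    conjugation_form: str
    base_form: str
    reading: str
    pronunciation: str


def parse_mecab_line(line: str) -> Optional[Morpheme]:
    """Parse one line of default MeCab output; return None for EOS or blank lines."""
    line = line.strip()
    if not line or line == "EOS":
        return None
    # The surface form is separated from the features by whitespace (a tab in real output).
    surface, features = line.split(None, 1)
    f = features.split(",")
    # Unknown words may carry fewer features, so pad with the "*" placeholder.
    f += ["*"] * (9 - len(f))
    return Morpheme(surface, f[0], f[1], f[4], f[5], f[6], f[7], f[8])


if __name__ == "__main__":
    m = parse_mecab_line("し 動詞,自立,*,*,サ変・スル,未然形,する,シ,シ")
    print(m.surface, m.pos, m.base_form)
```

The base form field is usually the most useful one for search and counting, since it maps inflected forms such as し back to する.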

 

Installing MeCab packages

Fedora

Use the Add/Remove Software installer

Screen shot--selecting MeCab packages for Fedora 15


Ubuntu

The Ubuntu Software Center offers just one mecab choice, and I am not sure whether it includes the bindings, so I do not use it. Searching for mecab in Synaptic is another option, but apparently Synaptic will not be part of the base Ubuntu install in the future. The easiest approach is therefore to run apt-cache search mecab to see what is available, then install the packages with sudo apt-get install mecab mecab-utils mecab-naist-jdic mecab-ipadic-utf8 mecab-jumandic-utf8 libmecab-java libmecab-jni python-mecab libmecab-ruby1.9.1

 

Set Up Bindings to Your Dev Language

Using SWIG (traditional way)

For C Ruby

Everything should be ready at this point. To get the sample test.rb to work, just make sure the encoding is declared at the start of your script with # encoding: utf-8

 

For Java and Scala

Add MeCab.jar, found at /usr/share/java/MeCab.jar on both Fedora and Ubuntu, to your build path. In Eclipse, right-click the project and select Build Path, then choose Add External Archives and pick /usr/share/java/MeCab.jar

You also need to load libMeCab, which can be done in two ways.

In your program, use the line System.load("/usr/lib/jni/libMeCab.so"); for Ubuntu 11.04 or System.load("/usr/lib/libMeCab.so"); for Fedora 15.

Or you could add it to the library path on your machine (or use LD_LIBRARY_PATH, etc.). Setting it on the machine has the advantage of making it available to all users, but I prefer to load it explicitly in my code so that when I am working on a different machine and forget to do this, the cause of the error is readily apparent.

Java sample code for MeCab. Warning: you may need to select UTF-8 encoding to get the Japanese to display properly.

Not using SWIG

For JRuby and even C Ruby, see Natto http://code.google.com/p/natto. It is a much more Rubyesque solution. Simply run gem install natto
I needed to apt-get install libdrb-ruby, liberb-ruby and rdoc to get this to work under C Ruby on Ubuntu 11.04.

cmecab-java http://code.google.com/p/cmecab-java/ For those interested in using mecab as a tokenizer with Lucene and Solr, this is an ideal solution. For those just interested in a Java binding to mecab, the advantage seems to be that you can designate the encoding when creating the tagger. I have not yet had the opportunity to try it out, so I can't offer any opinion on how well it works.

Please note that Fedora 15 seems to be using version 0.98 of mecab, where the chasen parameter does not work but everything else does. Ubuntu 11.04 seems to be using version 0.97, where the chasen parameter does work.


No Comments

Pitfalls of Statistical Translation on a Large Scale

An interesting article from the Atlantic about one of the shortcomings of statistical translation, which seems to be hitting a wall of diminishing returns in its ability to improve, at least in the way Google currently implements it. Google is dropping an automatic-translation tool because overuse by spam bloggers is flooding the Internet with sloppily translated text, which in turn makes machine translation trained on translated text from the Internet even sloppier.


Half- And Full-width Characters In CJK (and normalization)

Latin characters in half-width and full-width
Asia Ａｓｉａ
Katakana in half-width and full-width
ｱｼﾞｱ アジア
Kanji full-width only
The terms half-width and full-width refer to the relative width of a character's glyph. The distinction matters for CJK languages because most of their characters are too complex to display in anything less than full width. Half-width characters go back to the early days of computing, when single-byte representation of characters was the norm and Japanese and Korean computer manufacturers displayed very limited ranges of their languages' characters at low resolution in single-byte encodings. In fact, some people still refer to half-width characters as single-byte characters and full-width characters as double-byte characters, but this should be discouraged because it has not necessarily been true for some time: half-width characters may be encoded using multiple bytes, and full-width characters may be encoded using more than two bytes, depending on the encoding scheme.
As exciting as the history of byte encoding is, we are not going to cover it here. Suffice it to say that character width should not be handled by character code mapping; it belongs in fonts, style sheets, or other display markup. The only reason the issue is still with us today is that, for backwards compatibility, the Unicode Consortium decided to include the half-width/full-width distinction in its mappings. So instead of taking some admittedly significant pain around conversion five or ten years ago, we are going to have to deal with the half- and full-width issue for the foreseeable future. This is surprising considering the outstanding job of rationalization and simplification that the Unicode Consortium has managed to pull off in almost all other areas of the extremely complicated subject of CJK character encoding.
What’s the issue?
Why is half- and full-width such an issue? Because the width distinction is handled in the character code mapping, all CJK text needs to be normalized before searching, sorting, filtering, matching, etc., or these operations will not return the expected results. People searching for a word expect all results to be returned, not just those in the half-width form they happened to enter in their query, or vice versa if they entered the search term in full-width form.
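To make the problem concrete, here is a small Python 3 sketch of a naive substring search that misses matches when the query and the text use different width forms. The sample documents and the helper naive_search are made up for illustration; the code points are written as escapes so the width of each form is unambiguous.

```python
# Hypothetical mini "index" of documents stored in the usual forms:
# full-width Katakana and plain ASCII Latin.
documents = ["東アジアの歴史", "Asia trip report"]


def naive_search(query, docs):
    """Return the documents that contain the query as an exact substring."""
    return [d for d in docs if query in d]


fullwidth_kana = "\u30A2\u30B8\u30A2"         # Katakana "Asia", full-width form
halfwidth_kana = "\uFF71\uFF7C\uFF9E\uFF71"   # same word in half-width Katakana
fullwidth_latin = "\uFF21\uFF53\uFF49\uFF41"  # "Asia" in full-width Latin letters

print(naive_search(fullwidth_kana, documents))   # finds the first document
print(naive_search(halfwidth_kana, documents))   # finds nothing
print(naive_search(fullwidth_latin, documents))  # finds nothing
```

The half-width and full-width spellings are different code point sequences, so an exact comparison treats them as unrelated strings even though a reader sees the same word.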
Chinese, Japanese and Korean
In Chinese, half- and full-width are usually referred to as 半角 (bànjiǎo) and 全角 (quánjiǎo) in mainland China and 半形 (bànxíng) and 全形 (quánxíng) in Taiwan. In Japanese, the terms are 半角 (hankaku) and 全角 (zenkaku). In Korean, the terms 반각 (bangak) and 전각 (jeongak) are used.
Half- and full-width distinction is less of an issue for Chinese than Japanese and Korean because it only applies to Latin characters, symbols, and numbers for Chinese. In the case of Japanese and Korean, the distinction also applies to Katakana (Japanese) and Hangul (Korean) characters, in addition to Latin characters, numbers, etc.

RELEVANT UNICODE RANGES FROM UNICODE 6.0
The width size column in the table below is a gross simplification but is sufficient for our purposes. Those interested in understanding East Asian Width in more detail should refer to Unicode Standard Annex #11.
Range | Content | Width Size
0x0020-0x007F | ASCII (Latin characters, symbols, punctuation, numbers) | HALF-WIDTH
0x1100-0x11FF | Hangul Jamo (Korean) | FULL-WIDTH
0x3000-0x303F | CJK punctuation | FULL-WIDTH
0x3040-0x309F | Hiragana (Japanese) | FULL-WIDTH
0x30A0-0x30FF | Katakana (Japanese) | FULL-WIDTH
0x3130-0x318F | Hangul Compatibility (Korean, for KS X 1001 compatibility) | FULL-WIDTH
0xAC00-0xD7AF | Hangul Syllables | FULL-WIDTH
0xF900-0xFAFF | CJK Compatibility Ideographs block | FULL-WIDTH
0xFF00-0xFFEF | Full-width Latin characters and half-width Katakana and Hangul | HALF-WIDTH AND FULL-WIDTH
0x4E00-0x9FFF | CJK unified ideographs, common and uncommon | FULL-WIDTH
0x3400-0x4DBF | CJK unified ideographs Extension A, rare | FULL-WIDTH
0x20000-0x2A6DF | CJK unified ideographs Extension B, very rare | FULL-WIDTH
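The width of an individual character can also be looked up programmatically. The following Python 3 sketch uses the standard library's unicodedata.east_asian_width, which returns the East Asian Width property from Unicode Standard Annex #11. Collapsing the property into a simple full-width yes/no, as the helper is_full_width does, is my own simplification in the same spirit as the table.

```python
import unicodedata

# Property values returned by unicodedata.east_asian_width():
# F = fullwidth, H = halfwidth, W = wide, Na = narrow, A = ambiguous, N = neutral


def width_class(ch: str) -> str:
    """Return the East Asian Width property of a single character."""
    return unicodedata.east_asian_width(ch)


def is_full_width(ch: str) -> bool:
    """Treat 'F' and 'W' as full-width (a deliberate simplification)."""
    return width_class(ch) in ("F", "W")


samples = [
    "A",       # ASCII Latin letter
    "\uFF21",  # full-width Latin letter A
    "\uFF71",  # half-width Katakana A
    "\u30A2",  # full-width Katakana A
    "\u4E9C",  # a CJK unified ideograph
    "\uD55C",  # a Hangul syllable
]

for ch in samples:
    print(f"U+{ord(ch):04X} {width_class(ch)} full-width={is_full_width(ch)}")
```

Note that the property distinguishes the compatibility forms in the 0xFF00-0xFFEF block (F and H) from characters that are simply wide or narrow by nature (W and Na).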

NORMALIZATION
The next question is how to best normalize text to ignore the differences between half-width and full-width forms when doing searches, sorts, etc.
Consulting the table from the previous section shows that there is always a full-width version of a character but not always a half-width version, so at first thought it might seem easiest to convert everything to full-width for normalization.
However, that is probably not the best solution, for two reasons. First, a large amount of software for English and other languages using the Latin character set targets only the character codes of the ASCII half-width versions, and that software would all have to be rewritten. Second, Latin characters generally look ugly in full-width form.
For those reasons, the preferable normalization solution is to convert all Latin characters, symbols, punctuation and numbers to half-width ASCII form and everything else to full-width form.
Example code in Java for half and full-width normalization is included in an appendix at the end (note Java has this functionality built in but I have included this code for explanatory purposes). This code is simple and can easily be implemented in different computer languages that do not have Unicode Normalization built-in.
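As a rough illustration of the rule just described, here is a minimal Python 3 sketch (it is not the Java appendix code). It assumes two things: full-width ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their ASCII counterparts, and runs of half-width Katakana and punctuation (U+FF61 to U+FF9F) can be widened by applying NFKC to just those runs so that voiced sound marks combine correctly. Half-width Hangul is not handled, and the function name normalize_width is hypothetical.

```python
import re
import unicodedata

FULLWIDTH_ASCII_FIRST = 0xFF01  # full-width exclamation mark
FULLWIDTH_ASCII_LAST = 0xFF5E   # full-width tilde
ASCII_OFFSET = 0xFEE0           # distance between full-width and ASCII forms
IDEOGRAPHIC_SPACE = "\u3000"

# Runs of half-width CJK punctuation and Katakana, including voiced sound marks.
HALFWIDTH_KANA_RUN = re.compile("[\uFF61-\uFF9F]+")


def normalize_width(text: str) -> str:
    """Latin, digits and symbols to half-width ASCII; Katakana to full-width."""
    # Widen each half-width Katakana run as a unit so that a base character
    # followed by a voiced sound mark composes into one full-width character.
    text = HALFWIDTH_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )
    out = []
    for ch in text:
        cp = ord(ch)
        if FULLWIDTH_ASCII_FIRST <= cp <= FULLWIDTH_ASCII_LAST:
            out.append(chr(cp - ASCII_OFFSET))
        elif ch == IDEOGRAPHIC_SPACE:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    mixed = "\uFF21\uFF53\uFF49\uFF41\u3000\uFF71\uFF7C\uFF9E\uFF71"
    print(normalize_width(mixed))
```

Unlike whole-string NFKC, this touches only width variants and leaves other compatibility characters alone, which may or may not be what a given application wants.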
Unicode Standard Annex #15 defines normalization forms for Unicode text (covering more than just half- and full-width normalization). Without going into the details, the Unicode Normalization Form most useful for normalizing searches, etc. is NFKC: half- and full-width differences are normalized away, and different compatibility equivalents of a single CJK character result in the same string.
Built-in Normalization
Many computer languages and applications have the needed normalization built in. For example, in Java, which has Unicode Normalization Form functionality built in, it is as easy as
String normalized_string = java.text.Normalizer.normalize(unnormalized_string, java.text.Normalizer.Form.NFKC);
Microsoft offers a similar functionality for Unicode normalization of strings. IBM products such as search offer equivalent normalization. For example, in Japanese, a full-width alphanumeric character is normalized to the half-width character, a half-width Katakana character to the full-width character, and so on.
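Python offers the same built-in functionality through the standard unicodedata module. The short Python 3 sketch below mirrors the Java one-liner; the wrapper name nfkc is only for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)


fullwidth_latin = "\uFF21\uFF53\uFF49\uFF41"   # "Asia" in full-width Latin letters
halfwidth_kana = "\uFF71\uFF7C\uFF9E\uFF71"    # "Asia" in half-width Katakana
fullwidth_digits = "\uFF11\uFF12\uFF13"        # "123" in full-width digits
compat_ideograph = "\uF900"                    # a CJK compatibility ideograph

print(nfkc(fullwidth_latin))    # half-width ASCII letters
print(nfkc(halfwidth_kana))     # full-width Katakana
print(nfkc(fullwidth_digits))   # ASCII digits
print(f"U+{ord(nfkc(compat_ideograph)):04X}")  # the unified ideograph it maps to
```

Normalizing both the indexed text and the query with the same function before comparing them is what makes width variants match.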
Complexity of Normalization
Normalization of Unicode can be quite complex. Take for example the simple space character with all its different variants in Unicode.
ASCII 0020  SPACE
sometimes considered a control code
other space characters: 2000  –200A 
→ 00A0  no-break space
→ 200B  zero width space
→ 2060  word joiner
→ 3000  ideographic space
→ FEFF  zero width no-break space
For this and a large number of other reasons, I strongly recommend using Unicode NFKC normalization from a standard library whenever possible. Where this is not possible please see the code sample below.
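The space variants listed above are a good test of what a library does and does not do for you. The Python 3 sketch below checks each one against NFKC from the standard unicodedata module: the visible space variants fold to U+0020, while the zero-width characters have no compatibility mapping and pass through unchanged. Stripping the zero-width characters afterwards, as the hypothetical clean_spaces helper does, is an extra step of my own and may not suit every application.

```python
import unicodedata

# Space variants from the list above that NFKC maps to a plain U+0020 space.
SPACE_VARIANTS = (
    ["\u00A0"]                                    # no-break space
    + [chr(cp) for cp in range(0x2000, 0x200B)]   # U+2000 through U+200A
    + ["\u3000"]                                  # ideographic space
)

# Zero-width characters that NFKC leaves untouched.
ZERO_WIDTH = ["\u200B", "\u2060", "\uFEFF"]


def clean_spaces(text: str) -> str:
    """NFKC-normalize, then drop zero-width characters (an extra, optional step)."""
    text = unicodedata.normalize("NFKC", text)
    for zw in ZERO_WIDTH:
        text = text.replace(zw, "")
    return text


for ch in SPACE_VARIANTS + ZERO_WIDTH:
    normalized = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} -> U+{ord(normalized):04X}")
```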
JAVA CODE


A Quick Introduction to CJK Writing Systems & Unicode.

The Internationalization Activity Lead at the W3C has a great and, considering the topic, concise introduction to Chinese, Japanese and Korean writing systems and Unicode. Highly recommended.


Using Python with Chinese, Japanese and Korean

Python 3.0 has been out for a while and handles Unicode quite well, but unfortunately many applications and libraries, including the Natural Language Toolkit, have not yet made the move to 3.0. Those who want to use Chinese, Japanese or Korean with Python 2.6 or 2.7 need, at the very least, to start their Python session or script with the following:
>>> # -*- coding: utf-8 -*-
>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
Although the bash shell on most modern Linux releases handles Unicode without a problem, there are often issues with UTF-8 at the command prompt and in PowerShell on Windows, so the Eclipse Python IDE is probably the best free, widely available choice for editing. The Eclipse Python IDE seems to default to and handle 'utf-8' automatically even when the Python version is lower than 3.0, but the # -*- coding: utf-8 -*- comment is still necessary for proper 2.x Python.
A quick example of two Python scripts using Chinese, the first for Python 2.6
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import pprint, re, sys
reload(sys)
sys.setdefaultencoding('utf-8')
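def unicode_pp(obj):
    # Pretty-print obj, then turn \uXXXX escapes back into readable characters.
    def unicode_unquoter(match):
        # match.group(1) holds the four hex digits of one \uXXXX escape.
        return unichr(int(match.group(1), 16))
    pp = pprint.PrettyPrinter(indent=4, width=120)
    formatted = pp.pformat(obj)
    return re.sub(r'\\u([0-9a-f]{4})', unicode_unquoter, formatted)
print(sys.getdefaultencoding())
print("%s%s" % (u"Python语言", sys.version_info))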
print('')
data = {
    u"面向对象语言":
        {u"C++": u"C++语言", u"Java": u"爪哇语言", u"Python": u"Python语言"},
    u"脚本语言":
        {u"Shell": u"外壳脚本程序", u"VBScript": u"VB脚本语言", u"Python": u"Python语言"}
    }
print(unicode_pp(data))

and the second for Python 3.X.

#!/usr/bin/python3
import sys
print(sys.getdefaultencoding())
print("Python语言", sys.version_info)
print('')
data = {
	"面向对象语言":
		{"C++": "C++语言", "Java": "爪哇语言", "Python": "Python语言"},
	"脚本语言":
		{"Shell": "外壳脚本程序", "VBScript": "VB脚本语言", "Python": "Python语言"}
	}
print(data)
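One thing the Python 3 script above relies on implicitly is that str is already Unicode, so the encoding only matters at the boundaries (files, sockets, terminals). Here is a minimal Python 3 sketch of being explicit at those boundaries; the helper roundtrip_via_file and its temporary file are purely illustrative.

```python
import os
import tempfile

text = "Python语言"

# len() counts Unicode code points, not bytes.
print(len(text))

# Encoding is explicit: each of the two Chinese characters takes 3 bytes in UTF-8.
utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))


def roundtrip_via_file(s: str) -> str:
    """Write s to a temporary file as UTF-8 and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Passing encoding= avoids depending on the platform's default encoding.
        with open(path, "w", encoding="utf-8") as f:
            f.write(s)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


print(roundtrip_via_file(text) == text)
```

Naming the encoding in open() is the Python 3 counterpart of the 2.x workarounds shown earlier, and it behaves the same on Linux and Windows.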


Natural language processing tools for Japanese

A list of useful natural language processing tools for Japanese, packaged for Ubuntu and maintained by Eric Nichol: http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/lucid/japanese/. It doesn't seem to have been updated since the last long-term Ubuntu release (10.04 Lucid Lynx). However, at least for MeCab, that is not really an issue, as the version in the Ubuntu Software Center repositories seems to work fine from version 10.10 Maverick Meerkat onward, which was not always the case in previous releases of Ubuntu.
