Installing MeCab on Fedora 15 and Ubuntu 11.04 and later
Posted by James Sullivan in Japanese on August 18, 2011
MeCab is a very useful open source part-of-speech and morphological analyzer for Japanese language text. It was written by Taku Kudo. Here is an example of the part-of-speech information (noun, verb, etc.) and other details that it provides:
% mecab -N2
今日もしないとね。
今日 名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
も 助詞,係助詞,*,*,*,*,も,モ,モ
し 動詞,自立,*,*,サ変・スル,未然形,する,シ,シ
ない 助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
と 助詞,接続助詞,*,*,*,*,と,ト,ト
ね 助詞,終助詞,*,*,*,*,ね,ネ,ネ
。 記号,句点,*,*,*,*,。,。,。
EOS
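Because every output line pairs a surface form with a comma-separated feature list, the output is easy to post-process. The following is a minimal Python sketch, assuming the IPADIC-style layout shown above (nine features per token, with the base form in the seventh position). The function name parse_mecab_output and the sample string are hypothetical and exist only for illustration. The sketch does not call mecab itself.

```python
def parse_mecab_output(text):
    """Split mecab's default output into (surface, features) pairs.

    Hypothetical helper for illustration; assumes one token per line,
    with whitespace (normally a tab) between surface form and features.
    """
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the end-of-sentence marker
        if not line or line == "EOS":
            continue
        surface, feature_str = line.split(None, 1)
        tokens.append((surface, feature_str.split(",")))
    return tokens


# Two lines copied from the listing above, joined with tabs and newlines
sample = (
    "今日\t名詞,副詞可能,*,*,*,*,今日,キョウ,キョー\n"
    "し\t動詞,自立,*,*,サ変・スル,未然形,する,シ,シ\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech, features[6] the base form
    print(surface, features[0], features[6])
```

Splitting on the first run of whitespace keeps the sketch working whether the separator is a tab or a space.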
Installing MeCab packages
Fedora
Use the Add/Remove Software installer
Ubuntu
The Ubuntu Software Center has just one mecab choice, and I am not sure whether it includes the bindings, so I do not use it. Searching for mecab in Synaptic is another option, but apparently Synaptic will not be part of the base Ubuntu install in the future. The easiest thing to do is to run apt-cache search mecab to see what is available, then install the packages with:
sudo apt-get install mecab mecab-utils mecab-naist-jdic mecab-ipadic-utf8 mecab-jumandic-utf8 libmecab-java libmecab-jni python-mecab libmecab-ruby1.9.1
Set Up Bindings to Your Dev Language
Using SWIG (traditional way)
For C Ruby
Everything should be ready at this point. To get the sample test.rb to work, make sure the encoding is declared at the start of your script with # encoding: utf-8
For Java and Scala
Add MeCab.jar, found at /usr/share/java/MeCab.jar on both Fedora and Ubuntu, to your build path. In Eclipse, right-click the project and select Build Path, then select Add External Archives and choose /usr/share/java/MeCab.jar
You also need to load libMeCab. This can be done in two ways.
In your program, using System.load("/usr/lib/jni/libMeCab.so"); on Ubuntu 11.04 or System.load("/usr/lib/libMeCab.so"); on Fedora 15.
Or you could add it to the path on your machine (or use LD_LIBRARY_PATH, etc.). Setting it on your machine has the advantage of making it available to all users of the machine, but I prefer to load it explicitly in my code so that when I am working on a different machine and forget this step, the cause of the error is readily apparent.
Java sample code for MeCab. Warning: you may need to select UTF-8 encoding to get the Japanese to display properly.
Not using SWIG
For JRuby and even C Ruby, see Natto: http://code.google.com/p/natto. It is a much more Rubyesque solution. Simply run gem install natto
I needed to apt-get install libdrb-ruby, liberb-ruby and rdoc to get this to work under C Ruby on Ubuntu 11.04.
cmecab-java http://code.google.com/p/cmecab-java/ For those interested in using mecab as a tokenizer with Lucene and Solr, this is an ideal solution. For those just interested in a Java binding to mecab, the advantage seems to be that it lets you designate the encoding when creating the tagger. I have not yet had the opportunity to try it out, so I can’t offer any opinion on how well it works.
Please note: Fedora 15 seems to be using version 0.98 of mecab, where the chasen parameter does not work but everything else does. Ubuntu 11.04 seems to be using version 0.97, where the chasen parameter does work.
Pitfalls of Statistical Translation on a Large Scale
Posted by James Sullivan in Statistical Translation on June 14, 2011
An interesting article from the Atlantic about one of the shortcomings of statistics-based translation, which seems to be hitting a wall of diminishing returns in its ability to improve, at least in the way Google currently implements it. Google is dropping an automatic-translation tool because overuse by spam bloggers is flooding the Internet with sloppily translated text, which in turn makes computerised translation trained on translated text from the Internet even sloppier.
Half- And Full-width Characters In CJK (and normalization)
Posted by James Sullivan in Chinese, Japanese, Korean on May 22, 2011
| Range | Content | Width |
|---|---|---|
| 0x0020-0x007F | ASCII (Latin characters, symbols, punctuation, numbers) | HALF-WIDTH |
| 0x1100-0x11FF | Hangul Jamo (Korean) | FULL-WIDTH |
| 0x3000-0x303F | CJK punctuation | FULL-WIDTH |
| 0x3040-0x309F | Hiragana (Japanese) | FULL-WIDTH |
| 0x30A0-0x30FF | Katakana (Japanese) | FULL-WIDTH |
| 0x3130-0x318F | Hangul Compatibility Jamo (Korean, for KS X 1001 compatibility) | FULL-WIDTH |
| 0xAC00-0xD7AC | Hangul Syllables | FULL-WIDTH |
| 0xF900-0xFAFF | CJK Compatibility Ideographs block | FULL-WIDTH |
| 0xFF00-0xFFEF | Latin characters and half-width Katakana and Hangul | HALF-WIDTH AND FULL-WIDTH |
| 0x4E00-0x9FFF | CJK unified ideographs – Common and uncommon | FULL-WIDTH |
| 0x3400-0x4DBF | CJK unified ideographs Extension A – Rare | FULL-WIDTH |
| 0x20000-0x2A6DF | CJK unified ideographs Extension B – Very rare | FULL-WIDTH |
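The width classes in the table can be checked from Python's standard unicodedata module, which also provides the NFKC normalization that folds half- and full-width variants together. This is a minimal sketch rather than a complete width-handling routine. The helper name fold_width and the sample characters are hypothetical choices for illustration.

```python
import unicodedata


def fold_width(text):
    """Hypothetical helper: apply NFKC so that full-width Latin and
    half-width Katakana are replaced by their standard forms."""
    return unicodedata.normalize("NFKC", text)


# One sample character from several of the ranges in the table
samples = ["A", "Ａ", "ｶ", "カ", "漢", "한"]
for ch in samples:
    # east_asian_width returns codes such as Na (narrow), H (half-width),
    # F (full-width) and W (wide)
    width = unicodedata.east_asian_width(ch)
    print("U+%04X %s width=%s NFKC=%s" % (ord(ch), ch, width, fold_width(ch)))

# Full-width Latin letters and digits become plain ASCII
print(fold_width("Ａｂｃ１２３"))
# Half-width Katakana becomes ordinary Katakana
print(fold_width("ｶﾀｶﾅ"))
```

Note that NFKC also rewrites other compatibility characters (ligatures, circled numbers and so on), so it changes more than width alone.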
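In Java, NFKC normalization folds full-width Latin characters and half-width Katakana into their standard forms:
String normalized_string = java.text.Normalizer.normalize(unnormalized_string, java.text.Normalizer.Form.NFKC);

A Quick Introduction to CJK Writing Systems & Unicode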
Posted by James Sullivan in Chinese, Japanese, Korean on April 21, 2011
The Internationalization Activity Lead at the W3C has a great and, considering the topic, concise introduction to Chinese, Japanese and Korean writing systems & Unicode. Highly recommended.
Using Python with Chinese, Japanese and Korean
Posted by James Sullivan in Chinese, Japanese, Korean on March 30, 2011
>>> # -*- coding: utf-8 -*-
>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import pprint, re, sys

# Python 2 only: make UTF-8 the default encoding for implicit conversions
reload(sys)
sys.setdefaultencoding('utf-8')

def unicode_pp(obj):
    # Pretty-print obj, then turn \uXXXX escapes back into readable characters
    def unicode_unquoter(match):
        return unichr(int("0x" + match.group(1), 16))
    pp = pprint.PrettyPrinter(indent=4, width=120)
    text = pp.pformat(obj)
    return re.sub(r'\\u([0-9a-f]{4})', unicode_unquoter, text)

print(sys.getdefaultencoding())
print("%s%s" % (u"Python语言", sys.version_info))
print('')
data = {
    u"面向对象语言": {u"C++": u"C++语言", u"Java": u"爪哇语言", u"Python": u"Python语言"},
    u"脚本语言": {u"Shell": u"外壳脚本程序", u"VBScript": u"VB脚本语言", u"Python": u"Python语言"}
}
print(unicode_pp(data))
The second script is for Python 3.X.
#!/usr/bin/python3
import sys

print(sys.getdefaultencoding())
print("Python语言", sys.version_info)
print('')
data = {
    "面向对象语言": {"C++": "C++语言", "Java": "爪哇语言", "Python": "Python语言"},
    "脚本语言": {"Shell": "外壳脚本程序", "VBScript": "VB脚本语言", "Python": "Python语言"}
}
print(data)
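Python 3 strings are Unicode by default, which is why the Python 3 script needs no setdefaultencoding call. Encoding only matters at the point where text is turned into bytes. A minimal sketch of that boundary, assuming UTF-8 is the encoding you want; the sample string is taken from the script above.

```python
text = "Python语言"

# Encoding turns the Unicode string into bytes; each Chinese character
# takes three bytes in UTF-8, so the byte count exceeds the character count
encoded = text.encode("utf-8")
print(len(text), len(encoded))

# Decoding with the same encoding restores the original string
decoded = encoded.decode("utf-8")
print(decoded == text)
```

Naming the encoding explicitly at each such boundary avoids depending on the platform default.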
Natural language processing tools for Japanese
Posted by James Sullivan in Japanese on May 7, 2009
A list of useful natural language processing tools for Japanese packaged for Ubuntu and maintained by Eric Nichol: http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/lucid/japanese/. It doesn’t seem to have been updated since the last long-term Ubuntu release (10.04 Lucid Lynx). However, at least for MeCab, that is not really an issue, as the version in the Ubuntu Software Center repositories seems to work fine from version 10.10 (Maverick Meerkat) onward, which was not always the case in previous releases of Ubuntu.
