
kakasi


kakasi is a Rust library to transliterate hiragana, katakana and kanji (Japanese text) into rōmaji (Latin/Roman alphabet).

It was ported from the pykakasi library, which is itself a port of the original kakasi library written in C.

Usage

Transliterate:

let res = kakasi::convert("こんにちは世界!");
assert_eq!(res.hiragana, "こんにちはせかい!");
assert_eq!(res.romaji, "konnichiha sekai!");

Check if a string contains Japanese characters:

use kakasi::IsJapanese;

assert_eq!(kakasi::is_japanese("Abc"), IsJapanese::False);
assert_eq!(kakasi::is_japanese("日本"), IsJapanese::Maybe);
assert_eq!(kakasi::is_japanese("ラスト"), IsJapanese::True);
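The three-way result is useful because kanji alone cannot distinguish Japanese from Chinese text, while kana only occur in Japanese. The following is a minimal sketch of that idea using plain Unicode ranges. It is not the crate's implementation: `Guess`, `is_kana`, `is_kanji` and `guess_japanese` are hypothetical names, and the ranges cover only the basic hiragana, katakana and CJK Unified Ideographs blocks.

```rust
/// Three-way answer (hypothetical type, named after the variants shown above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Guess {
    False,
    Maybe,
    True,
}

/// Hiragana letters (U+3041..=U+3096) or katakana letters (U+30A1..=U+30FA).
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3041}'..='\u{3096}' | '\u{30A1}'..='\u{30FA}')
}

/// CJK Unified Ideographs, basic block only (assumption: extensions ignored).
fn is_kanji(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

/// Any kana means the text is Japanese; kanji without kana is ambiguous;
/// neither means no Japanese characters were found.
fn guess_japanese(text: &str) -> Guess {
    let mut saw_kanji = false;
    for c in text.chars() {
        if is_kana(c) {
            return Guess::True;
        }
        saw_kanji |= is_kanji(c);
    }
    if saw_kanji {
        Guess::Maybe
    } else {
        Guess::False
    }
}
```

Under these assumptions, `guess_japanese` gives the same answers as the three assertions above for "Abc", "日本" and "ラスト". The real function may treat punctuation, half-width kana or rarer ideographs differently.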

CLI

$ cargo install kakasi

## Convert to romaji
$ kakasi こんにちは世界!
konnichiha sekai!

## Convert to hiragana
$ kakasi -k こんにちは世界!
こんにちはせかい!

## Read from file
$ kakasi -f rust_article.txt

## Read from STDIN
$ echo "こんにちは世界!" | kakasi

Performance

CPU: AMD Ryzen 7 5700G

| Text | Conversion time | Speed |
| --- | --- | --- |
| Sentence (161 B) | 7.0911 µs | 22.70 MB/s |
| Rust Wikipedia article (31705 B) | 1.5055 ms | 21.06 MB/s |
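The speed column follows from the input size divided by the conversion time. As a quick check, here is a small sketch of that arithmetic, assuming 1 MB = 1,000,000 bytes; `throughput_mb_per_s` is a hypothetical helper and not part of the crate or its benchmarks.

```rust
use std::time::Duration;

/// Throughput in MB/s, assuming 1 MB = 1,000,000 bytes.
fn throughput_mb_per_s(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64() / 1_000_000.0
}
```

Called with 161 bytes and 7091 ns it gives roughly 22.70 MB/s, and with 31705 bytes and 1.5055 ms roughly 21.06 MB/s, which matches the table.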

CLI comparison

Time to convert a 100 KB test file using the CLI:

| Library | Time | Speed |
| --- | --- | --- |
| kakasi (Rust) | 7.4 ms | 13.5 MB/s |
| kakasi (C) | 33.5 ms | 2.99 MB/s |
| pykakasi (Python) | 810.6 ms | 0.123 MB/s |
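To read these timings as relative factors, divide each time by the Rust time. The sketch below does only that; `speedup` is a hypothetical helper, and the inputs are the measured values from the table.

```rust
/// How many times faster `candidate_ms` is than `baseline_ms`.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}
```

With the numbers above, `speedup(33.5, 7.4)` is about 4.5 and `speedup(810.6, 7.4)` is about 110, so in this measurement the Rust CLI was roughly 4.5 times faster than the C version and roughly 110 times faster than pykakasi.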

CLI performance was measured with hyperfine, using the following test commands:

hyperfine --warmup 3 'cat 100K.txt | kakasi-rs'
hyperfine --warmup 3 'cat 100K.txt | kakasi -i utf-8 -Ka -Ha -Ja -Sa -s'
hyperfine --warmup 3 'cat 100K.txt | python bin/kakasi -Ka -Ha -Ja -Sa -s'

License

kakasi is published under the GNU GPL-3.0 license.

The Kakasi dictionaries (Files: codegen/dict/kakasidict.utf8, codegen/dict/itajidict.utf8, codegen/dict/hepburn.utf8) were taken from the pykakasi project, published under the GNU GPL-3.0 license.

pykakasi

Copyright (C) 2010-2021 Hiroshi Miura and contributors (see AUTHORS)

The dictionaries originate from the kakasi project, published under the GNU GPL-2.0 license.

original kakasi

Copyright (C) 1992 1993 1994
Hironobu Takahashi (takahasi@tiny.or.jp),
Masahiko Sato (masahiko@sato.riec.tohoku.ac.jp),
Yukiyoshi Kameyama, Miki Inooka, Akihiko Sasaki, Dai Ando, Junichi Okukawa,
Katsushi Sato and Nobuhiro Yamagishi

For testing, I included a copy of the Japanese Wikipedia article on Rust (tests/rust_article.txt). The article is published under the Creative Commons Attribution-ShareAlike License 3.0.