Unverified commit 538bf271, authored by: H Hui Zhang, committed by: GitHub

Chinese char/word ngram LM (#613)

* add ngram lm egs

* add zhon repo

* install kenlm, zhon

* format

* add chinese_text_normalization repo

* add ngram lm egs
Parent 2bdf4c94
.DS_Store
*.pyc
tools/venv
.vscode
*.log
*.pdmodel
@@ -10,3 +9,6 @@ tools/venv
*.tar.gz
.ipynb_checkpoints
*.npz
tools/venv
tools/kenlm
@@ -52,4 +52,4 @@ DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
We depend on many open-source repos. See [References](doc/src/reference.md) for more information.
\ No newline at end of file
We depend on many open-source repos. See [References](doc/src/reference.md) for more information.
@@ -50,4 +50,4 @@ DeepSpeech is released under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
The development drew on a number of excellent repositories; see [References](doc/src/reference.md) for details.
\ No newline at end of file
The development drew on a number of excellent repositories; see [References](doc/src/reference.md) for details.
text_correct.txt: https://github.com/shibing624/pycorrector/raw/master/tests/test_file.txt
custom_confusion.txt: https://github.com/shibing624/pycorrector/raw/master/tests/custom_confusion.txt
This diff has been collapsed.
少先队员因该为老人让坐
祛痘印可以吗?有效果吗?
不知这款牛奶口感怎样? 小孩子喝行吗!
是转基因油?
我家宝宝13斤用多大码的
会起坨吗?
请问给送上楼吗?
亲是送赁上门吗
送货时候有外包装没有还是直接发货过来
会不会有坏的?
这个米煮粥好还煮饭好吃
有送的马克杯吗?
这纸尿裤分男孩女孩使用吗
买的路由器老是断网,拔了跳过路由器就可以用了
能泡开不?辣度几
请问这个米蒸出来是一粒一粒的还是一坨一坨的?
水和其他商品一样送货上门,还是自提呀?
快两个月的孩子 要穿什么码的
买回来会不会过期?
洗的还干净把吧
路由器怎么样啊,掉线严重吗?
你好这米是五斤还是十斤
收安费不
给送开果器吗
这纸好用吗?我看有不少的差评
自用好用吗
请问袜子穿久了会往下掉吗?
每一卷是独立包装的吗?
这个火龙果口味怎么样?甜不甜?
买这个送红杯吗?
一袋子多少斤
这款拉拉裤有味道吗?超市买的没有味道,不知道这个怎么样
我想问下拉拉裤上面那个贴的用来干嘛的,怎么用
这里边有没有枣核
玫瑰和薰衣草哪个好闻
这个冰糖质量怎么样,有杂质吗
倒水的时候漏吗
请问大家,这个水壶烧出来的水有异味吗?因为给宝宝用所以很在意,谢谢大家
这米煮出来糯吗?
这在款子好用吗?有香味吗?
到底是棉花的材质还是化纤的无纺布啊 求问?
我用360手机能充电几次
亲这纸好用吗?值得买吗?
24瓶?还是12瓶
是否是真的纸?
适用机洗吗?
好吃不好吃啊
真的好用吗?我也想买 
你们拿到是什么版本的
这水和超市一样吗?质量保证吗?
可以丢进马桶冲吗?
纸会不会粗?
这个翠的还不是不催的呀。。没有吃的那种不脆
这个好用吗
这纸有香味的吗?
是最近的生产日期吗
赠品是什么呀
这是两瓶还是一瓶的价格?
请问这是硬壳还是软壳?
亲,苹果收到后有坏的吗?
适合两人用吗
这个直接喝好不好喝 还是要热一下
纸有木有刺鼻气味?
酸不酸???
这啤好渴吗?
跟安慕希哪个比较好喝?
好用么,主要是带宝宝出去玩的时候用的多?
刚出生的宝宝用什么码?
能当洗手液吗?
是不是很小包的那一种?50块有24包便宜的有点不敢相信
好用吗,会不会起会不会起坨?
这个口可以直接放饮水机上用吗?
这种纸掉粉末吗
手机好用吗?会卡吗
开盖里面是拉环的吗?
这个电池真的需要一直换吗?
好用吗?是不是正品?
请问有尿显吗
容易发烫吗
苹果有腊吗
这油有这么好吗?不是过期的吧
这个夏天用会不会红屁股?透气性好吗
你好。 我想问下这个是尿不湿吗 ?
这奶为啥这么便宜?
你们买的酱油会没有颜色吗,像水一样,看着都没胃口
这个是机诜,还是手洗
这个卫生巾带香味吗?
这种洗发水好用吗
有餡嗎?好不好吃
纸质不会好差吗?
亲们,此米是真空包装吗?
是软毛的吗?!!
请问大家德运牌子的好喝还是安佳的?
这纸好用吗,薄嘛
这壶保温吗
这个威露士货到了就是跟图片上的一样吗?只要是图片上显示的都有吗?
你们买的牛奶是最近日期吗
这个除菌液,是单独放在滚筒洗衣机除菌液格,还是与洗衣液混合放在洗衣液格?
请问你们的三只松鼠寄回来的时候是用袋子装着的吗
1kg是不是两斤?
洗衣皂怎么样啊,味道重吗,用之后好不好清洗啊。
我要请问你这个是不是那个拉拉裤吗?这个花纹是不是拉拉裤?
好多人都说小米运动升级后手环就连不上了,你们有没有这种情况?
这部手机运行速度快不快?
新生儿可以用吗 抽一张会带出来很多张吗
洗后有香味吗
体验装有多少片
银装怎么样?会漏尿吗?你们都是多久换一次的??(我家大概2-3个小时左右,宝宝醒一回换一次)
声音大吗?好用不?
抽纸有味吗
苹果好吃吗?打过蜡吗?是不是坏的很多?
70g和80g得区别是啥?
袋装的和瓶装的洗衣液是一样的么?
噪音很大吗
烧出来的水会不会很多一块一块的东西
这个吹风真心好用吗?我今晚下单什么时候到
请问各位宝妈 这个乳垫的背胶粘吗
M号的你们给宝宝用到多大啊?几个月?我家宝宝3个月5㎏重,用花王的M号觉得小了。不知道这个怎么样?
这个喝了能找到女朋友吗
这袜子耐不耐穿
请问好用么 是正品么
怎么储藏 我买了两天在常温阴凉处放着下层有些化了 需要放冰箱冷冻吗
这批苏打水是否有股消毒水的味道?
质量怎么样,看到那么多差评,我不敢买了。
会不会有烂的
为什么我买的用完之后没香味
甜吗????
我看到评论里的差评说大米里有虫,是真的吗?
要放冰箱冷藏吗
好不好吃啊
这油怎么样 炒菜香不香
这纸擦手时有屑吗?
是正品的吗?
好用吗
这个特浓的苦不苦
这个好用吗?
米里真的有虫吗
是金装的吗?
双内胆有什么区别,两个一样的吗?
请问这款水可以降尿酸吗?
好用吗这个
购物袋结实吗,能放重东西吗
你好,请问这款可以剃头发刮光头吗
这个纸巾质量如何?好用吗?
好用吗?小孩子喜欢吗?
亲。煮面时会糊锅不
包邮吗运费多少
会一抽就两三张一起抽起来吗?
一箱几桶油呀
这个吹风机分冷风和热风吗
发什么快递呢
请问一下,有些枸杞说是不要洗,你们的是否建议洗呢?
请问纸有异味吗?我以前买过一箱就是这个居然有异味。
这是6个么 怎么觉得有好多
我买的荣耀10横滑home键进入后台这个操作成功率特别低,你们也是这样吗?
你们的有塑料味吗,机械的
小米路由器真心说的有这么差吗
请问大家这款刮的干净吗?谢谢
会有塑料味吗
质量真的很差吗?不敢买
这纸有气味吗
我买两箱怎么要运费
这个标准果好吃吗,酸不酸
稀吗?是不是有种兑了水的感觉?
威露士和滴露的消毒液哪个更好用呢?
曰期是几月份的
手机容易折弯吗?
我家宝宝25斤XL会紧吗?
这款200克一箱的纸张和10卷手提的价格相差那么多 质量一样吗?
豆浆可以打吗
电量有百分比吗
用快递送过来瓶子会不会打破
是三相电吗,有空调摇控器吧
拿它送人,有问题吗??
安幕希好喝吗?
这款纸尿裤好用吗?和尤妮佳比较哪个好用些?
2层厚吗?是不是一到水就烂了
为什么我宝宝拉粑粑后面总是漏出来我已经贴的很牢了,10斤的宝宝用S号也不小啊你们用了没这种情况吗?
这个产品好用吗?
刷毛柔软度咋样,这么便宜,会不会是很小个的
会不会有过敏的情况呀
请问是辣条吗
这种米只能煮粥不能煮饭吗
可以开袋即食吗?
这米好吃吗?
这个充电宝充满电需要多久
这个奶开了可以保质喝两天吗
这种薰衣草的洗衣液怎么样
你们的小米六边框掉漆了吗???
这个是机洗用还是手洗用的啊
厚度怎么样、起球吗感谢大哥大姐们
这个好喝还是康师傅红茶好喝
这种洁面膏会不会过敏,我上次用的火山岩冰感洁面啫喱对那种过敏,但听别人说那种稀的本来就特别容易过敏,不知道这种洁面膏会不会过敏!
这杯那么多差评,是真的吗,吓得我都不敢买了
枣是免洗的吗?
这个尿不湿尿过会起坨吗
感觉和苏菲比哪个更好用呢?
煮出来的饭香吗?
你好!请问这个水壶烧水开了是自动切电吗?
这个跟 原木纯品 那个啥区别?不是原木纸浆做的?
能放冰箱吗
纸有味道吗?
2016全国高考卷答题模板
2016全国大考卷答题模板
2016全国低考卷答题模板
床前明月光,疑是地上霜
床前星星光,疑是地上霜
床前白月光,疑是地上霜
落霞与孤鹜齐飞,秋水共长天一色
落霞与孤鹜齐跑,秋水共长天一色
落霞与孤鹜双飞,秋水共长天一色
众里寻他千百度,蓦然回首,那人却在,灯火阑珊处
众里寻她千百度,蓦然回首,那人却在,灯火阑珊处
众里寻ta千百度,蓦然回首,那人却在,灯火阑珊处
吸烟的人容*得癌症
就只听着我*妈所说的话,
就接受环境污*用化肥和农药,
是或者接受环境污染用化肥和农药,
现在的香港比从前的*荣很多。
现在的香港比*前的饭荣很多。
#!/bin/bash
set -e
stage=0
stop_stage=100
order=5
mem=80%
prune=0
a=22
q=8
b=8
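# a, q, b match KenLM build_binary's trie options (-a max array bits,
# -q probability quantization bits, -b backoff quantization bits);
# parse_options.sh below accepts them as flags, though this script does
# not yet forward them to ngram_train.sh.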
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
if [ $# != 3 ]; then
echo "$0 token_type exp/text exp/text.arpa"
echo $@
exit 1
fi
# char or word
type=$1
text=$2
arpa=$3
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
# text tn & wordseg preprocess
echo "process text."
python3 ${MAIN_ROOT}/utils/zh_tn.py ${type} ${text} ${text}.${type}.tn
fi
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
# train ngram lm
echo "build lm."
bash ${MAIN_ROOT}/utils/ngram_train.sh --order ${order} --mem ${mem} --prune "${prune}" ${text}.${type}.tn ${arpa}
fi
\ No newline at end of file
#! /usr/bin/env bash
. ${MAIN_ROOT}/utils/utility.sh
DIR=data/lm
mkdir -p ${DIR}
URL='https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm'
MD5="29e02312deb2e59b3c8686c7966d4fe3"
TARGET=${DIR}/zh_giga.no_cna_cmn.prune01244.klm
echo "Download language model ..."
download $URL $MD5 $TARGET
if [ $? -ne 0 ]; then
echo "Fail to download the language model!"
exit 1
fi
exit 0
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import time
import jieba
import kenlm
language_model_path = sys.argv[1]
assert os.path.exists(language_model_path)
start = time.time()
model = kenlm.Model(language_model_path)
print(f"load kenLM cost: {time.time() - start}s")
sentence = '盘点不怕被税的海淘网站❗️海淘向来便宜又保真!'
sentence_char_split = ' '.join(list(sentence))
sentence_word_split = ' '.join(jieba.lcut(sentence))
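# Two tokenizations of the same sentence: sentence_char_split has one token
# per character ('盘 点 不 怕 ...'), while sentence_word_split uses jieba
# word tokens ('盘点 不怕 ...'); KenLM scores whitespace-separated tokens.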
def test_score():
print('Loaded language model: %s' % language_model_path)
print(sentence)
print(model.score(sentence))
print(list(model.full_scores(sentence)))
for i, v in enumerate(model.full_scores(sentence)):
print(i, v)
print(sentence_char_split)
print(model.score(sentence_char_split))
print(list(model.full_scores(sentence_char_split)))
split_size = 0
for i, v in enumerate(model.full_scores(sentence_char_split)):
print(i, v)
split_size += 1
assert split_size == len(
sentence_char_split.split()) + 1, "error split size."
print(sentence_word_split)
print(model.score(sentence_word_split))
print(list(model.full_scores(sentence_word_split)))
for i, v in enumerate(model.full_scores(sentence_word_split)):
print(i, v)
def test_full_scores_chars():
print('Loaded language model: %s' % language_model_path)
print(sentence_char_split)
# Show scores and n-gram matches
words = ['<s>'] + list(sentence) + ['</s>']
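# full_scores yields one (logprob, ngram_length, oov) tuple per token plus
# one for </s>; words[i + 2 - length:i + 2] recovers the matched n-gram,
# and words[i + 1] is the token being scored.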
for i, (prob, length,
oov) in enumerate(model.full_scores(sentence_char_split)):
print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i + 2 - length:
i + 2])))
if oov:
print('\t"{0}" is an OOV'.format(words[i + 1]))
print("-" * 42)
# Find out-of-vocabulary words
oov = []
for w in words:
if w not in model:
print('"{0}" is an OOV'.format(w))
oov.append(w)
assert oov == ["❗", "️", "!"], 'error oov'
def test_full_scores_words():
print('Loaded language model: %s' % language_model_path)
print(sentence_word_split)
# Show scores and n-gram matches
words = ['<s>'] + sentence_word_split.split() + ['</s>']
for i, (prob, length,
oov) in enumerate(model.full_scores(sentence_word_split)):
print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i + 2 - length:
i + 2])))
if oov:
print('\t"{0}" is an OOV'.format(words[i + 1]))
print("-" * 42)
# Find out-of-vocabulary words
oov = []
for w in words:
if w not in model:
print('"{0}" is an OOV'.format(w))
oov.append(w)
# zh_giga.no_cna_cmn.prune01244.klm is a Chinese character LM
assert oov == ["盘点", "不怕", "网站", "❗", "️", "海淘", "向来", "便宜", "保真",
"!"], 'error oov'
def test_full_scores_chars_length():
"""test bos eos size"""
print('Loaded language model: %s' % language_model_path)
r = list(model.full_scores(sentence_char_split))
n = list(model.full_scores(sentence_char_split, bos=False, eos=False))
print(r)
print(n)
assert len(r) == len(n) + 1
# bos=False, eos=False, input len == output len
print(len(n), len(sentence_char_split.split()))
assert len(n) == len(sentence_char_split.split())
k = list(model.full_scores(sentence_char_split, bos=False, eos=True))
print(k, len(k))
def test_ppl_sentence():
"""测试句子粒度的ppl得分"""
sentence_char_split1 = ' '.join('先救挨饿的人,然后治疗病人。')
sentence_char_split2 = ' '.join('先就挨饿的人,然后治疗病人。')
n = model.perplexity(sentence_char_split1)
print('1', n)
n = model.perplexity(sentence_char_split2)
print(n)
part_char_split1 = ' '.join('先救挨饿的人')
part_char_split2 = ' '.join('先就挨饿的人')
n = model.perplexity(part_char_split1)
print('2', n)
n = model.perplexity(part_char_split2)
print(n)
part_char_split1 = '先救挨'
part_char_split2 = '先就挨'
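# Without spaces, each whole string is a single (OOV) token to the model,
# so both variants get exactly the same perplexity.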
n1 = model.perplexity(part_char_split1)
print('3', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
assert n1 == n2
part_char_split1 = '先 救 挨'
part_char_split2 = '先 就 挨'
n1 = model.perplexity(part_char_split1)
print('4', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人'
part_char_split2 = '先 就 挨 饿 的 人'
n1 = model.perplexity(part_char_split1)
print('5', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 ,'
part_char_split2 = '先 就 挨 饿 的 人 ,'
n1 = model.perplexity(part_char_split1)
print('6', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 , 然 后 治 疗 病 人'
part_char_split2 = '先 就 挨 饿 的 人 , 然 后 治 疗 病 人'
n1 = model.perplexity(part_char_split1)
print('7', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 , 然 后 治 疗 病 人 。'
part_char_split2 = '先 就 挨 饿 的 人 , 然 后 治 疗 病 人 。'
n1 = model.perplexity(part_char_split1)
print('8', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
if __name__ == '__main__':
test_score()
test_full_scores_chars()
test_full_scores_words()
test_full_scores_chars_length()
test_ppl_sentence()
export MAIN_ROOT=${PWD}/../../
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/local/lib/:${LD_LIBRARY_PATH}
\ No newline at end of file
jieba>=0.39
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
stage=0
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit -1
python3 -c 'import kenlm;' || { echo "kenlm package not installed!"; exit -1; }
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
# case 1, test kenlm
# download language model
bash local/download_lm_zh.sh
if [ $? -ne 0 ]; then
exit 1
fi
# test kenlm `score` and `full_scores`
python local/kenlm_score_test.py data/lm/zh_giga.no_cna_cmn.prune01244.klm
fi
mkdir -p exp
cp data/text_correct.txt exp/text
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
# case 2, Chinese character ngram LM build
# output: xxx.arpa xxx.kenlm.bin
input=exp/text
token_type=char
lang=zh
order=5
prune="0 1 2 4 4"
a=22
q=8
b=8
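# ${prune// /_} replaces the spaces in ${prune} with underscores, so
# prune="0 1 2 4 4" becomes p0_1_2_4_4 in the output file name.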
output=${input}_${lang}_${token_type}_o${order}_p${prune// /_}_a${a}_q${q}_b${b}.arpa
echo "build ${token_type} lm."
bash local/build_zh_lm.sh --order ${order} --prune "${prune}" --a ${a} --q ${q} --b ${b} ${token_type} ${input} ${output}
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
# case 3, Chinese word ngram LM build
# output: xxx.arpa xxx.kenlm.bin
input=exp/text
token_type=word
lang=zh
order=3
prune="0 0 0"
a=22
q=8
b=8
output=${input}_${lang}_${token_type}_o${order}_p${prune// /_}_a${a}_q${q}_b${b}.arpa
echo "build ${token_type} lm."
bash local/build_zh_lm.sh --order ${order} --prune "${prune}" --a ${a} --q ${q} --b ${b} ${token_type} ${input} ${output}
fi
@@ -57,11 +57,11 @@ if [ $? != 0 ]; then
fi
# install kaldi-compatible feature
pushd third_party/python_kaldi_features/
python setup.py install
# install third_party
pushd third_party
bash install.sh
if [ $? != 0 ]; then
error_msg "Please check why kaldi feature install error!"
error_msg "Please check why third_party install error!"
exit -1
fi
popd
......
* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926
licence: MIT
* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
licence: MIT
* [zhon](https://github.com/tsroten/zhon)
commit: 09bf543696277f71de502506984661a60d24494c
licence: MIT
* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
licence: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
licence: MIT
MIT License
Copyright (c) 2020 SpeechIO
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Chinese Text Normalization for Speech Processing
## Problem
Search for "Text Normalization" (TN) on Google and GitHub, and you can hardly find open-source projects that are "ready-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that *support* TN functionality. There is quite some work between "supporting text normalization" and "doing text normalization".
## Reason
* TN is language-dependent, more or less.
Some TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatment.
* TN is task-specific.
Even for the same language, different applications require quite different TN.
* TN is "dirty"
Constructing and maintaining a set of TN rewrite-rules is painful, whatever toolkits and frameworks you choose. Subtle and intrinsic complexities hide inside TN task itself, not in tools or frameworks.
* A mature TN module is an asset.
Since constructing and maintaining TN is hard, a good TN module is actually an asset for commercial companies, so you are unlikely to find a product-level TN module in the open-source community (correct me if you find one).
* TN is a less important topic in both academia and industry.
## Goal
This project sets up a ready-to-use TN module for **Chinese**. Since my background is **speech processing**, this project should be able to handle most common TN tasks in **Chinese ASR** text processing pipelines.
## Normalizers
1. Supported NSW (Non-Standard Word) normalization
|NSW type|raw|normalized|
|-|-|-|
|cardinal|这块黄金重达324.75克|这块黄金重达三百二十四点七五克|
|date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日|
|digit|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
|fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
|money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五 三十四点五元 二十点一万|
|percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨|
|telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|
Acknowledgement: the NSW normalization code is based on [Zhiyang Zhou's work here](https://github.com/Joee1995/chn_text_norm.git). (A minimal Python sketch of the digit and punctuation normalizers follows this list.)
1. Punctuation removal
For Chinese, it removes the punctuation collected in the [Zhon](https://github.com/tsroten/zhon) project, consisting of
* non-stop puncs
```
'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
```
* stop puncs
```
'!?。。'
```
For English, it removes Python's `string.punctuation`.
1. Multilingual English word upper/lower case conversion
Since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt to the lexicon accordingly.
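As a rough illustration of the digit and punctuation normalizers above (a sketch only; the project's actual implementation lives in `cn_tn.py` and covers all the NSW types in the table), the snippet below verbalizes digit sequences character by character and strips the punctuation sets quoted above:

```python
# Sketch of two of the normalizers described above; not the project's
# actual implementation.
DIGIT_NAMES = dict(zip('0123456789', '零一二三四五六七八九'))

# Punctuation sets quoted above (collected from the Zhon project).
NON_STOP_PUNCS = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏"
STOP_PUNCS = '!?。。'

def verbalize_digits(text):
    """Read digits one by one, e.g. '编号27149' -> '编号二七一四九'."""
    return ''.join(DIGIT_NAMES.get(ch, ch) for ch in text)

def remove_punctuation(text):
    return text.translate({ord(ch): None for ch in NON_STOP_PUNCS + STOP_PUNCS})

print(verbalize_digits('电影中梁朝伟扮演的陈永仁的编号27149'))
print(remove_punctuation('今天早饭吃了没?'))
```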
## Supported text format
1. Plain text, preferably one sentence per line (the most common case in ASR processing).
```
今天早饭吃了没
没吃回家吃去吧
...
```
Plain text is the default format.
2. Kaldi's transcription format
```
KALDI_KEY_UTT001 今天早饭吃了没
KALDI_KEY_UTT002 没吃回家吃去吧
...
```
TN will skip the first-column key and normalize the transcription text that follows.
Pass the `--has_key` option to switch to the Kaldi format.
_note: All input text should be UTF-8 encoded._
## Run examples
* TN (python)
Make sure you have **python3**; Python 2.X won't work correctly.
Run `sh run.sh` in the `TN` dir, and compare the raw and normalized text.
* ITN (thrax)
Make sure you have **thrax** installed and that your PATH can find the thrax binaries.
Run `sh run.sh` in the `ITN` dir; check the Makefile for grammar dependencies.
## Possible future work
Since TN is a typical "done is better than perfect" module in the context of ASR, and the current state is sufficient for my purpose, I probably won't update this repo frequently.
There are indeed some things that need to be improved:
* For TN, the NSW normalizers in the TN dir are based on regular expressions. I've found some unintended matches; those pattern regexps need to be refined for more precise TN coverage.
* For ITN, extend those thrax rewriting grammars to cover more scenarios.
* Furthermore, commercial systems nowadays are starting to introduce RNN-like models into TN, and a mixed (rule-based & model-based) system is state-of-the-art. For more reading on this, look for Richard Sproat and Kyle Gorman's work at Google.
This diff has been collapsed.
UTT000 这块黄金重达324.75克
UTT001 她出生于86年8月18日,她弟弟出生于1995年3月1日
UTT002 电影中梁朝伟扮演的陈永仁的编号27149
UTT003 现场有7/12的观众投出了赞成票
UTT004 随便来几个价格12块5,34.5元,20.1万
UTT005 明天有62%的概率降雨
UTT006 这是固话0421-33441122或这是手机+86 18544139121
这块黄金重达324.75克
她出生于86年8月18日,她弟弟出生于1995年3月1日
电影中梁朝伟扮演的陈永仁的编号27149
现场有7/12的观众投出了赞成票
随便来几个价格12块5,34.5元,20.1万
明天有62%的概率降雨
这是固话0421-33441122或这是手机+86 18544139121
# for plain text
python3 cn_tn.py example_plain.txt output_plain.txt
diff example_plain.txt output_plain.txt
# for Kaldi's trans format
python3 cn_tn.py --has_key example_kaldi.txt output_kaldi.txt
diff example_kaldi.txt output_kaldi.txt
0. Place install_thrax.sh into $KALDI_ROOT/tools/extras/
1. Recompile OpenFst with the "--enable-grm" option required to support thrax:
* cd $KALDI_ROOT/tools
* make clean
* edit $KALDI_ROOT/tools/Makefile, append "--enable-grm" option to OPENFST_CONFIGURE:
OPENFST_CONFIGURE ?= --enable-static --enable-shared --enable-far --enable-ngram-fsts --enable-lookahead-fsts --with-pic --enable-grm
* make -j 10
2. install thrax
cd $KALDI_ROOT/tools
sh extras/install_thrax.sh
3. add thrax binary path into $KALDI_ROOT/tools/env.sh:
export PATH=/path/to/your/kaldi_root/tools/thrax-1.2.9/src/bin:${PATH}
usage:
before you run anything related to thrax, use:
. $KALDI_ROOT/tools/env.sh
so the thrax binaries can be found, as we always do in kaldi.
sample usage:
sh run_en.sh
sh run_cn.sh
#!/bin/bash
## This script should be placed under $KALDI_ROOT/tools/extras/; see INSTALL.txt for the installation guide
if [ ! -f thrax-1.2.9.tar.gz ]; then
wget http://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.2.9.tar.gz
tar -zxf thrax-1.2.9.tar.gz
fi
cd thrax-1.2.9
OPENFSTPREFIX=`pwd`/../openfst
LDFLAGS="-L${OPENFSTPREFIX}/lib" CXXFLAGS="-I${OPENFSTPREFIX}/include" ./configure --prefix ${OPENFSTPREFIX}
make -j 10; make install
cd ..
cd src/cn
thraxmakedep itn.grm
make
#thraxrewrite-tester --far=itn.far --rules=ITN
cat ../../testcase_cn.txt | thraxrewrite-tester --far=itn.far --rules=ITN
cd -
cd src
thraxmakedep en/verbalizer/podspeech.grm
make
cat ../testcase_en.txt
cat ../testcase_en.txt | thraxrewrite-tester --far=en/verbalizer/podspeech.far --rules=POD_SPEECH_TN
cd -
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
en/verbalizer/podspeech.far: en/verbalizer/podspeech.grm util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far
thraxcompiler --input_grammar=$< --output_far=$@
util/util.far: util/util.grm util/byte.far util/case.far
thraxcompiler --input_grammar=$< --output_far=$@
util/byte.far: util/byte.grm
thraxcompiler --input_grammar=$< --output_far=$@
util/case.far: util/case.grm util/byte.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/extra_numbers.far: en/verbalizer/extra_numbers.grm util/byte.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/numbers.far: en/verbalizer/numbers.grm en/verbalizer/number_names.far util/byte.far universal/thousands_punct.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/number_names.far: en/verbalizer/number_names.grm util/arithmetic.far en/verbalizer/g.fst en/verbalizer/cardinals.tsv en/verbalizer/ordinals.tsv
thraxcompiler --input_grammar=$< --output_far=$@
util/arithmetic.far: util/arithmetic.grm util/byte.far util/germanic.tsv
thraxcompiler --input_grammar=$< --output_far=$@
universal/thousands_punct.far: universal/thousands_punct.grm util/byte.far util/util.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/float.far: en/verbalizer/float.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/factorization.far: en/verbalizer/factorization.grm util/byte.far util/util.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/lexical_map.far: en/verbalizer/lexical_map.grm util/byte.far en/verbalizer/lexical_map.tsv
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/math.far: en/verbalizer/math.grm en/verbalizer/float.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/miscellaneous.far: en/verbalizer/miscellaneous.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/extra_numbers.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/spelled.far
thraxcompiler --input_grammar=$< --output_far=$@
ru/classifier/cyrillic.far: ru/classifier/cyrillic.grm
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/spelled.far: en/verbalizer/spelled.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/money.far: en/verbalizer/money.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/money.tsv
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/numbers_plus.far: en/verbalizer/numbers_plus.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/spoken_punct.far: en/verbalizer/spoken_punct.grm en/verbalizer/lexical_map.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/time.far: en/verbalizer/time.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/urls.far: en/verbalizer/urls.grm util/byte.far en/verbalizer/lexical_map.far
thraxcompiler --input_grammar=$< --output_far=$@
clean:
rm -f util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far util/byte.far en/verbalizer/number_names.far universal/thousands_punct.far util/arithmetic.far en/verbalizer/factorization.far en/verbalizer/lexical_map.far ru/classifier/cyrillic.far
# Text normalization covering grammars
This repository provides covering grammars for English and Russian text normalization as
documented in:
Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization.
_Transactions of the Association for Computational Linguistics_ 4: 507-519.
Ng, A. H., Gorman, K., and Sproat, R. 2017. Minimally supervised
written-to-spoken text normalization. In _ASRU_, pages 665-670.
If you use these grammars in a publication, we would appreciate it if you cite these works.
## Building
The grammars are written in [Thrax](http://thrax.opengrm.org) and compile into [OpenFst](http://www.openfst.org) FAR (FstARchive) files. To compile, simply run `make` in the `src/` directory.
## License
See `LICENSE`.
## Mandatory disclaimer
This is not an official Google product.
itn.far: itn.grm byte.far number.far hotfix.far percentage.far date.far amount.far
thraxcompiler --input_grammar=$< --output_far=$@
byte.far: byte.grm
thraxcompiler --input_grammar=$< --output_far=$@
number.far: number.grm byte.far
thraxcompiler --input_grammar=$< --output_far=$@
hotfix.far: hotfix.grm byte.far hotfix.list
thraxcompiler --input_grammar=$< --output_far=$@
percentage.far: percentage.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
date.far: date.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
amount.far: amount.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
clean:
rm -f byte.far number.far hotfix.far percentage.far date.far amount.far
import 'byte.grm' as b;
import 'number.grm' as n;
unit = (
"匹"|"张"|"座"|"回"|"场"|"尾"|"条"|"个"|"首"|"阙"|"阵"|"网"|"炮"|
"顶"|"丘"|"棵"|"只"|"支"|"袭"|"辆"|"挑"|"担"|"颗"|"壳"|"窠"|"曲"|
"墙"|"群"|"腔"|"砣"|"座"|"客"|"贯"|"扎"|"捆"|"刀"|"令"|"打"|"手"|
"罗"|"坡"|"山"|"岭"|"江"|"溪"|"钟"|"队"|"单"|"双"|"对"|"出"|"口"|
"头"|"脚"|"板"|"跳"|"枝"|"件"|"贴"|"针"|"线"|"管"|"名"|"位"|"身"|
"堂"|"课"|"本"|"页"|"家"|"户"|"层"|"丝"|"毫"|"厘"|"分"|"钱"|"两"|
"斤"|"担"|"铢"|"石"|"钧"|"锱"|"忽"|"毫"|"厘"|"分"|"寸"|"尺"|"丈"|
"里"|"寻"|"常"|"铺"|"程"|"撮"|"勺"|"合"|"升"|"斗"|"石"|"盘"|"碗"|
"碟"|"叠"|"桶"|"笼"|"盆"|"盒"|"杯"|"钟"|"斛"|"锅"|"簋"|"篮"|"盘"|
"桶"|"罐"|"瓶"|"壶"|"卮"|"盏"|"箩"|"箱"|"煲"|"啖"|"袋"|"钵"|"年"|
"月"|"日"|"季"|"刻"|"时"|"周"|"天"|"秒"|"分"|"旬"|"纪"|"岁"|"世"|
"更"|"夜"|"春"|"夏"|"秋"|"冬"|"代"|"伏"|"辈"|"丸"|"泡"|"粒"|"颗"|
"幢"|"堆"|"条"|"根"|"支"|"道"|"面"|"片"|"张"|"颗"|"块"|
(("千克":"kg")|("毫克":"mg")|("微克":"µg"))|
(("千米":"km")|("厘米":"cm")|("毫米":"mm")|("微米":"µm")|("纳米":"nm"))
);
amount = n.number unit;
export AMOUNT = CDRewrite[amount, "", "", b.kBytes*];
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Copyright 2005-2011 Google, Inc.
# Author: ttai@google.com (Terry Tai)
# Standard constants for ASCII (byte) based strings. This mirrors the
# functions provided by C/C++'s ctype.h library.
# Note that [0] is missing. Matching the string-termination character is kinda weird.
export kBytes = Optimize[
"[1]" | "[2]" | "[3]" | "[4]" | "[5]" | "[6]" | "[7]" | "[8]" | "[9]" | "[10]" |
"[11]" | "[12]" | "[13]" | "[14]" | "[15]" | "[16]" | "[17]" | "[18]" | "[19]" | "[20]" |
"[21]" | "[22]" | "[23]" | "[24]" | "[25]" | "[26]" | "[27]" | "[28]" | "[29]" | "[30]" |
"[31]" | "[32]" | "[33]" | "[34]" | "[35]" | "[36]" | "[37]" | "[38]" | "[39]" | "[40]" |
"[41]" | "[42]" | "[43]" | "[44]" | "[45]" | "[46]" | "[47]" | "[48]" | "[49]" | "[50]" |
"[51]" | "[52]" | "[53]" | "[54]" | "[55]" | "[56]" | "[57]" | "[58]" | "[59]" | "[60]" |
"[61]" | "[62]" | "[63]" | "[64]" | "[65]" | "[66]" | "[67]" | "[68]" | "[69]" | "[70]" |
"[71]" | "[72]" | "[73]" | "[74]" | "[75]" | "[76]" | "[77]" | "[78]" | "[79]" | "[80]" |
"[81]" | "[82]" | "[83]" | "[84]" | "[85]" | "[86]" | "[87]" | "[88]" | "[89]" | "[90]" |
"[91]" | "[92]" | "[93]" | "[94]" | "[95]" | "[96]" | "[97]" | "[98]" | "[99]" | "[100]" |
"[101]" | "[102]" | "[103]" | "[104]" | "[105]" | "[106]" | "[107]" | "[108]" | "[109]" | "[110]" |
"[111]" | "[112]" | "[113]" | "[114]" | "[115]" | "[116]" | "[117]" | "[118]" | "[119]" | "[120]" |
"[121]" | "[122]" | "[123]" | "[124]" | "[125]" | "[126]" | "[127]" | "[128]" | "[129]" | "[130]" |
"[131]" | "[132]" | "[133]" | "[134]" | "[135]" | "[136]" | "[137]" | "[138]" | "[139]" | "[140]" |
"[141]" | "[142]" | "[143]" | "[144]" | "[145]" | "[146]" | "[147]" | "[148]" | "[149]" | "[150]" |
"[151]" | "[152]" | "[153]" | "[154]" | "[155]" | "[156]" | "[157]" | "[158]" | "[159]" | "[160]" |
"[161]" | "[162]" | "[163]" | "[164]" | "[165]" | "[166]" | "[167]" | "[168]" | "[169]" | "[170]" |
"[171]" | "[172]" | "[173]" | "[174]" | "[175]" | "[176]" | "[177]" | "[178]" | "[179]" | "[180]" |
"[181]" | "[182]" | "[183]" | "[184]" | "[185]" | "[186]" | "[187]" | "[188]" | "[189]" | "[190]" |
"[191]" | "[192]" | "[193]" | "[194]" | "[195]" | "[196]" | "[197]" | "[198]" | "[199]" | "[200]" |
"[201]" | "[202]" | "[203]" | "[204]" | "[205]" | "[206]" | "[207]" | "[208]" | "[209]" | "[210]" |
"[211]" | "[212]" | "[213]" | "[214]" | "[215]" | "[216]" | "[217]" | "[218]" | "[219]" | "[220]" |
"[221]" | "[222]" | "[223]" | "[224]" | "[225]" | "[226]" | "[227]" | "[228]" | "[229]" | "[230]" |
"[231]" | "[232]" | "[233]" | "[234]" | "[235]" | "[236]" | "[237]" | "[238]" | "[239]" | "[240]" |
"[241]" | "[242]" | "[243]" | "[244]" | "[245]" | "[246]" | "[247]" | "[248]" | "[249]" | "[250]" |
"[251]" | "[252]" | "[253]" | "[254]" | "[255]"
];
export kDigit = Optimize[
"0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
];
export kLower = Optimize[
"a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" |
"n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
];
export kUpper = Optimize[
"A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" |
"N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
];
export kAlpha = Optimize[kLower | kUpper];
export kAlnum = Optimize[kDigit | kAlpha];
export kSpace = Optimize[
" " | "\t" | "\n" | "\r"
];
export kNotSpace = Optimize[kBytes - kSpace];
export kPunct = Optimize[
"!" | "\"" | "#" | "$" | "%" | "&" | "'" | "(" | ")" | "*" | "+" | "," |
"-" | "." | "/" | ":" | ";" | "<" | "=" | ">" | "?" | "@" | "\[" | "\\" |
"\]" | "^" | "_" | "`" | "{" | "|" | "}" | "~"
];
export kGraph = Optimize[kAlnum | kPunct];
import 'byte.grm' as b;
import 'number.grm' as n;
date_day = n.number_1_to_99 ("日"|"号");
date_month_day = n.number_1_to_99 "月" date_day;
date_year_month_day = ((n.number_0_to_9){2,4} | n.number) "年" date_month_day;
date = date_year_month_day | date_month_day | date_day;
export DATE = CDRewrite[date, "", "", b.kBytes*];
import 'byte.grm' as b;
hotfix = StringFile['hotfix.list'];
export HOTFIX = CDRewrite[hotfix, "", "", b.kBytes*];
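# hotfix.list holds tab-separated input/output pairs with an optional
# third-column weight; a negative weight (e.g. -1.0) lowers the rewrite's
# cost in the tropical semiring, making it preferred.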
0头 零头
10字 十字
东4环 东4环 -1.0
东4 东四 -0.5
4惠 四惠
3元桥 三元桥
4平市 四平市
5台山 五台山
西2旗 西二旗
西3旗 西三旗
4道口 四道口 -1.0
5道口 五道口 -1.0
6道口 六道口 -1.0
6里桥 六里桥
7里庄 七里庄
8宝山 八宝山
9颗松 九棵松
10里堡 十里堡
import 'byte.grm' as b;
import 'number.grm' as number;
import 'hotfix.grm' as hotfix;
import 'percentage.grm' as percentage;
import 'date.grm' as date;
import 'amount.grm' as amount; # seems not useful for now
export ITN = Optimize[percentage.PERCENTAGE @ (date.DATE <-1>) @ number.NUMBER @ hotfix.HOTFIX];
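# Composition applies the rewrites left to right: percentages first, then
# dates (the <-1> weight makes date readings cheaper when rewrites compete),
# then bare numbers, with the hotfix list patching known bad cases last.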
import 'byte.grm' as b;
number_1_to_9 = (
("一":"1") | ("幺":"1") |
("二":"2") | ("两":"2") |
("三":"3") |
("四":"4") |
("五":"5") |
("六":"6") |
("七":"7") |
("八":"8") |
("九":"9")
);
export number_0_to_9 = (("零":"0") | number_1_to_9);
number_10_to_19 = (
("十":"10") |
("十一":"11") |
("十二":"12") |
("十三":"13") |
("十四":"14") |
("十五":"15") |
("十六":"16") |
("十七":"17") |
("十八":"18") |
("十九":"19")
);
number_10s = (number_1_to_9 ("十":""));
number_100s = (number_1_to_9 ("百":""));
number_1000s = (number_1_to_9 ("千":""));
number_10000s = (number_1_to_9 ("万":""));
number_10_to_99 = (
((number_10s number_1_to_9)<-0.3>) |
((number_10s ("":"0"))<-0.2>) |
(number_10_to_19 <-0.1>)
);
export number_1_to_99 = (number_1_to_9 | number_10_to_99);
number_100_to_999 = (
((number_100s ("零":"0") number_1_to_9)<0.0>)|
((number_100s number_10_to_99)<0.0>) |
((number_100s number_1_to_9 ("":"0"))<0.0>) |
((number_100s ("":"00"))<0.1>)
);
number_1000_to_9999 = (
((number_1000s number_100_to_999)<0.0>) |
((number_1000s ("零":"0") number_10_to_99)<0.0>)|
((number_1000s ("零":"00") number_1_to_9)<0.0>)|
((number_1000s ("":"000"))<1>) |
((number_1000s number_1_to_9 ("":"00"))<0.0>)
);
export number = number_1_to_99 | (number_100_to_999 <-1>) | (number_1000_to_9999 <-2>);
export NUMBER = CDRewrite[number, "", "", b.kBytes*];
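For readers unfamiliar with Thrax, the sketch below is a rough plain-Python rendering of the mapping these weighted rules define (an illustration only, not part of the grammar); like `number` above, it converts 一/两/十/百/千 numerals up to 9999 into digit strings:

```python
# Plain-Python illustration of the mapping defined by number.grm; the
# grammar itself also assigns negative weights to prefer longer rewrites.
DIGITS = {'零': 0, '一': 1, '幺': 1, '二': 2, '两': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
UNITS = {'十': 10, '百': 100, '千': 1000}

def cn_to_digits(s):
    """e.g. '二百三十' -> '230', '十五' -> '15', '一千零五' -> '1005'."""
    total, pending = 0, 0
    for ch in s:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            total += (pending or 1) * UNITS[ch]  # a bare '十' means 10
            pending = 0
        else:
            raise ValueError('unexpected character: ' + ch)
    return str(total + pending)

for cn in ('二百三十', '十五', '两千零一十', '一千零五'):
    print(cn, cn_to_digits(cn))
```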
import 'byte.grm' as b;
import 'number.grm' as n;
percentage = (
("百分之":"") n.number_1_to_99 ("":"%")
);
export PERCENTAGE = CDRewrite[percentage, "", "", b.kBytes*];
# English covering grammar definitions
This directory defines an English text normalization covering grammar. The
primary entry-point is the FST `VERBALIZER`, defined in
`verbalizer/verbalizer.grm` and compiled in the FST archive
`verbalizer/verbalizer.far`.
verbalizer.far: verbalizer.grm util/util.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far
thraxcompiler --input_grammar=$< --output_far=$@
0 zero
1 one
2 two
3 three
4 four
5 five
6 six
7 seven
8 eight
9 nine
10 ten
11 eleven
12 twelve
13 thirteen
14 fourteen
15 fifteen
16 sixteen
17 seventeen
18 eighteen
19 nineteen
20 twenty
30 thirty
40 forty
50 fifty
60 sixty
70 seventy
80 eighty
90 ninety
100 hundred
1000 thousand
1000000 million
1000000000 billion
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@");
export DIGITS = digit (n.I[" "] digit)*;
# Various common factorizations
two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS;
three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS;
mixed =
(digit n.I[" "] two_digits)
| (two_digits n.I[" "] two_digits)
| (two_digits n.I[" "] three_digits)
| (two_digits n.I[" "] two_digits n.I[" "] two_digits)
;
export MIXED_NUMBERS = Optimize[mixed];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'util/util.grm' as u;
import 'en/verbalizer/numbers.grm' as n;
func ToNumberName[expr] {
number_name_seq = n.CARDINAL_NUMBERS (" " n.CARDINAL_NUMBERS)*;
return Optimize[expr @ number_name_seq];
}
d = b.kDigit;
leading_zero = CDRewrite[n.I[" "], ("[BOS]" | " ") "0", "", b.kBytes*];
by_ones = d n.I[" "];
by_twos = (d{2} @ leading_zero) n.I[" "];
by_threes = (d{3} @ leading_zero) n.I[" "];
groupings = by_twos* (by_threes | by_twos | by_ones);
export FRACTIONAL_PART_UNGROUPED =
Optimize[ToNumberName[by_ones+ @ u.CLEAN_SPACES]]
;
export FRACTIONAL_PART_GROUPED =
Optimize[ToNumberName[groupings @ u.CLEAN_SPACES]]
;
export FRACTIONAL_PART_UNPARSED = Optimize[ToNumberName[d*]];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/factorization.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED;
fractional_part_grouped = f.FRACTIONAL_PART_GROUPED;
fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED;
__fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed;
__decimal_marker__ = ".";
export FLOAT = Optimize[
(n.CARDINAL_NUMBERS
(__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ")
__fractional_part__) @ l.LEXICAL_MAP]
;
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
lexical_map = StringFile['en/verbalizer/lexical_map.tsv'];
sigma_star = b.kBytes*;
del_null = CDRewrite["__NULL__" : "", "", "", sigma_star];
export LEXICAL_MAP = Optimize[
CDRewrite[lexical_map, "", "", sigma_star] @ del_null]
;
@@CONNECTOR_RANGE@@ to
@@CONNECTOR_RATIO@@ to
@@CONNECTOR_BY@@ by
@@CONNECTOR_CONSECUTIVE_YEAR@@ to
@@JANUARY@@ january
@@FEBRUARY@@ february
@@MARCH@@ march
@@APRIL@@ april
@@MAY@@ may
@@JUNE@@ june
@@JULY@@ july
@@AUGUST@@ august
@@SEPTEMBER@@ september
@@OCTOBER@@ october
@@NOVEMBER@@ november
@@DECEMBER@@ december
@@MINUS@@ minus
@@DECIMAL_DOT_EXPRESSION@@ point
@@URL_DOT_EXPRESSION@@ dot
@@DECIMAL_EXPONENT@@ to the
@@DECIMAL_EXPONENT@@ to the power of
@@COLON@@ colon
@@SLASH@@ slash
@@SLASH@@ forward slash
@@DASH@@ dash
@@PASSWORD@@ password
@@AT@@ at
@@PORT@@ port
@@QUESTION_MARK@@ question mark
@@HASH@@ hash
@@HASH@@ hash tag
@@FRACTION_OVER@@ over
@@MONEY_AND@@ and
@@AND@@ and
@@PHONE_PLUS@@ plus
@@PHONE_EXTENSION@@ extension
@@TIME_AM@@ a m
@@TIME_PM@@ p m
@@HOUR@@ o'clock
@@MINUTE@@ minute
@@MINUTE@@ minutes
@@TIME_AFTER@@ after
@@TIME_AFTER@@ past
@@TIME_BEFORE@@ to
@@TIME_BEFORE@@ till
@@TIME_QUARTER@@ quarter
@@TIME_HALF@@ half
@@TIME_ZERO@@ oh
@@TIME_THREE_QUARTER@@ three quarters
@@ARITHMETIC_PLUS@@ plus
@@ARITHMETIC_TIMES@@ times
@@ARITHMETIC_TIMES@@ multiplied by
@@ARITHMETIC_MINUS@@ minus
@@ARITHMETIC_DIVISION@@ divided by
@@ARITHMETIC_DIVISION@@ over
@@ARITHMETIC_EQUALS@@ equals
@@PERCENT@@ percent
@@DEGREE@@ degree
@@DEGREE@@ degrees
@@SQUARE_ROOT@@ square root of
@@SQUARE_ROOT@@ the square root of
@@STAR@@ star
@@HYPHEN@@ hyphen
@@AT@@ at
@@PER@@ per
@@PERIOD@@ period
@@PERIOD@@ full stop
@@PERIOD@@ dot
@@EXCLAMATION_MARK@@ exclamation mark
@@EXCLAMATION_MARK@@ exclamation point
@@COMMA@@ comma
@@POSITIVE@@ positive
@@NEGATIVE@@ negative
@@OTHER_ZERO_VERBALIZATIONS@@ oh
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
float = f.FLOAT;
card = n.CARDINAL_NUMBERS;
number = card | float;
plus = "+" : " @@ARITHMETIC_PLUS@@ ";
times = "*" : " @@ARITHMETIC_TIMES@@ ";
minus = "-" : " @@ARITHMETIC_MINUS@@ ";
division = "/" : " @@ARITHMETIC_DIVISION@@ ";
operator = plus | times | minus | division;
percent = "%" : " @@PERCENT@@";
export ARITHMETIC =
Optimize[((number operator number) | (number percent)) @ l.LEXICAL_MAP]
;
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'ru/classifier/cyrillic.grm' as c;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/spelled.grm' as s;
letter = b.kAlpha | c.kCyrillicAlpha;
dash = "-";
word = letter+;
possibly_split_word = word (((dash | ".") : " ") word)* n.D["."]?;
post_word_symbol =
("+" : ("@@ARITHMETIC_PLUS@@" | "@@POSITIVE@@")) |
("-" : ("@@ARITHMETIC_MINUS@@" | "@@NEGATIVE@@")) |
("*" : "@@STAR@@")
;
pre_word_symbol =
("@" : "@@AT@@") |
("/" : "@@SLASH@@") |
("#" : "@@HASH@@")
;
post_word = possibly_split_word n.I[" "] post_word_symbol;
pre_word = pre_word_symbol n.I[" "] possibly_split_word;
## Number/digit sequence combos, maybe with a dash
spelled_word = word @ s.SPELLED_NO_LETTER;
word_number =
(word | spelled_word)
(n.I[" "] | (dash : " "))
(e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS)
;
number_word =
(e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS)
(n.I[" "] | (dash : " "))
(word | spelled_word)
;
## Two-digit year.
# Note that in this case to be fair we really have to allow ordinals too since
# in some languages that's what you would have.
two_digit_year = n.D["'"] (b.kDigit{2} @ (n.CARDINAL_NUMBERS | e.DIGITS));
dot_com = ("." : "@@URL_DOT_EXPRESSION@@") n.I[" "] "com";
miscellaneous = Optimize[
possibly_split_word
| post_word
| pre_word
| word_number
| number_word
| two_digit_year
| dot_com
];
export MISCELLANEOUS = Optimize[miscellaneous @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
card = n.CARDINAL_NUMBERS;
__currency__ = StringFile['en/verbalizer/money.tsv'];
d = b.kDigit;
D = d - "0";
cents = ((n.D["0"] | D) d) @ card;
# Only dollar for the verbalizer tests for English. Will need to add other
# currencies.
usd_maj = Project["usd_maj" @ __currency__, 'output'];
usd_min = Project["usd_min" @ __currency__, 'output'];
and = " @@MONEY_AND@@ " | " ";
dollar1 =
n.D["$"] card n.I[" " usd_maj] n.I[and] n.D["."] cents n.I[" " usd_min]
;
dollar2 = n.D["$"] card n.I[" " usd_maj] n.D["."] n.D["00"];
dollar3 = n.D["$"] card n.I[" " usd_maj];
dollar = Optimize[dollar1 | dollar2 | dollar3];
export MONEY = Optimize[dollar @ l.LEXICAL_MAP];
usd_maj dollar
usd_maj dollars
usd_min cent
usd_min cents
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# English minimally supervised number grammar.
#
# Supports both cardinals and ordinals without overt marking.
#
# The language-specific acceptor G was compiled with digit, teen, and decade
# preterminals. The lexicon transducer L is unambiguous so no LM is used.
import 'util/arithmetic.grm' as a;
# Intersects the universal factorization transducer (F) with the
# language-specific acceptor (G).
d = a.DELTA_STAR;
f = a.IARITHMETIC_RESTRICTED;
g = LoadFst['en/verbalizer/g.fst'];
fg = Optimize[d @ Optimize[f @ Optimize[f @ Optimize[f @ g]]]];
test1 = AssertEqual["230" @ fg, "(+ (* 2 100 *) 30 +)"];
# Compiles lexicon transducer (L).
cardinal_name = StringFile['en/verbalizer/cardinals.tsv'];
cardinal_l = Optimize[(cardinal_name " ")* cardinal_name];
test2 = AssertEqual["2 100 30" @ cardinal_l, "two hundred thirty"];
ordinal_name = StringFile['en/verbalizer/ordinals.tsv'];
# In English, ordinals have the same syntax as cardinals and all but the final
# element is verbalized using a cardinal number word; e.g., "two hundred
# thirtieth".
ordinal_l = Optimize[(cardinal_name " ")* ordinal_name];
test3 = AssertEqual["2 100 30" @ ordinal_l, "two hundred thirtieth"];
# Composes L with the leaf transducer (P), then composes that with FG.
p = a.LEAVES;
export CARDINAL_NUMBER_NAME = Optimize[fg @ (p @ cardinal_l)];
test4 = AssertEqual["230" @ CARDINAL_NUMBER_NAME, "two hundred thirty"];
export ORDINAL_NUMBER_NAME = Optimize[fg @ (p @ ordinal_l)];
test5 = AssertEqual["230" @ ORDINAL_NUMBER_NAME, "two hundred thirtieth"];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/number_names.grm' as n;
import 'util/byte.grm' as bytelib;
import 'universal/thousands_punct.grm' as t;
cardinal = n.CARDINAL_NUMBER_NAME;
ordinal = n.ORDINAL_NUMBER_NAME;
# Putting these here since this grammar gets incorporated by all the others.
func I[expr] {
return "" : expr;
}
func D[expr] {
return expr : "";
}
separators = t.comma_thousands | t.no_delimiter;
# Language-specific endings for ordinals.
d = bytelib.kDigit;
endings = "st" | "nd" | "rd" | "th";
st = (d* "1") - (d* "11");
nd = (d* "2") - (d* "12");
rd = (d* "3") - (d* "13");
th = Optimize[d* - st - nd - rd];
first = st ("st" : "");
second = nd ("nd" : "");
third = rd ("rd" : "");
other = th ("th" : "");
marked_ordinal = Optimize[first | second | third | other];
# The separator is a no-op here but will be needed once we replace
# the above targets.
export CARDINAL_NUMBERS = Optimize[separators @ cardinal];
export ORDINAL_NUMBERS =
Optimize[(separators endings) @ marked_ordinal @ ordinal]
;
export ORDINAL_NUMBERS_UNMARKED = Optimize[separators @ ordinal];
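# Illustrative: "21st" strips "st" via `first` and verbalizes as
# "twenty first"; "11th" takes the `other` branch, giving "eleventh".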
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Grammar for things built mostly on numbers.
import 'en/verbalizer/factorization.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
num = n.CARDINAL_NUMBERS;
ord = n.ORDINAL_NUMBERS_UNMARKED;
digits = f.FRACTIONAL_PART_UNGROUPED;
# Various symbols.
plus = "+" : "@@ARITHMETIC_PLUS@@";
minus = "-" : "@@ARITHMETIC_MINUS@@";
slash = "/" : "@@SLASH@@";
dot = "." : "@@URL_DOT_EXPRESSION@@";
dash = "-" : "@@DASH@@";
equals = "=" : "@@ARITHMETIC_EQUALS@@";
degree = "°" : "@@DEGREE@@";
division = ("/" | "÷") : "@@ARITHMETIC_DIVISION@@";
times = ("x" | "*") : "@@ARITHMETIC_TIMES@@";
power = "^" : "@@DECIMAL_EXPONENT@@";
square_root = "√" : "@@SQUARE_ROOT@@";
percent = "%" : "@@PERCENT@@";
# Safe roman numbers.
# NB: Do not change the formatting here. NO_EDIT must be on the same
# line as the path.
rfile =
'universal/roman_numerals.tsv' # NO_EDIT
;
roman = StringFile[rfile];
## Main categories.
cat_dot_number =
num
n.I[" "] dot n.I[" "] num
(n.I[" "] dot n.I[" "] num)+
;
cat_slash_number =
num
n.I[" "] slash n.I[" "] num
(n.I[" "] slash n.I[" "] num)*
;
cat_dash_number =
num
n.I[" "] dash n.I[" "] num
(n.I[" "] dash n.I[" "] num)*
;
cat_signed_number = ((plus | minus) n.I[" "])? num;
cat_degree = cat_signed_number n.I[" "] degree;
cat_country_code = plus n.I[" "] (num | digits);
cat_math_operations =
plus
| minus
| division
| times
| equals
| percent
| power
| square_root
;
# Roman numbers are often either cardinals or ordinals in various languages.
cat_roman = roman @ (num | ord);
# Allow
#
# number:number
# number-number
#
# to just be
#
# number number.
cat_number_number =
num ((":" | "-") : " ") num
;
# Some additional readings for these symbols.
cat_additional_readings =
("/" : "@@PER@@") |
("+" : "@@AND@@") |
("-" : ("@@HYPHEN@@" | "@@CONNECTOR_TO@@")) |
("*" : "@@STAR@@") |
("x" : ("x" | "@@CONNECTOR_BY@@")) |
("@" : "@@AT@@")
;
numbers_plus = Optimize[
cat_dot_number
| cat_slash_number
| cat_dash_number
| cat_signed_number
| cat_degree
| cat_country_code
| cat_math_operations
| cat_roman
| cat_number_number
| cat_additional_readings
];
export NUMBERS_PLUS = Optimize[numbers_plus @ l.LEXICAL_MAP];
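# Illustrative paths (before the lexical map):
# "3-4" -> "three @@DASH@@ four"        (cat_dash_number)
# "+7"  -> "@@ARITHMETIC_PLUS@@ seven"  (cat_country_code or cat_signed_number)
# "3:4" -> "three four"                 (cat_number_number)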
0 zeroth
1 first
2 second
3 third
4 fourth
5 fifth
6 sixth
7 seventh
8 eighth
9 ninth
10 tenth
11 eleventh
12 twelfth
13 thirteenth
14 fourteenth
15 fifteenth
16 sixteenth
17 seventeenth
18 eighteenth
19 nineteenth
20 twentieth
30 thirtieth
40 fortieth
50 fiftieth
60 sixtieth
70 seventieth
80 eightieth
90 ninetieth
100 hundredth
1000 thousandth
1000000 millionth
1000000000 billionth
float.grm __fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed;
telephone.grm __grouping__ = f.UNGROUPED;
measure.grm __measure__ = StringFile['en/verbalizer/measures.tsv'];
money.grm __currency__ = StringFile['en/verbalizer/money.tsv'];
time.grm __sep__ = ":";
time.grm __am__ = "a.m." | "am" | "AM";
time.grm __pm__ = "p.m." | "pm" | "PM";
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/util.grm' as util;
import 'util/case.grm' as case;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/math.grm' as ma;
import 'en/verbalizer/miscellaneous.grm' as mi;
import 'en/verbalizer/money.grm' as mo;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/numbers_plus.grm' as np;
import 'en/verbalizer/spelled.grm' as s;
import 'en/verbalizer/spoken_punct.grm' as sp;
import 'en/verbalizer/time.grm' as t;
import 'en/verbalizer/urls.grm' as u;
export POD_SPEECH_TN = Optimize[RmWeight[
(u.URL
| e.MIXED_NUMBERS
| e.DIGITS
| f.FLOAT
| ma.ARITHMETIC
| mo.MONEY
| n.CARDINAL_NUMBERS
| n.ORDINAL_NUMBERS
| np.NUMBERS_PLUS
| s.SPELLED
| sp.SPOKEN_PUNCT
| t.TIME
| u.EMAILS) @ util.CLEAN_SPACES @ case.TOUPPER
]];
#export POD_SPEECH_TN = Optimize[RmWeight[(mi.MISCELLANEOUS) @ util.CLEAN_SPACES @ case.TOUPPER]];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This verbalizer is used whenever there is an LM symbol that consists of
# letters immediately followed by "{spelled}". This strips the "{spelled}"
# suffix.
import 'util/byte.grm' as b;
import 'ru/classifier/cyrillic.grm' as c;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS;
char_set = (("a" | "A") : "letter-a")
| (("b" | "B") : "letter-b")
| (("c" | "C") : "letter-c")
| (("d" | "D") : "letter-d")
| (("e" | "E") : "letter-e")
| (("f" | "F") : "letter-f")
| (("g" | "G") : "letter-g")
| (("h" | "H") : "letter-h")
| (("i" | "I") : "letter-i")
| (("j" | "J") : "letter-j")
| (("k" | "K") : "letter-k")
| (("l" | "L") : "letter-l")
| (("m" | "M") : "letter-m")
| (("n" | "N") : "letter-n")
| (("o" | "O") : "letter-o")
| (("p" | "P") : "letter-p")
| (("q" | "Q") : "letter-q")
| (("r" | "R") : "letter-r")
| (("s" | "S") : "letter-s")
| (("t" | "T") : "letter-t")
| (("u" | "U") : "letter-u")
| (("v" | "V") : "letter-v")
| (("w" | "W") : "letter-w")
| (("x" | "X") : "letter-x")
| (("y" | "Y") : "letter-y")
| (("z" | "Z") : "letter-z")
| (digit)
| ("&" : "@@AND@@")
| ("." : "")
| ("-" : "")
| ("_" : "")
| ("/" : "")
| (n.I["letter-"] c.kCyrillicAlpha)
;
ins_space = "" : " ";
suffix = "{spelled}" : "";
spelled = Optimize[char_set (ins_space char_set)* suffix];
export SPELLED = Optimize[spelled @ l.LEXICAL_MAP];
sigma_star = b.kBytes*;
# Gets rid of the letter- prefix since in some cases we don't want it.
del_letter = CDRewrite[n.D["letter-"], "", "", sigma_star];
spelled_no_tag = Optimize[char_set (ins_space char_set)*];
export SPELLED_NO_LETTER = Optimize[spelled_no_tag @ del_letter];
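# Illustrative: "ABC{spelled}" maps to "letter-a letter-b letter-c" before
# the lexical map, while SPELLED_NO_LETTER turns "ABC" into "a b c".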
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/lexical_map.grm' as l;
punct =
("." : "@@PERIOD@@")
| ("," : "@@COMMA@@")
| ("!" : "@@EXCLAMATION_MARK@@")
| ("?" : "@@QUESTION_MARK@@")
;
export SPOKEN_PUNCT = Optimize[punct @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
# Only handles 24-hour time with quarter-to, half-past and quarter-past.
increment_hour =
("0" : "1")
| ("1" : "2")
| ("2" : "3")
| ("3" : "4")
| ("4" : "5")
| ("5" : "6")
| ("6" : "7")
| ("7" : "8")
| ("8" : "9")
| ("9" : "10")
| ("10" : "11")
| ("11" : "12")
| ("12" : "1") # If someone uses 12, we assume 12-hour by default.
| ("13" : "14")
| ("14" : "15")
| ("15" : "16")
| ("16" : "17")
| ("17" : "18")
| ("18" : "19")
| ("19" : "20")
| ("20" : "21")
| ("21" : "22")
| ("22" : "23")
| ("23" : "12")
;
hours = Project[increment_hour, 'input'];
d = b.kDigit;
D = d - "0";
minutes09 = "0" D;
minutes = ("1" | "2" | "3" | "4" | "5") d;
__sep__ = ":";
sep_space = __sep__ : " ";
verbalize_hours = hours @ n.CARDINAL_NUMBERS;
verbalize_minutes =
("00" : "@@HOUR@@")
| (minutes09 @ (("0" : "@@TIME_ZERO@@") n.I[" "] n.CARDINAL_NUMBERS))
| (minutes @ n.CARDINAL_NUMBERS)
;
time_basic = Optimize[verbalize_hours sep_space verbalize_minutes];
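# Illustrative: "9:05" -> "nine @@TIME_ZERO@@ five" (before the lexical map).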
# Special cases we handle right now.
# TODO: Need to allow for cases like
#
# half twelve (in the UK English sense)
# half twaalf (in the Dutch sense)
time_quarter_past =
n.I["@@TIME_QUARTER@@ @@TIME_AFTER@@ "]
verbalize_hours
n.D[__sep__ "15"];
time_half_past =
n.I["@@TIME_HALF@@ @@TIME_AFTER@@ "]
verbalize_hours
n.D[__sep__ "30"];
time_quarter_to =
n.I["@@TIME_QUARTER@@ @@TIME_BEFORE@@ "]
(increment_hour @ verbalize_hours)
n.D[__sep__ "45"];
time_extra = Optimize[
time_quarter_past | time_half_past | time_quarter_to]
;
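# Illustrative: "5:30" -> "@@TIME_HALF@@ @@TIME_AFTER@@ five" and
# "2:45" -> "@@TIME_QUARTER@@ @@TIME_BEFORE@@ three" (hour incremented).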
# Basic time periods which most languages can be expected to have.
__am__ = "a.m." | "am" | "AM";
__pm__ = "p.m." | "pm" | "PM";
period = (__am__ : "@@TIME_AM@@") | (__pm__ : "@@TIME_PM@@");
time_variants = time_basic | time_extra;
time = Optimize[
(period (" " | n.I[" "]))? time_variants
| time_variants ((" " | n.I[" "]) period)?]
;
export TIME = Optimize[time @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Rules for URLs and email addresses.
import 'util/byte.grm' as bytelib;
import 'en/verbalizer/lexical_map.grm' as l;
ins_space = "" : " ";
dot = "." : "@@URL_DOT_EXPRESSION@@";
at = "@" : "@@AT@@";
url_suffix =
(".com" : dot ins_space "com") |
(".gov" : dot ins_space "gov") |
(".edu" : dot ins_space "e d u") |
(".org" : dot ins_space "org") |
(".net" : dot ins_space "net")
;
letter_string = (bytelib.kAlnum)* bytelib.kAlnum;
letter_string_dot =
((letter_string ins_space dot ins_space)* letter_string)
;
# Rules for URLs.
export URL = Optimize[
((letter_string_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
# Rules for email addresses.
letter_by_letter = ((bytelib.kAlnum ins_space)* bytelib.kAlnum);
letter_by_letter_dot =
((letter_by_letter ins_space dot ins_space)*
letter_by_letter)
;
export EMAIL1 = Optimize[
((letter_by_letter) (ins_space)
(at) (ins_space)
(letter_by_letter_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
export EMAIL2 = Optimize[
((letter_by_letter) (ins_space)
(at) (ins_space)
(letter_string_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
export EMAILS = Optimize[
EMAIL1 | EMAIL2
];
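# Illustrative: in EMAIL1 both sides of the "@" are spelled letter by
# letter, e.g. "ab" -> "a b" and "@" -> "@@AT@@"; the final suffix such
# as ".com" is read by url_suffix.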
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/util.grm' as util;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/math.grm' as ma;
import 'en/verbalizer/miscellaneous.grm' as mi;
import 'en/verbalizer/money.grm' as mo;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/numbers_plus.grm' as np;
import 'en/verbalizer/spelled.grm' as s;
import 'en/verbalizer/spoken_punct.grm' as sp;
import 'en/verbalizer/time.grm' as t;
import 'en/verbalizer/urls.grm' as u;
export VERBALIZER = Optimize[RmWeight[
( e.MIXED_NUMBERS
| e.DIGITS
| f.FLOAT
| ma.ARITHMETIC
| mi.MISCELLANEOUS
| mo.MONEY
| n.CARDINAL_NUMBERS
| n.ORDINAL_NUMBERS
| np.NUMBERS_PLUS
| s.SPELLED
| sp.SPOKEN_PUNCT
| t.TIME
| u.URL) @ util.CLEAN_SPACES
]];
This directory contains data used in:
Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization.
Transactions of the Association for Computational Linguistics 4: 507-519.
* `minimal.txt`: A list of 30 curated numbers used as the "minimal" training
set.
* `random-trn.txt`: A list of 9000 randomly-generated numbers used as the
"medium" training set.
* `random-tst.txt`: A list of 1000 randomly-generated numbers used as the test
set.
Note that `random-trn.txt` and `random-tst.txt` are totally disjoint, but that
a small number of examples occur in both `minimal.txt` and `random-tst.txt`.
For information about the sampling procedure used to generate the random data
sets, see appendix A of the aforementioned paper.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
220
221
230
300
400
500
600
700
800
900
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1020
1021
1030
1200
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2020
2021
2030
2100
2200
5001
10000
12000
20000
21000
50001
100000
120000
200000
210000
500001
1000000
1001000
1200000
2000000
2100000
5000001
10000000
10001000
12000000
20000000
50000001
100000000
100001000
120000000
200000000
500000001
1000000000
1000001000
1200000000
2000000000
5000000001
10000000000
10000001000
12000000000
20000000000
50000000001
100000000000
100000001000
120000000000
200000000000
500000000001
# Russian covering grammar definitions
This directory defines a Russian text normalization covering grammar. The
primary entry-point is the FST `VERBALIZER`, defined in
`verbalizer/verbalizer.grm` and compiled in the FST archive
`verbalizer/verbalizer.far`.
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export kRussianLowerAlpha = Optimize[
"а" | "б" | "в" | "г" | "д" | "е" | "ё" | "ж" | "з" | "и" | "й" |
"к" | "л" | "м" | "н" | "о" | "п" | "р" | "с" | "т" | "у" | "ф" |
"х" | "ц" | "ч" | "ш" | "щ" | "ъ" | "ы" | "ь" | "э" | "ю" | "я" ];
export kRussianUpperAlpha = Optimize[
"А" | "Б" | "В" | "Г" | "Д" | "Е" | "Ё" | "Ж" | "З" | "И" | "Й" |
"К" | "Л" | "М" | "Н" | "О" | "П" | "Р" | "С" | "Т" | "У" | "Ф" |
"Х" | "Ц" | "Ч" | "Ш" | "Щ" | "Ъ" | "Ы" | "Ь" | "Э" | "Ю" | "Я" ];
export kRussianLowerAlphaStressed = Optimize[
"а́" | "е́" | "ё́" | "и́" | "о́" | "у́" | "ы́" | "э́" | "ю́" | "я́" ];
export kRussianUpperAlphaStressed = Optimize[
"А́" | "Е́" | "Ё́" | "И́" | "О́" | "У́" | "Ы́" | "Э́" | "Ю́" | "Я́" ];
export kRussianRewriteStress = Optimize[
("А́" : "А'") | ("Е́" : "Е'") | ("Ё́" : "Ё'") | ("И́" : "И'") |
("О́" : "О'") | ("У́" : "У'") | ("Ы́" : "Ы'") | ("Э́" : "Э'") |
("Ю́" : "Ю'") | ("Я́" : "Я'") |
("а́" : "а'") | ("е́" : "е'") | ("ё́" : "ё'") | ("и́" : "и'") |
("о́" : "о'") | ("у́" : "у'") | ("ы́" : "ы'") | ("э́" : "э'") |
("ю́" : "ю'") | ("я́" : "я'")
];
export kRussianRemoveStress = Optimize[
("А́" : "А") | ("Е́" : "Е") | ("Ё́" : "Ё") | ("И́" : "И") | ("О́" : "О") |
("У́" : "У") | ("Ы́" : "Ы") | ("Э́" : "Э") | ("Ю́" : "Ю") | ("Я́" : "Я") |
("а́" : "а") | ("е́" : "е") | ("ё́" : "ё") | ("и́" : "и") | ("о́" : "о") |
("у́" : "у") | ("ы́" : "ы") | ("э́" : "э") | ("ю́" : "ю") | ("я́" : "я")
];
# Pre-reform characters, just in case.
export kRussianPreReform = Optimize[
"ѣ" | "Ѣ" # http://en.wikipedia.org/wiki/Yat
];
export kCyrillicAlphaStressed = Optimize[
kRussianLowerAlphaStressed | kRussianUpperAlphaStressed
];
export kCyrillicAlpha = Optimize[
kRussianLowerAlpha | kRussianUpperAlpha | kRussianPreReform
];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'ru/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@");
export DIGITS = digit (n.I[" "] digit)*;
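# Illustrative: DIGITS reads a string digit by digit, each digit as a
# cardinal, with "0" also admitting the "@@OTHER_ZERO_VERBALIZATIONS@@" reading.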
# Various common factorizations
two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS;
three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS;
mixed =
(digit n.I[" "] two_digits)
| (two_digits n.I[" "] two_digits)
| (two_digits n.I[" "] three_digits)
| (two_digits n.I[" "] two_digits n.I[" "] two_digits)
;
export MIXED_NUMBERS = Optimize[mixed];
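# Illustrative: a four-digit string such as "1234" can be read as two
# two-digit cardinals (the "12" and "34" groups, separated by a space).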
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'ru/verbalizer/factorization.grm' as f;
import 'ru/verbalizer/lexical_map.grm' as l;
import 'ru/verbalizer/numbers.grm' as n;
fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED;
fractional_part_grouped = f.FRACTIONAL_PART_GROUPED;
fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED;
__fractional_part__ = fractional_part_unparsed;
__decimal_marker__ = ",";
export FLOAT = Optimize[
(n.CARDINAL_NUMBERS
(__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ")
__fractional_part__) @ l.LEXICAL_MAP]
;
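# Illustrative: in "3,14" the comma is read as " @@DECIMAL_DOT_EXPRESSION@@ ",
# with "3" as a cardinal and "14" handled by the fractional-part rule.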