Unverified commit 538bf271, authored by: H Hui Zhang, committed by: GitHub

Chinese char/word ngram LM (#613)

* add ngram lm egs

* add zhon repo

* install kenlm, zhon

* format

* add chinese_text_normalization repo

* add ngram lm egs
Parent 2bdf4c94
.DS_Store
*.pyc
tools/venv
.vscode
*.log
*.pdmodel
@@ -10,3 +9,6 @@ tools/venv
*.tar.gz
.ipynb_checkpoints
*.npz
tools/venv
tools/kenlm
@@ -52,4 +52,4 @@ DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
We depend on many open-source repos. See [References](doc/src/reference.md) for more information.
\ No newline at end of file
We depend on many open-source repos. See [References](doc/src/reference.md) for more information.
@@ -50,4 +50,4 @@ DeepSpeech is released under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
The development drew on a number of excellent repositories; see [References](doc/src/reference.md) for details.
\ No newline at end of file
The development drew on a number of excellent repositories; see [References](doc/src/reference.md) for details.
text_correct.txt: https://github.com/shibing624/pycorrector/raw/master/tests/test_file.txt
custom_confusion.txt: https://github.com/shibing624/pycorrector/raw/master/tests/custom_confusion.txt
This diff has been collapsed.
少先队员因该为老人让坐
祛痘印可以吗?有效果吗?
不知这款牛奶口感怎样? 小孩子喝行吗!
是转基因油?
我家宝宝13斤用多大码的
会起坨吗?
请问给送上楼吗?
亲是送赁上门吗
送货时候有外包装没有还是直接发货过来
会不会有坏的?
这个米煮粥好还煮饭好吃
有送的马克杯吗?
这纸尿裤分男孩女孩使用吗
买的路由器老是断网,拔了跳过路由器就可以用了
能泡开不?辣度几
请问这个米蒸出来是一粒一粒的还是一坨一坨的?
水和其他商品一样送货上门,还是自提呀?
快两个月的孩子 要穿什么码的
买回来会不会过期?
洗的还干净把吧
路由器怎么样啊,掉线严重吗?
你好这米是五斤还是十斤
收安费不
给送开果器吗
这纸好用吗?我看有不少的差评
自用好用吗
请问袜子穿久了会往下掉吗?
每一卷是独立包装的吗?
这个火龙果口味怎么样?甜不甜?
买这个送红杯吗?
一袋子多少斤
这款拉拉裤有味道吗?超市买的没有味道,不知道这个怎么样
我想问下拉拉裤上面那个贴的用来干嘛的,怎么用
这里边有没有枣核
玫瑰和薰衣草哪个好闻
这个冰糖质量怎么样,有杂质吗
倒水的时候漏吗
请问大家,这个水壶烧出来的水有异味吗?因为给宝宝用所以很在意,谢谢大家
这米煮出来糯吗?
这在款子好用吗?有香味吗?
到底是棉花的材质还是化纤的无纺布啊 求问?
我用360手机能充电几次
亲这纸好用吗?值得买吗?
24瓶?还是12瓶
是否是真的纸?
适用机洗吗?
好吃不好吃啊
真的好用吗?我也想买 
你们拿到是什么版本的
这水和超市一样吗?质量保证吗?
可以丢进马桶冲吗?
纸会不会粗?
这个翠的还不是不催的呀。。没有吃的那种不脆
这个好用吗
这纸有香味的吗?
是最近的生产日期吗
赠品是什么呀
这是两瓶还是一瓶的价格?
请问这是硬壳还是软壳?
亲,苹果收到后有坏的吗?
适合两人用吗
这个直接喝好不好喝 还是要热一下
纸有木有刺鼻气味?
酸不酸???
这啤好渴吗?
跟安慕希哪个比较好喝?
好用么,主要是带宝宝出去玩的时候用的多?
刚出生的宝宝用什么码?
能当洗手液吗?
是不是很小包的那一种?50块有24包便宜的有点不敢相信
好用吗,会不会起会不会起坨?
这个口可以直接放饮水机上用吗?
这种纸掉粉末吗
手机好用吗?会卡吗
开盖里面是拉环的吗?
这个电池真的需要一直换吗?
好用吗?是不是正品?
请问有尿显吗
容易发烫吗
苹果有腊吗
这油有这么好吗?不是过期的吧
这个夏天用会不会红屁股?透气性好吗
你好。 我想问下这个是尿不湿吗 ?
这奶为啥这么便宜?
你们买的酱油会没有颜色吗,像水一样,看着都没胃口
这个是机诜,还是手洗
这个卫生巾带香味吗?
这种洗发水好用吗
有餡嗎?好不好吃
纸质不会好差吗?
亲们,此米是真空包装吗?
是软毛的吗?!!
请问大家德运牌子的好喝还是安佳的?
这纸好用吗,薄嘛
这壶保温吗
这个威露士货到了就是跟图片上的一样吗?只要是图片上显示的都有吗?
你们买的牛奶是最近日期吗
这个除菌液,是单独放在滚筒洗衣机除菌液格,还是与洗衣液混合放在洗衣液格?
请问你们的三只松鼠寄回来的时候是用袋子装着的吗
1kg是不是两斤?
洗衣皂怎么样啊,味道重吗,用之后好不好清洗啊。
我要请问你这个是不是那个拉拉裤吗?这个花纹是不是拉拉裤?
好多人都说小米运动升级后手环就连不上了,你们有没有这种情况?
这部手机运行速度快不快?
新生儿可以用吗 抽一张会带出来很多张吗
洗后有香味吗
体验装有多少片
银装怎么样?会漏尿吗?你们都是多久换一次的??(我家大概2-3个小时左右,宝宝醒一回换一次)
声音大吗?好用不?
抽纸有味吗
苹果好吃吗?打过蜡吗?是不是坏的很多?
70g和80g得区别是啥?
袋装的和瓶装的洗衣液是一样的么?
噪音很大吗
烧出来的水会不会很多一块一块的东西
这个吹风真心好用吗?我今晚下单什么时候到
请问各位宝妈 这个乳垫的背胶粘吗
M号的你们给宝宝用到多大啊?几个月?我家宝宝3个月5㎏重,用花王的M号觉得小了。不知道这个怎么样?
这个喝了能找到女朋友吗
这袜子耐不耐穿
请问好用么 是正品么
怎么储藏 我买了两天在常温阴凉处放着下层有些化了 需要放冰箱冷冻吗
这批苏打水是否有股消毒水的味道?
质量怎么样,看到那么多差评,我不敢买了。
会不会有烂的
为什么我买的用完之后没香味
甜吗????
我看到评论里的差评说大米里有虫,是真的吗?
要放冰箱冷藏吗
好不好吃啊
这油怎么样 炒菜香不香
这纸擦手时有屑吗?
是正品的吗?
好用吗
这个特浓的苦不苦
这个好用吗?
米里真的有虫吗
是金装的吗?
双内胆有什么区别,两个一样的吗?
请问这款水可以降尿酸吗?
好用吗这个
购物袋结实吗,能放重东西吗
你好,请问这款可以剃头发刮光头吗
这个纸巾质量如何?好用吗?
好用吗?小孩子喜欢吗?
亲。煮面时会糊锅不
包邮吗运费多少
会一抽就两三张一起抽起来吗?
一箱几桶油呀
这个吹风机分冷风和热风吗
发什么快递呢
请问一下,有些枸杞说是不要洗,你们的是否建议洗呢?
请问纸有异味吗?我以前买过一箱就是这个居然有异味。
这是6个么 怎么觉得有好多
我买的荣耀10横滑home键进入后台这个操作成功率特别低,你们也是这样吗?
你们的有塑料味吗,机械的
小米路由器真心说的有这么差吗
请问大家这款刮的干净吗?谢谢
会有塑料味吗
质量真的很差吗?不敢买
这纸有气味吗
我买两箱怎么要运费
这个标准果好吃吗,酸不酸
稀吗?是不是有种兑了水的感觉?
威露士和滴露的消毒液哪个更好用呢?
曰期是几月份的
手机容易折弯吗?
我家宝宝25斤XL会紧吗?
这款200克一箱的纸张和10卷手提的价格相差那么多 质量一样吗?
豆浆可以打吗
电量有百分比吗
用快递送过来瓶子会不会打破
是三相电吗,有空调摇控器吧
拿它送人,有问题吗??
安幕希好喝吗?
这款纸尿裤好用吗?和尤妮佳比较哪个好用些?
2层厚吗?是不是一到水就烂了
为什么我宝宝拉粑粑后面总是漏出来我已经贴的很牢了,10斤的宝宝用S号也不小啊你们用了没这种情况吗?
这个产品好用吗?
刷毛柔软度咋样,这么便宜,会不会是很小个的
会不会有过敏的情况呀
请问是辣条吗
这种米只能煮粥不能煮饭吗
可以开袋即食吗?
这米好吃吗?
这个充电宝充满电需要多久
这个奶开了可以保质喝两天吗
这种薰衣草的洗衣液怎么样
你们的小米六边框掉漆了吗???
这个是机洗用还是手洗用的啊
厚度怎么样、起球吗感谢大哥大姐们
这个好喝还是康师傅红茶好喝
这种洁面膏会不会过敏,我上次用的火山岩冰感洁面啫喱对那种过敏,但听别人说那种稀的本来就特别容易过敏,不知道这种洁面膏会不会过敏!
这杯那么多差评,是真的吗,吓得我都不敢买了
枣是免洗的吗?
这个尿不湿尿过会起坨吗
感觉和苏菲比哪个更好用呢?
煮出来的饭香吗?
你好!请问这个水壶烧水开了是自动切电吗?
这个跟 原木纯品 那个啥区别?不是原木纸浆做的?
能放冰箱吗
纸有味道吗?
2016全国高考卷答题模板
2016全国大考卷答题模板
2016全国低考卷答题模板
床前明月光,疑是地上霜
床前星星光,疑是地上霜
床前白月光,疑是地上霜
落霞与孤鹜齐飞,秋水共长天一色
落霞与孤鹜齐跑,秋水共长天一色
落霞与孤鹜双飞,秋水共长天一色
众里寻他千百度,蓦然回首,那人却在,灯火阑珊处
众里寻她千百度,蓦然回首,那人却在,灯火阑珊处
众里寻ta千百度,蓦然回首,那人却在,灯火阑珊处
吸烟的人容*得癌症
就只听着我*妈所说的话,
就接受环境污*用化肥和农药,
是或者接受环境污染用化肥和农药,
现在的香港比从前的*荣很多。
现在的香港比*前的饭荣很多。
#!/bin/bash
set -e
stage=0
stop_stage=100
order=5
mem=80%
prune=0
a=22
q=8
b=8
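# a, q, b match KenLM build_binary's trie options (-a max array bits,
# -q probability quantization bits, -b backoff quantization bits);
# parse_options.sh below accepts them as flags, though this script does
# not yet forward them to ngram_train.sh.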
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
if [ $# != 3 ]; then
echo "$0 token_type exp/text exp/text.arpa"
echo $@
exit 1
fi
# char or word
type=$1
text=$2
arpa=$3
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
# text tn & wordseg preprocess
echo "process text."
python3 ${MAIN_ROOT}/utils/zh_tn.py ${type} ${text} ${text}.${type}.tn
fi
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
# train ngram lm
echo "build lm."
bash ${MAIN_ROOT}/utils/ngram_train.sh --order ${order} --mem ${mem} --prune "${prune}" ${text}.${type}.tn ${arpa}
fi
\ No newline at end of file
#! /usr/bin/env bash
. ${MAIN_ROOT}/utils/utility.sh
DIR=data/lm
mkdir -p ${DIR}
URL='https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm'
MD5="29e02312deb2e59b3c8686c7966d4fe3"
TARGET=${DIR}/zh_giga.no_cna_cmn.prune01244.klm
echo "Download language model ..."
download $URL $MD5 $TARGET
if [ $? -ne 0 ]; then
echo "Fail to download the language model!"
exit 1
fi
exit 0
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import time
import jieba
import kenlm
language_model_path = sys.argv[1]
assert os.path.exists(language_model_path)
start = time.time()
model = kenlm.Model(language_model_path)
print(f"load kenLM cost: {time.time() - start}s")
sentence = '盘点不怕被税的海淘网站❗️海淘向来便宜又保真!'
sentence_char_split = ' '.join(list(sentence))
sentence_word_split = ' '.join(jieba.lcut(sentence))
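# Two tokenizations of the same sentence: sentence_char_split has one token
# per character ('盘 点 不 怕 ...'), while sentence_word_split uses jieba
# word tokens ('盘点 不怕 ...'); KenLM scores whitespace-separated tokens.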
def test_score():
print('Loaded language model: %s' % language_model_path)
print(sentence)
print(model.score(sentence))
print(list(model.full_scores(sentence)))
for i, v in enumerate(model.full_scores(sentence)):
print(i, v)
print(sentence_char_split)
print(model.score(sentence_char_split))
print(list(model.full_scores(sentence_char_split)))
split_size = 0
for i, v in enumerate(model.full_scores(sentence_char_split)):
print(i, v)
split_size += 1
assert split_size == len(
sentence_char_split.split()) + 1, "error split size."
print(sentence_word_split)
print(model.score(sentence_word_split))
print(list(model.full_scores(sentence_word_split)))
for i, v in enumerate(model.full_scores(sentence_word_split)):
print(i, v)
def test_full_scores_chars():
print('Loaded language model: %s' % language_model_path)
print(sentence_char_split)
# Show scores and n-gram matches
words = ['<s>'] + list(sentence) + ['</s>']
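# full_scores yields one (logprob, ngram_length, oov) tuple per token plus
# one for </s>; words[i + 2 - length:i + 2] recovers the matched n-gram,
# and words[i + 1] is the token being scored.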
for i, (prob, length,
oov) in enumerate(model.full_scores(sentence_char_split)):
print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i + 2 - length:
i + 2])))
if oov:
print('\t"{0}" is an OOV'.format(words[i + 1]))
print("-" * 42)
# Find out-of-vocabulary words
oov = []
for w in words:
if w not in model:
print('"{0}" is an OOV'.format(w))
oov.append(w)
assert oov == ["❗", "️", "!"], 'error oov'
def test_full_scores_words():
print('Loaded language model: %s' % language_model_path)
print(sentence_word_split)
# Show scores and n-gram matches
words = ['<s>'] + sentence_word_split.split() + ['</s>']
for i, (prob, length,
oov) in enumerate(model.full_scores(sentence_word_split)):
print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i + 2 - length:
i + 2])))
if oov:
print('\t"{0}" is an OOV'.format(words[i + 1]))
print("-" * 42)
# Find out-of-vocabulary words
oov = []
for w in words:
if w not in model:
print('"{0}" is an OOV'.format(w))
oov.append(w)
# zh_giga.no_cna_cmn.prune01244.klm is a Chinese character LM
assert oov == ["盘点", "不怕", "网站", "❗", "️", "海淘", "向来", "便宜", "保真",
"!"], 'error oov'
def test_full_scores_chars_length():
"""test bos eos size"""
print('Loaded language model: %s' % language_model_path)
r = list(model.full_scores(sentence_char_split))
n = list(model.full_scores(sentence_char_split, bos=False, eos=False))
print(r)
print(n)
assert len(r) == len(n) + 1
# bos=False, eos=False, input len == output len
print(len(n), len(sentence_char_split.split()))
assert len(n) == len(sentence_char_split.split())
k = list(model.full_scores(sentence_char_split, bos=False, eos=True))
print(k, len(k))
def test_ppl_sentence():
"""测试句子粒度的ppl得分"""
sentence_char_split1 = ' '.join('先救挨饿的人,然后治疗病人。')
sentence_char_split2 = ' '.join('先就挨饿的人,然后治疗病人。')
n = model.perplexity(sentence_char_split1)
print('1', n)
n = model.perplexity(sentence_char_split2)
print(n)
part_char_split1 = ' '.join('先救挨饿的人')
part_char_split2 = ' '.join('先就挨饿的人')
n = model.perplexity(part_char_split1)
print('2', n)
n = model.perplexity(part_char_split2)
print(n)
part_char_split1 = '先救挨'
part_char_split2 = '先就挨'
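# Without spaces, each whole string is a single (OOV) token to the model,
# so both variants get exactly the same perplexity.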
n1 = model.perplexity(part_char_split1)
print('3', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
assert n1 == n2
part_char_split1 = '先 救 挨'
part_char_split2 = '先 就 挨'
n1 = model.perplexity(part_char_split1)
print('4', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人'
part_char_split2 = '先 就 挨 饿 的 人'
n1 = model.perplexity(part_char_split1)
print('5', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 ,'
part_char_split2 = '先 就 挨 饿 的 人 ,'
n1 = model.perplexity(part_char_split1)
print('6', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 , 然 后 治 疗 病 人'
part_char_split2 = '先 就 挨 饿 的 人 , 然 后 治 疗 病 人'
n1 = model.perplexity(part_char_split1)
print('7', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
part_char_split1 = '先 救 挨 饿 的 人 , 然 后 治 疗 病 人 。'
part_char_split2 = '先 就 挨 饿 的 人 , 然 后 治 疗 病 人 。'
n1 = model.perplexity(part_char_split1)
print('8', n1)
n2 = model.perplexity(part_char_split2)
print(n2)
if __name__ == '__main__':
test_score()
test_full_scores_chars()
test_full_scores_words()
test_full_scores_chars_length()
test_ppl_sentence()
export MAIN_ROOT=${PWD}/../../
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/local/lib/:${LD_LIBRARY_PATH}
\ No newline at end of file
jieba>=0.39
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
stage=0
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit -1
python3 -c 'import kenlm;' || { echo "kenlm package not installed!"; exit -1; }
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
# case 1, test kenlm
# download language model
bash local/download_lm_zh.sh
if [ $? -ne 0 ]; then
exit 1
fi
# test kenlm `score` and `full_scores`
python local/kenlm_score_test.py data/lm/zh_giga.no_cna_cmn.prune01244.klm
fi
mkdir -p exp
cp data/text_correct.txt exp/text
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
# case 2, Chinese character ngram LM build
# output: xxx.arpa xxx.kenlm.bin
input=exp/text
token_type=char
lang=zh
order=5
prune="0 1 2 4 4"
a=22
q=8
b=8
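# ${prune// /_} replaces the spaces in ${prune} with underscores, so
# prune="0 1 2 4 4" becomes p0_1_2_4_4 in the output file name.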
output=${input}_${lang}_${token_type}_o${order}_p${prune// /_}_a${a}_q${q}_b${b}.arpa
echo "build ${token_type} lm."
bash local/build_zh_lm.sh --order ${order} --prune "${prune}" --a ${a} --q ${q} --b ${b} ${token_type} ${input} ${output}
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
# case 3, Chinese word ngram LM build
# output: xxx.arpa xxx.kenlm.bin
input=exp/text
token_type=word
lang=zh
order=3
prune="0 0 0"
a=22
q=8
b=8
output=${input}_${lang}_${token_type}_o${order}_p${prune// /_}_a${a}_q${q}_b${b}.arpa
echo "build ${token_type} lm."
bash local/build_zh_lm.sh --order ${order} --prune "${prune}" --a ${a} --q ${q} --b ${b} ${token_type} ${input} ${output}
fi
@@ -57,11 +57,11 @@ if [ $? != 0 ]; then
fi
# install kaldi-compatible feature
pushd third_party/python_kaldi_features/
python setup.py install
# install third_party
pushd third_party
bash install.sh
if [ $? != 0 ]; then
error_msg "Please check why kaldi feature install error!"
error_msg "Please check why third_party install error!"
exit -1
fi
popd
......
* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926
licence: MIT
* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
licence: MIT
* [zhon](https://github.com/tsroten/zhon)
commit: 09bf543696277f71de502506984661a60d24494c
licence: MIT
* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
licence: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
licence: MIT
MIT License
Copyright (c) 2020 SpeechIO
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Chinese Text Normalization for Speech Processing
## Problem
Search for "Text Normalization" (TN) on Google and GitHub, and you can hardly find open-source projects that are "ready-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that *support* TN functionality. There is quite some work between "supporting text normalization" and "doing text normalization".
## Reason
* TN is language-dependent, more or less.
Some TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatment.
* TN is task-specific.
Even for the same language, different applications require quite different TN.
* TN is "dirty"
Constructing and maintaining a set of TN rewrite-rules is painful, whatever toolkits and frameworks you choose. Subtle and intrinsic complexities hide inside TN task itself, not in tools or frameworks.
* A mature TN module is an asset.
Since constructing and maintaining TN is hard, a good TN module is actually an asset for commercial companies, so you are unlikely to find a product-level TN module in the open-source community (correct me if you find one).
* TN is a less important topic in both academia and industry.
## Goal
This project sets up a ready-to-use TN module for **Chinese**. Since my background is **speech processing**, this project should be able to handle most common TN tasks in **Chinese ASR** text processing pipelines.
## Normalizers
1. Supported NSW (Non-Standard Word) normalization
|NSW type|raw|normalized|
|-|-|-|
|cardinal|这块黄金重达324.75克|这块黄金重达三百二十四点七五克|
|date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日|
|digit|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
|fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
|money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五 三十四点五元 二十点一万|
|percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨|
|telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|
Acknowledgement: the NSW normalization code is based on [Zhiyang Zhou's work here](https://github.com/Joee1995/chn_text_norm.git). (A minimal Python sketch of the digit and punctuation normalizers follows this list.)
1. Punctuation removal
For Chinese, it removes the punctuation collected in the [Zhon](https://github.com/tsroten/zhon) project, consisting of
* non-stop puncs
```
'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
```
* stop puncs
```
'!?。。'
```
For English, it removes Python's `string.punctuation`.
1. Multilingual English word upper/lower case conversion
Since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt to the lexicon accordingly.
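As a rough illustration of the digit and punctuation normalizers above (a sketch only; the project's actual implementation lives in `cn_tn.py` and covers all the NSW types in the table), the snippet below verbalizes digit sequences character by character and strips the punctuation sets quoted above:

```python
# Sketch of two of the normalizers described above; not the project's
# actual implementation.
DIGIT_NAMES = dict(zip('0123456789', '零一二三四五六七八九'))

# Punctuation sets quoted above (collected from the Zhon project).
NON_STOP_PUNCS = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏"
STOP_PUNCS = '!?。。'

def verbalize_digits(text):
    """Read digits one by one, e.g. '编号27149' -> '编号二七一四九'."""
    return ''.join(DIGIT_NAMES.get(ch, ch) for ch in text)

def remove_punctuation(text):
    return text.translate({ord(ch): None for ch in NON_STOP_PUNCS + STOP_PUNCS})

print(verbalize_digits('电影中梁朝伟扮演的陈永仁的编号27149'))
print(remove_punctuation('今天早饭吃了没?'))
```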
## Supported text format
1. Plain text, preferably one sentence per line (the most common case in ASR processing).
```
今天早饭吃了没
没吃回家吃去吧
...
```
Plain text is the default format.
2. Kaldi's transcription format
```
KALDI_KEY_UTT001 今天早饭吃了没
KALDI_KEY_UTT002 没吃回家吃去吧
...
```
TN will skip the first-column key and normalize the transcription text that follows.
Pass the `--has_key` option to switch to the Kaldi format.
_note: All input text should be UTF-8 encoded._
## Run examples
* TN (python)
Make sure you have **python3**; Python 2.X won't work correctly.
Run `sh run.sh` in the `TN` dir, and compare the raw and normalized text.
* ITN (thrax)
Make sure you have **thrax** installed and that your PATH can find the thrax binaries.
Run `sh run.sh` in the `ITN` dir; check the Makefile for grammar dependencies.
## Possible future work
Since TN is a typical "done is better than perfect" module in the context of ASR, and the current state is sufficient for my purpose, I probably won't update this repo frequently.
There are indeed some things that need to be improved:
* For TN, the NSW normalizers in the TN dir are based on regular expressions. I've found some unintended matches; those pattern regexps need to be refined for more precise TN coverage.
* For ITN, extend those thrax rewriting grammars to cover more scenarios.
* Furthermore, commercial systems nowadays are starting to introduce RNN-like models into TN, and a mixed (rule-based & model-based) system is state-of-the-art. For more reading on this, look for Richard Sproat and Kyle Gorman's work at Google.
This diff has been collapsed.
UTT000 这块黄金重达324.75克
UTT001 她出生于86年8月18日,她弟弟出生于1995年3月1日
UTT002 电影中梁朝伟扮演的陈永仁的编号27149
UTT003 现场有7/12的观众投出了赞成票
UTT004 随便来几个价格12块5,34.5元,20.1万
UTT005 明天有62%的概率降雨
UTT006 这是固话0421-33441122或这是手机+86 18544139121
这块黄金重达324.75克
她出生于86年8月18日,她弟弟出生于1995年3月1日
电影中梁朝伟扮演的陈永仁的编号27149
现场有7/12的观众投出了赞成票
随便来几个价格12块5,34.5元,20.1万
明天有62%的概率降雨
这是固话0421-33441122或这是手机+86 18544139121
# for plain text
python3 cn_tn.py example_plain.txt output_plain.txt
diff example_plain.txt output_plain.txt
# for Kaldi's trans format
python3 cn_tn.py --has_key example_kaldi.txt output_kaldi.txt
diff example_kaldi.txt output_kaldi.txt
0. Place install_thrax.sh into $KALDI_ROOT/tools/extras/
1. Recompile OpenFst with the "--enable-grm" option required to support thrax:
* cd $KALDI_ROOT/tools
* make clean
* edit $KALDI_ROOT/tools/Makefile, append "--enable-grm" option to OPENFST_CONFIGURE:
OPENFST_CONFIGURE ?= --enable-static --enable-shared --enable-far --enable-ngram-fsts --enable-lookahead-fsts --with-pic --enable-grm
* make -j 10
2. install thrax
cd $KALDI_ROOT/tools
sh extras/install_thrax.sh
3. add thrax binary path into $KALDI_ROOT/tools/env.sh:
export PATH=/path/to/your/kaldi_root/tools/thrax-1.2.9/src/bin:${PATH}
usage:
before you run anything related to thrax, use:
. $KALDI_ROOT/tools/env.sh
so the thrax binaries can be found, as we always do in kaldi.
sample usage:
sh run_en.sh
sh run_cn.sh
#!/bin/bash
## This script should be placed under $KALDI_ROOT/tools/extras/; see INSTALL.txt for the installation guide
if [ ! -f thrax-1.2.9.tar.gz ]; then
wget http://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.2.9.tar.gz
tar -zxf thrax-1.2.9.tar.gz
fi
cd thrax-1.2.9
OPENFSTPREFIX=`pwd`/../openfst
LDFLAGS="-L${OPENFSTPREFIX}/lib" CXXFLAGS="-I${OPENFSTPREFIX}/include" ./configure --prefix ${OPENFSTPREFIX}
make -j 10; make install
cd ..
cd src/cn
thraxmakedep itn.grm
make
#thraxrewrite-tester --far=itn.far --rules=ITN
cat ../../testcase_cn.txt | thraxrewrite-tester --far=itn.far --rules=ITN
cd -
cd src
thraxmakedep en/verbalizer/podspeech.grm
make
cat ../testcase_en.txt
cat ../testcase_en.txt | thraxrewrite-tester --far=en/verbalizer/podspeech.far --rules=POD_SPEECH_TN
cd -
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
en/verbalizer/podspeech.far: en/verbalizer/podspeech.grm util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far
thraxcompiler --input_grammar=$< --output_far=$@
util/util.far: util/util.grm util/byte.far util/case.far
thraxcompiler --input_grammar=$< --output_far=$@
util/byte.far: util/byte.grm
thraxcompiler --input_grammar=$< --output_far=$@
util/case.far: util/case.grm util/byte.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/extra_numbers.far: en/verbalizer/extra_numbers.grm util/byte.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/numbers.far: en/verbalizer/numbers.grm en/verbalizer/number_names.far util/byte.far universal/thousands_punct.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/number_names.far: en/verbalizer/number_names.grm util/arithmetic.far en/verbalizer/g.fst en/verbalizer/cardinals.tsv en/verbalizer/ordinals.tsv
thraxcompiler --input_grammar=$< --output_far=$@
util/arithmetic.far: util/arithmetic.grm util/byte.far util/germanic.tsv
thraxcompiler --input_grammar=$< --output_far=$@
universal/thousands_punct.far: universal/thousands_punct.grm util/byte.far util/util.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/float.far: en/verbalizer/float.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/factorization.far: en/verbalizer/factorization.grm util/byte.far util/util.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/lexical_map.far: en/verbalizer/lexical_map.grm util/byte.far en/verbalizer/lexical_map.tsv
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/math.far: en/verbalizer/math.grm en/verbalizer/float.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/miscellaneous.far: en/verbalizer/miscellaneous.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/extra_numbers.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/spelled.far
thraxcompiler --input_grammar=$< --output_far=$@
ru/classifier/cyrillic.far: ru/classifier/cyrillic.grm
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/spelled.far: en/verbalizer/spelled.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/money.far: en/verbalizer/money.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/money.tsv
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/numbers_plus.far: en/verbalizer/numbers_plus.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/spoken_punct.far: en/verbalizer/spoken_punct.grm en/verbalizer/lexical_map.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/time.far: en/verbalizer/time.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far
thraxcompiler --input_grammar=$< --output_far=$@
en/verbalizer/urls.far: en/verbalizer/urls.grm util/byte.far en/verbalizer/lexical_map.far
thraxcompiler --input_grammar=$< --output_far=$@
clean:
rm -f util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far util/byte.far en/verbalizer/number_names.far universal/thousands_punct.far util/arithmetic.far en/verbalizer/factorization.far en/verbalizer/lexical_map.far ru/classifier/cyrillic.far
# Text normalization covering grammars
This repository provides covering grammars for English and Russian text normalization as
documented in:
Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization.
_Transactions of the Association for Computational Linguistics_ 4: 507-519.
Ng, A. H., Gorman, K., and Sproat, R. 2017. Minimally supervised
written-to-spoken text normalization. In _ASRU_, pages 665-670.
If you use these grammars in a publication, we would appreciate it if you cite these works.
## Building
The grammars are written in [Thrax](http://thrax.opengrm.org) and compile into [OpenFst](http://www.openfst.org) FAR (FstARchive) files. To compile, simply run `make` in the `src/` directory.
## License
See `LICENSE`.
## Mandatory disclaimer
This is not an official Google product.
itn.far: itn.grm byte.far number.far hotfix.far percentage.far date.far amount.far
thraxcompiler --input_grammar=$< --output_far=$@
byte.far: byte.grm
thraxcompiler --input_grammar=$< --output_far=$@
number.far: number.grm byte.far
thraxcompiler --input_grammar=$< --output_far=$@
hotfix.far: hotfix.grm byte.far hotfix.list
thraxcompiler --input_grammar=$< --output_far=$@
percentage.far: percentage.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
date.far: date.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
amount.far: amount.grm byte.far number.far
thraxcompiler --input_grammar=$< --output_far=$@
clean:
rm -f byte.far number.far hotfix.far percentage.far date.far amount.far
import 'byte.grm' as b;
import 'number.grm' as n;
unit = (
"匹"|"张"|"座"|"回"|"场"|"尾"|"条"|"个"|"首"|"阙"|"阵"|"网"|"炮"|
"顶"|"丘"|"棵"|"只"|"支"|"袭"|"辆"|"挑"|"担"|"颗"|"壳"|"窠"|"曲"|
"墙"|"群"|"腔"|"砣"|"座"|"客"|"贯"|"扎"|"捆"|"刀"|"令"|"打"|"手"|
"罗"|"坡"|"山"|"岭"|"江"|"溪"|"钟"|"队"|"单"|"双"|"对"|"出"|"口"|
"头"|"脚"|"板"|"跳"|"枝"|"件"|"贴"|"针"|"线"|"管"|"名"|"位"|"身"|
"堂"|"课"|"本"|"页"|"家"|"户"|"层"|"丝"|"毫"|"厘"|"分"|"钱"|"两"|
"斤"|"担"|"铢"|"石"|"钧"|"锱"|"忽"|"毫"|"厘"|"分"|"寸"|"尺"|"丈"|
"里"|"寻"|"常"|"铺"|"程"|"撮"|"勺"|"合"|"升"|"斗"|"石"|"盘"|"碗"|
"碟"|"叠"|"桶"|"笼"|"盆"|"盒"|"杯"|"钟"|"斛"|"锅"|"簋"|"篮"|"盘"|
"桶"|"罐"|"瓶"|"壶"|"卮"|"盏"|"箩"|"箱"|"煲"|"啖"|"袋"|"钵"|"年"|
"月"|"日"|"季"|"刻"|"时"|"周"|"天"|"秒"|"分"|"旬"|"纪"|"岁"|"世"|
"更"|"夜"|"春"|"夏"|"秋"|"冬"|"代"|"伏"|"辈"|"丸"|"泡"|"粒"|"颗"|
"幢"|"堆"|"条"|"根"|"支"|"道"|"面"|"片"|"张"|"颗"|"块"|
(("千克":"kg")|("毫克":"mg")|("微克":"µg"))|
(("千米":"km")|("厘米":"cm")|("毫米":"mm")|("微米":"µm")|("纳米":"nm"))
);
amount = n.number unit;
export AMOUNT = CDRewrite[amount, "", "", b.kBytes*];
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Copyright 2005-2011 Google, Inc.
# Author: ttai@google.com (Terry Tai)
# Standard constants for ASCII (byte) based strings. This mirrors the
# functions provided by C/C++'s ctype.h library.
# Note that [0] is missing. Matching the string-termination character is kinda weird.
export kBytes = Optimize[
"[1]" | "[2]" | "[3]" | "[4]" | "[5]" | "[6]" | "[7]" | "[8]" | "[9]" | "[10]" |
"[11]" | "[12]" | "[13]" | "[14]" | "[15]" | "[16]" | "[17]" | "[18]" | "[19]" | "[20]" |
"[21]" | "[22]" | "[23]" | "[24]" | "[25]" | "[26]" | "[27]" | "[28]" | "[29]" | "[30]" |
"[31]" | "[32]" | "[33]" | "[34]" | "[35]" | "[36]" | "[37]" | "[38]" | "[39]" | "[40]" |
"[41]" | "[42]" | "[43]" | "[44]" | "[45]" | "[46]" | "[47]" | "[48]" | "[49]" | "[50]" |
"[51]" | "[52]" | "[53]" | "[54]" | "[55]" | "[56]" | "[57]" | "[58]" | "[59]" | "[60]" |
"[61]" | "[62]" | "[63]" | "[64]" | "[65]" | "[66]" | "[67]" | "[68]" | "[69]" | "[70]" |
"[71]" | "[72]" | "[73]" | "[74]" | "[75]" | "[76]" | "[77]" | "[78]" | "[79]" | "[80]" |
"[81]" | "[82]" | "[83]" | "[84]" | "[85]" | "[86]" | "[87]" | "[88]" | "[89]" | "[90]" |
"[91]" | "[92]" | "[93]" | "[94]" | "[95]" | "[96]" | "[97]" | "[98]" | "[99]" | "[100]" |
"[101]" | "[102]" | "[103]" | "[104]" | "[105]" | "[106]" | "[107]" | "[108]" | "[109]" | "[110]" |
"[111]" | "[112]" | "[113]" | "[114]" | "[115]" | "[116]" | "[117]" | "[118]" | "[119]" | "[120]" |
"[121]" | "[122]" | "[123]" | "[124]" | "[125]" | "[126]" | "[127]" | "[128]" | "[129]" | "[130]" |
"[131]" | "[132]" | "[133]" | "[134]" | "[135]" | "[136]" | "[137]" | "[138]" | "[139]" | "[140]" |
"[141]" | "[142]" | "[143]" | "[144]" | "[145]" | "[146]" | "[147]" | "[148]" | "[149]" | "[150]" |
"[151]" | "[152]" | "[153]" | "[154]" | "[155]" | "[156]" | "[157]" | "[158]" | "[159]" | "[160]" |
"[161]" | "[162]" | "[163]" | "[164]" | "[165]" | "[166]" | "[167]" | "[168]" | "[169]" | "[170]" |
"[171]" | "[172]" | "[173]" | "[174]" | "[175]" | "[176]" | "[177]" | "[178]" | "[179]" | "[180]" |
"[181]" | "[182]" | "[183]" | "[184]" | "[185]" | "[186]" | "[187]" | "[188]" | "[189]" | "[190]" |
"[191]" | "[192]" | "[193]" | "[194]" | "[195]" | "[196]" | "[197]" | "[198]" | "[199]" | "[200]" |
"[201]" | "[202]" | "[203]" | "[204]" | "[205]" | "[206]" | "[207]" | "[208]" | "[209]" | "[210]" |
"[211]" | "[212]" | "[213]" | "[214]" | "[215]" | "[216]" | "[217]" | "[218]" | "[219]" | "[220]" |
"[221]" | "[222]" | "[223]" | "[224]" | "[225]" | "[226]" | "[227]" | "[228]" | "[229]" | "[230]" |
"[231]" | "[232]" | "[233]" | "[234]" | "[235]" | "[236]" | "[237]" | "[238]" | "[239]" | "[240]" |
"[241]" | "[242]" | "[243]" | "[244]" | "[245]" | "[246]" | "[247]" | "[248]" | "[249]" | "[250]" |
"[251]" | "[252]" | "[253]" | "[254]" | "[255]"
];
export kDigit = Optimize[
"0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
];
export kLower = Optimize[
"a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" |
"n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
];
export kUpper = Optimize[
"A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" |
"N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
];
export kAlpha = Optimize[kLower | kUpper];
export kAlnum = Optimize[kDigit | kAlpha];
export kSpace = Optimize[
" " | "\t" | "\n" | "\r"
];
export kNotSpace = Optimize[kBytes - kSpace];
export kPunct = Optimize[
"!" | "\"" | "#" | "$" | "%" | "&" | "'" | "(" | ")" | "*" | "+" | "," |
"-" | "." | "/" | ":" | ";" | "<" | "=" | ">" | "?" | "@" | "\[" | "\\" |
"\]" | "^" | "_" | "`" | "{" | "|" | "}" | "~"
];
export kGraph = Optimize[kAlnum | kPunct];
import 'byte.grm' as b;
import 'number.grm' as n;
date_day = n.number_1_to_99 ("日"|"号");
date_month_day = n.number_1_to_99 "月" date_day;
date_year_month_day = ((n.number_0_to_9){2,4} | n.number) "年" date_month_day;
date = date_year_month_day | date_month_day | date_day;
export DATE = CDRewrite[date, "", "", b.kBytes*];
import 'byte.grm' as b;
hotfix = StringFile['hotfix.list'];
export HOTFIX = CDRewrite[hotfix, "", "", b.kBytes*];
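# hotfix.list holds tab-separated input/output pairs with an optional
# third-column weight; a negative weight (e.g. -1.0) lowers the rewrite's
# cost in the tropical semiring, making it preferred.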
0头 零头
10字 十字
东4环 东4环 -1.0
东4 东四 -0.5
4惠 四惠
3元桥 三元桥
4平市 四平市
5台山 五台山
西2旗 西二旗
西3旗 西三旗
4道口 四道口 -1.0
5道口 五道口 -1.0
6道口 六道口 -1.0
6里桥 六里桥
7里庄 七里庄
8宝山 八宝山
9颗松 九棵松
10里堡 十里堡
import 'byte.grm' as b;
import 'number.grm' as number;
import 'hotfix.grm' as hotfix;
import 'percentage.grm' as percentage;
import 'date.grm' as date;
import 'amount.grm' as amount; # seems not useful for now
export ITN = Optimize[percentage.PERCENTAGE @ (date.DATE <-1>) @ number.NUMBER @ hotfix.HOTFIX];
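# Composition applies the rewrites left to right: percentages first, then
# dates (the <-1> weight makes date readings cheaper when rewrites compete),
# then bare numbers, with the hotfix list patching known bad cases last.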
import 'byte.grm' as b;
number_1_to_9 = (
("一":"1") | ("幺":"1") |
("二":"2") | ("两":"2") |
("三":"3") |
("四":"4") |
("五":"5") |
("六":"6") |
("七":"7") |
("八":"8") |
("九":"9")
);
export number_0_to_9 = (("零":"0") | number_1_to_9);
number_10_to_19 = (
("十":"10") |
("十一":"11") |
("十二":"12") |
("十三":"13") |
("十四":"14") |
("十五":"15") |
("十六":"16") |
("十七":"17") |
("十八":"18") |
("十九":"19")
);
number_10s = (number_1_to_9 ("十":""));
number_100s = (number_1_to_9 ("百":""));
number_1000s = (number_1_to_9 ("千":""));
number_10000s = (number_1_to_9 ("万":""));
number_10_to_99 = (
((number_10s number_1_to_9)<-0.3>) |
((number_10s ("":"0"))<-0.2>) |
(number_10_to_19 <-0.1>)
);
export number_1_to_99 = (number_1_to_9 | number_10_to_99);
number_100_to_999 = (
((number_100s ("零":"0") number_1_to_9)<0.0>)|
((number_100s number_10_to_99)<0.0>) |
((number_100s number_1_to_9 ("":"0"))<0.0>) |
((number_100s ("":"00"))<0.1>)
);
number_1000_to_9999 = (
((number_1000s number_100_to_999)<0.0>) |
((number_1000s ("零":"0") number_10_to_99)<0.0>)|
((number_1000s ("零":"00") number_1_to_9)<0.0>)|
((number_1000s ("":"000"))<1>) |
((number_1000s number_1_to_9 ("":"00"))<0.0>)
);
export number = number_1_to_99 | (number_100_to_999 <-1>) | (number_1000_to_9999 <-2>);
export NUMBER = CDRewrite[number, "", "", b.kBytes*];
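For readers unfamiliar with Thrax, the sketch below is a rough plain-Python rendering of the mapping these weighted rules define (an illustration only, not part of the grammar); like `number` above, it converts 一/两/十/百/千 numerals up to 9999 into digit strings:

```python
# Plain-Python illustration of the mapping defined by number.grm; the
# grammar itself also assigns negative weights to prefer longer rewrites.
DIGITS = {'零': 0, '一': 1, '幺': 1, '二': 2, '两': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
UNITS = {'十': 10, '百': 100, '千': 1000}

def cn_to_digits(s):
    """e.g. '二百三十' -> '230', '十五' -> '15', '一千零五' -> '1005'."""
    total, pending = 0, 0
    for ch in s:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            total += (pending or 1) * UNITS[ch]  # a bare '十' means 10
            pending = 0
        else:
            raise ValueError('unexpected character: ' + ch)
    return str(total + pending)

for cn in ('二百三十', '十五', '两千零一十', '一千零五'):
    print(cn, cn_to_digits(cn))
```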
import 'byte.grm' as b;
import 'number.grm' as n;
percentage = (
("百分之":"") n.number_1_to_99 ("":"%")
);
export PERCENTAGE = CDRewrite[percentage, "", "", b.kBytes*];
# English covering grammar definitions
This directory defines an English text normalization covering grammar. The
primary entry-point is the FST `VERBALIZER`, defined in
`verbalizer/verbalizer.grm` and compiled in the FST archive
`verbalizer/verbalizer.far`.
verbalizer.far: verbalizer.grm util/util.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far
thraxcompiler --input_grammar=$< --output_far=$@
0 zero
1 one
2 two
3 three
4 four
5 five
6 six
7 seven
8 eight
9 nine
10 ten
11 eleven
12 twelve
13 thirteen
14 fourteen
15 fifteen
16 sixteen
17 seventeen
18 eighteen
19 nineteen
20 twenty
30 thirty
40 forty
50 fifty
60 sixty
70 seventy
80 eighty
90 ninety
100 hundred
1000 thousand
1000000 million
1000000000 billion
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@");
export DIGITS = digit (n.I[" "] digit)*;
# Various common factorizations
two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS;
three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS;
mixed =
(digit n.I[" "] two_digits)
| (two_digits n.I[" "] two_digits)
| (two_digits n.I[" "] three_digits)
| (two_digits n.I[" "] two_digits n.I[" "] two_digits)
;
export MIXED_NUMBERS = Optimize[mixed];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'util/util.grm' as u;
import 'en/verbalizer/numbers.grm' as n;
func ToNumberName[expr] {
number_name_seq = n.CARDINAL_NUMBERS (" " n.CARDINAL_NUMBERS)*;
return Optimize[expr @ number_name_seq];
}
d = b.kDigit;
leading_zero = CDRewrite[n.I[" "], ("[BOS]" | " ") "0", "", b.kBytes*];
by_ones = d n.I[" "];
by_twos = (d{2} @ leading_zero) n.I[" "];
by_threes = (d{3} @ leading_zero) n.I[" "];
groupings = by_twos* (by_threes | by_twos | by_ones);
export FRACTIONAL_PART_UNGROUPED =
Optimize[ToNumberName[by_ones+ @ u.CLEAN_SPACES]]
;
export FRACTIONAL_PART_GROUPED =
Optimize[ToNumberName[groupings @ u.CLEAN_SPACES]]
;
export FRACTIONAL_PART_UNPARSED = Optimize[ToNumberName[d*]];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/factorization.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED;
fractional_part_grouped = f.FRACTIONAL_PART_GROUPED;
fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED;
__fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed;
__decimal_marker__ = ".";
export FLOAT = Optimize[
(n.CARDINAL_NUMBERS
(__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ")
__fractional_part__) @ l.LEXICAL_MAP]
;
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
lexical_map = StringFile['en/verbalizer/lexical_map.tsv'];
sigma_star = b.kBytes*;
del_null = CDRewrite["__NULL__" : "", "", "", sigma_star];
export LEXICAL_MAP = Optimize[
CDRewrite[lexical_map, "", "", sigma_star] @ del_null]
;
@@CONNECTOR_RANGE@@ to
@@CONNECTOR_RATIO@@ to
@@CONNECTOR_BY@@ by
@@CONNECTOR_CONSECUTIVE_YEAR@@ to
@@JANUARY@@ january
@@FEBRUARY@@ february
@@MARCH@@ march
@@APRIL@@ april
@@MAY@@ may
@@JUNE@@ june
@@JULY@@ july
@@AUGUST@@ august
@@SEPTEMBER@@ september
@@OCTOBER@@ october
@@NOVEMBER@@ november
@@DECEMBER@@ december
@@MINUS@@ minus
@@DECIMAL_DOT_EXPRESSION@@ point
@@URL_DOT_EXPRESSION@@ dot
@@DECIMAL_EXPONENT@@ to the
@@DECIMAL_EXPONENT@@ to the power of
@@COLON@@ colon
@@SLASH@@ slash
@@SLASH@@ forward slash
@@DASH@@ dash
@@PASSWORD@@ password
@@AT@@ at
@@PORT@@ port
@@QUESTION_MARK@@ question mark
@@HASH@@ hash
@@HASH@@ hash tag
@@FRACTION_OVER@@ over
@@MONEY_AND@@ and
@@AND@@ and
@@PHONE_PLUS@@ plus
@@PHONE_EXTENSION@@ extension
@@TIME_AM@@ a m
@@TIME_PM@@ p m
@@HOUR@@ o'clock
@@MINUTE@@ minute
@@MINUTE@@ minutes
@@TIME_AFTER@@ after
@@TIME_AFTER@@ past
@@TIME_BEFORE@@ to
@@TIME_BEFORE@@ till
@@TIME_QUARTER@@ quarter
@@TIME_HALF@@ half
@@TIME_ZERO@@ oh
@@TIME_THREE_QUARTER@@ three quarters
@@ARITHMETIC_PLUS@@ plus
@@ARITHMETIC_TIMES@@ times
@@ARITHMETIC_TIMES@@ multiplied by
@@ARITHMETIC_MINUS@@ minus
@@ARITHMETIC_DIVISION@@ divided by
@@ARITHMETIC_DIVISION@@ over
@@ARITHMETIC_EQUALS@@ equals
@@PERCENT@@ percent
@@DEGREE@@ degree
@@DEGREE@@ degrees
@@SQUARE_ROOT@@ square root of
@@SQUARE_ROOT@@ the square root of
@@STAR@@ star
@@HYPHEN@@ hyphen
@@AT@@ at
@@PER@@ per
@@PERIOD@@ period
@@PERIOD@@ full stop
@@PERIOD@@ dot
@@EXCLAMATION_MARK@@ exclamation mark
@@EXCLAMATION_MARK@@ exclamation point
@@COMMA@@ comma
@@POSITIVE@@ positive
@@NEGATIVE@@ negative
@@OTHER_ZERO_VERBALIZATIONS@@ oh
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
float = f.FLOAT;
card = n.CARDINAL_NUMBERS;
number = card | float;
plus = "+" : " @@ARITHMETIC_PLUS@@ ";
times = "*" : " @@ARITHMETIC_TIMES@@ ";
minus = "-" : " @@ARITHMETIC_MINUS@@ ";
division = "/" : " @@ARITHMETIC_DIVISION@@ ";
operator = plus | times | minus | division;
percent = "%" : " @@PERCENT@@";
export ARITHMETIC =
Optimize[((number operator number) | (number percent)) @ l.LEXICAL_MAP]
;
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'ru/classifier/cyrillic.grm' as c;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/spelled.grm' as s;
letter = b.kAlpha | c.kCyrillicAlpha;
dash = "-";
word = letter+;
possibly_split_word = word (((dash | ".") : " ") word)* n.D["."]?;
post_word_symbol =
("+" : ("@@ARITHMETIC_PLUS@@" | "@@POSITIVE@@")) |
("-" : ("@@ARITHMETIC_MINUS@@" | "@@NEGATIVE@@")) |
("*" : "@@STAR@@")
;
pre_word_symbol =
("@" : "@@AT@@") |
("/" : "@@SLASH@@") |
("#" : "@@HASH@@")
;
post_word = possibly_split_word n.I[" "] post_word_symbol;
pre_word = pre_word_symbol n.I[" "] possibly_split_word;
## Number/digit sequence combos, maybe with a dash
spelled_word = word @ s.SPELLED_NO_LETTER;
word_number =
(word | spelled_word)
(n.I[" "] | (dash : " "))
(e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS)
;
number_word =
(e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS)
(n.I[" "] | (dash : " "))
(word | spelled_word)
;
## Two-digit year.
# Note that in this case to be fair we really have to allow ordinals too since
# in some languages that's what you would have.
two_digit_year = n.D["'"] (b.kDigit{2} @ (n.CARDINAL_NUMBERS | e.DIGITS));
dot_com = ("." : "@@URL_DOT_EXPRESSION@@") n.I[" "] "com";
miscellaneous = Optimize[
possibly_split_word
| post_word
| pre_word
| word_number
| number_word
| two_digit_year
| dot_com
];
export MISCELLANEOUS = Optimize[miscellaneous @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
card = n.CARDINAL_NUMBERS;
__currency__ = StringFile['en/verbalizer/money.tsv'];
d = b.kDigit;
D = d - "0";
cents = ((n.D["0"] | D) d) @ card;
# Only dollar for the verbalizer tests for English. Will need to add other
# currencies.
usd_maj = Project["usd_maj" @ __currency__, 'output'];
usd_min = Project["usd_min" @ __currency__, 'output'];
and = " @@MONEY_AND@@ " | " ";
dollar1 =
n.D["$"] card n.I[" " usd_maj] n.I[and] n.D["."] cents n.I[" " usd_min]
;
dollar2 = n.D["$"] card n.I[" " usd_maj] n.D["."] n.D["00"];
dollar3 = n.D["$"] card n.I[" " usd_maj];
dollar = Optimize[dollar1 | dollar2 | dollar3];
export MONEY = Optimize[dollar @ l.LEXICAL_MAP];
usd_maj dollar
usd_maj dollars
usd_min cent
usd_min cents
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# English minimally supervised number grammar.
#
# Supports both cardinals and ordinals without overt marking.
#
# The language-specific acceptor G was compiled with digit, teen, and decade
# preterminals. The lexicon transducer L is unambiguous so no LM is used.
import 'util/arithmetic.grm' as a;
# Intersects the universal factorization transducer (F) with the
# language-specific acceptor (G).
d = a.DELTA_STAR;
f = a.IARITHMETIC_RESTRICTED;
g = LoadFst['en/verbalizer/g.fst'];
fg = Optimize[d @ Optimize[f @ Optimize[f @ Optimize[f @ g]]]];
test1 = AssertEqual["230" @ fg, "(+ (* 2 100 *) 30 +)"];
# Compiles lexicon transducer (L).
cardinal_name = StringFile['en/verbalizer/cardinals.tsv'];
cardinal_l = Optimize[(cardinal_name " ")* cardinal_name];
test2 = AssertEqual["2 100 30" @ cardinal_l, "two hundred thirty"];
ordinal_name = StringFile['en/verbalizer/ordinals.tsv'];
# In English, ordinals have the same syntax as cardinals and all but the final
# element is verbalized using a cardinal number word; e.g., "two hundred
# thirtieth".
ordinal_l = Optimize[(cardinal_name " ")* ordinal_name];
test3 = AssertEqual["2 100 30" @ ordinal_l, "two hundred thirtieth"];
# Composes L with the leaf transducer (P), then composes that with FG.
p = a.LEAVES;
export CARDINAL_NUMBER_NAME = Optimize[fg @ (p @ cardinal_l)];
test4 = AssertEqual["230" @ CARDINAL_NUMBER_NAME, "two hundred thirty"];
export ORDINAL_NUMBER_NAME = Optimize[fg @ (p @ ordinal_l)];
test5 = AssertEqual["230" @ ORDINAL_NUMBER_NAME, "two hundred thirtieth"];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/number_names.grm' as n;
import 'util/byte.grm' as bytelib;
import 'universal/thousands_punct.grm' as t;
cardinal = n.CARDINAL_NUMBER_NAME;
ordinal = n.ORDINAL_NUMBER_NAME;
# Putting these here since this grammar gets incorporated by all the others.
func I[expr] {
return "" : expr;
}
func D[expr] {
return expr : "";
}
separators = t.comma_thousands | t.no_delimiter;
# Language-specific endings for ordinals.
d = bytelib.kDigit;
endings = "st" | "nd" | "rd" | "th";
st = (d* "1") - (d* "11");
nd = (d* "2") - (d* "12");
rd = (d* "3") - (d* "13");
th = Optimize[d* - st - nd - rd];
first = st ("st" : "");
second = nd ("nd" : "");
third = rd ("rd" : "");
other = th ("th" : "");
marked_ordinal = Optimize[first | second | third | other];
# The separator is a no-op here but will be needed once we replace
# the above targets.
export CARDINAL_NUMBERS = Optimize[separators @ cardinal];
export ORDINAL_NUMBERS =
Optimize[(separators endings) @ marked_ordinal @ ordinal]
;
export ORDINAL_NUMBERS_UNMARKED = Optimize[separators @ ordinal];
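# Illustrative: "21st" strips "st" via `first` and verbalizes as
# "twenty first"; "11th" takes the `other` branch, giving "eleventh".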
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Grammar for things built mostly on numbers.
import 'en/verbalizer/factorization.grm' as f;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
num = n.CARDINAL_NUMBERS;
ord = n.ORDINAL_NUMBERS_UNMARKED;
digits = f.FRACTIONAL_PART_UNGROUPED;
# Various symbols.
plus = "+" : "@@ARITHMETIC_PLUS@@";
minus = "-" : "@@ARITHMETIC_MINUS@@";
slash = "/" : "@@SLASH@@";
dot = "." : "@@URL_DOT_EXPRESSION@@";
dash = "-" : "@@DASH@@";
equals = "=" : "@@ARITHMETIC_EQUALS@@";
degree = "°" : "@@DEGREE@@";
division = ("/" | "÷") : "@@ARITHMETIC_DIVISION@@";
times = ("x" | "*") : "@@ARITHMETIC_TIMES@@";
power = "^" : "@@DECIMAL_EXPONENT@@";
square_root = "√" : "@@SQUARE_ROOT@@";
percent = "%" : "@@PERCENT@@";
# Safe roman numbers.
# NB: Do not change the formatting here. NO_EDIT must be on the same
# line as the path.
rfile =
'universal/roman_numerals.tsv' # NO_EDIT
;
roman = StringFile[rfile];
## Main categories.
cat_dot_number =
num
n.I[" "] dot n.I[" "] num
(n.I[" "] dot n.I[" "] num)+
;
cat_slash_number =
num
n.I[" "] slash n.I[" "] num
(n.I[" "] slash n.I[" "] num)*
;
cat_dash_number =
num
n.I[" "] dash n.I[" "] num
(n.I[" "] dash n.I[" "] num)*
;
cat_signed_number = ((plus | minus) n.I[" "])? num;
cat_degree = cat_signed_number n.I[" "] degree;
cat_country_code = plus n.I[" "] (num | digits);
cat_math_operations =
plus
| minus
| division
| times
| equals
| percent
| power
| square_root
;
# Roman numbers are often either cardinals or ordinals in various languages.
cat_roman = roman @ (num | ord);
# Allow
#
# number:number
# number-number
#
# to just be
#
# number number.
cat_number_number =
num ((":" | "-") : " ") num
;
# Some additional readings for these symbols.
cat_additional_readings =
("/" : "@@PER@@") |
("+" : "@@AND@@") |
("-" : ("@@HYPHEN@@" | "@@CONNECTOR_TO@@")) |
("*" : "@@STAR@@") |
("x" : ("x" | "@@CONNECTOR_BY@@")) |
("@" : "@@AT@@")
;
numbers_plus = Optimize[
cat_dot_number
| cat_slash_number
| cat_dash_number
| cat_signed_number
| cat_degree
| cat_country_code
| cat_math_operations
| cat_roman
| cat_number_number
| cat_additional_readings
];
export NUMBERS_PLUS = Optimize[numbers_plus @ l.LEXICAL_MAP];
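# Illustrative paths (before the lexical map):
# "3-4" -> "three @@DASH@@ four"        (cat_dash_number)
# "+7"  -> "@@ARITHMETIC_PLUS@@ seven"  (cat_country_code or cat_signed_number)
# "3:4" -> "three four"                 (cat_number_number)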
0 zeroth
1 first
2 second
3 third
4 fourth
5 fifth
6 sixth
7 seventh
8 eighth
9 ninth
10 tenth
11 eleventh
12 twelfth
13 thirteenth
14 fourteenth
15 fifteenth
16 sixteenth
17 seventeenth
18 eighteenth
19 nineteenth
20 twentieth
30 thirtieth
40 fortieth
50 fiftieth
60 sixtieth
70 seventieth
80 eightieth
90 ninetieth
100 hundredth
1000 thousandth
1000000 millionth
1000000000 billionth
float.grm __fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed;
telephone.grm __grouping__ = f.UNGROUPED;
measure.grm __measure__ = StringFile['en/verbalizer/measures.tsv'];
money.grm __currency__ = StringFile['en/verbalizer/money.tsv'];
time.grm __sep__ = ":";
time.grm __am__ = "a.m." | "am" | "AM";
time.grm __pm__ = "p.m." | "pm" | "PM";
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/util.grm' as util;
import 'util/case.grm' as case;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/math.grm' as ma;
import 'en/verbalizer/miscellaneous.grm' as mi;
import 'en/verbalizer/money.grm' as mo;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/numbers_plus.grm' as np;
import 'en/verbalizer/spelled.grm' as s;
import 'en/verbalizer/spoken_punct.grm' as sp;
import 'en/verbalizer/time.grm' as t;
import 'en/verbalizer/urls.grm' as u;
export POD_SPEECH_TN = Optimize[RmWeight[
(u.URL
| e.MIXED_NUMBERS
| e.DIGITS
| f.FLOAT
| ma.ARITHMETIC
| mo.MONEY
| n.CARDINAL_NUMBERS
| n.ORDINAL_NUMBERS
| np.NUMBERS_PLUS
| s.SPELLED
| sp.SPOKEN_PUNCT
| t.TIME
| u.EMAILS) @ util.CLEAN_SPACES @ case.TOUPPER
]];
#export POD_SPEECH_TN = Optimize[RmWeight[(mi.MISCELLANEOUS) @ util.CLEAN_SPACES @ case.TOUPPER]];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This verbalizer is used whenever there is an LM symbol that consists of
# letters immediately followed by "{spelled}". This strips the "{spelled}"
# suffix.
import 'util/byte.grm' as b;
import 'ru/classifier/cyrillic.grm' as c;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS;
char_set = (("a" | "A") : "letter-a")
| (("b" | "B") : "letter-b")
| (("c" | "C") : "letter-c")
| (("d" | "D") : "letter-d")
| (("e" | "E") : "letter-e")
| (("f" | "F") : "letter-f")
| (("g" | "G") : "letter-g")
| (("h" | "H") : "letter-h")
| (("i" | "I") : "letter-i")
| (("j" | "J") : "letter-j")
| (("k" | "K") : "letter-k")
| (("l" | "L") : "letter-l")
| (("m" | "M") : "letter-m")
| (("n" | "N") : "letter-n")
| (("o" | "O") : "letter-o")
| (("p" | "P") : "letter-p")
| (("q" | "Q") : "letter-q")
| (("r" | "R") : "letter-r")
| (("s" | "S") : "letter-s")
| (("t" | "T") : "letter-t")
| (("u" | "U") : "letter-u")
| (("v" | "V") : "letter-v")
| (("w" | "W") : "letter-w")
| (("x" | "X") : "letter-x")
| (("y" | "Y") : "letter-y")
| (("z" | "Z") : "letter-z")
| (digit)
| ("&" : "@@AND@@")
| ("." : "")
| ("-" : "")
| ("_" : "")
| ("/" : "")
| (n.I["letter-"] c.kCyrillicAlpha)
;
ins_space = "" : " ";
suffix = "{spelled}" : "";
spelled = Optimize[char_set (ins_space char_set)* suffix];
export SPELLED = Optimize[spelled @ l.LEXICAL_MAP];
sigma_star = b.kBytes*;
# Gets rid of the letter- prefix since in some cases we don't want it.
del_letter = CDRewrite[n.D["letter-"], "", "", sigma_star];
spelled_no_tag = Optimize[char_set (ins_space char_set)*];
export SPELLED_NO_LETTER = Optimize[spelled_no_tag @ del_letter];
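# Illustrative: "ABC{spelled}" maps to "letter-a letter-b letter-c" before
# the lexical map, while SPELLED_NO_LETTER turns "ABC" into "a b c".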
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'en/verbalizer/lexical_map.grm' as l;
punct =
("." : "@@PERIOD@@")
| ("," : "@@COMMA@@")
| ("!" : "@@EXCLAMATION_MARK@@")
| ("?" : "@@QUESTION_MARK@@")
;
export SPOKEN_PUNCT = Optimize[punct @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'en/verbalizer/lexical_map.grm' as l;
import 'en/verbalizer/numbers.grm' as n;
# Only handles 24-hour time with quarter-to, half-past and quarter-past.
increment_hour =
("0" : "1")
| ("1" : "2")
| ("2" : "3")
| ("3" : "4")
| ("4" : "5")
| ("5" : "6")
| ("6" : "7")
| ("7" : "8")
| ("8" : "9")
| ("9" : "10")
| ("10" : "11")
| ("11" : "12")
| ("12" : "1") # If someone uses 12, we assume 12-hour by default.
| ("13" : "14")
| ("14" : "15")
| ("15" : "16")
| ("16" : "17")
| ("17" : "18")
| ("18" : "19")
| ("19" : "20")
| ("20" : "21")
| ("21" : "22")
| ("22" : "23")
| ("23" : "12")
;
hours = Project[increment_hour, 'input'];
d = b.kDigit;
D = d - "0";
minutes09 = "0" D;
minutes = ("1" | "2" | "3" | "4" | "5") d;
__sep__ = ":";
sep_space = __sep__ : " ";
verbalize_hours = hours @ n.CARDINAL_NUMBERS;
verbalize_minutes =
("00" : "@@HOUR@@")
| (minutes09 @ (("0" : "@@TIME_ZERO@@") n.I[" "] n.CARDINAL_NUMBERS))
| (minutes @ n.CARDINAL_NUMBERS)
;
time_basic = Optimize[verbalize_hours sep_space verbalize_minutes];
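# Illustrative: "9:05" -> "nine @@TIME_ZERO@@ five" (before the lexical map).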
# Special cases we handle right now.
# TODO: Need to allow for cases like
#
# half twelve (in the UK English sense)
# half twaalf (in the Dutch sense)
time_quarter_past =
n.I["@@TIME_QUARTER@@ @@TIME_AFTER@@ "]
verbalize_hours
n.D[__sep__ "15"];
time_half_past =
n.I["@@TIME_HALF@@ @@TIME_AFTER@@ "]
verbalize_hours
n.D[__sep__ "30"];
time_quarter_to =
n.I["@@TIME_QUARTER@@ @@TIME_BEFORE@@ "]
(increment_hour @ verbalize_hours)
n.D[__sep__ "45"];
time_extra = Optimize[
time_quarter_past | time_half_past | time_quarter_to]
;
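# Illustrative: "5:30" -> "@@TIME_HALF@@ @@TIME_AFTER@@ five" and
# "2:45" -> "@@TIME_QUARTER@@ @@TIME_BEFORE@@ three" (hour incremented).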
# Basic time periods which most languages can be expected to have.
__am__ = "a.m." | "am" | "AM";
__pm__ = "p.m." | "pm" | "PM";
period = (__am__ : "@@TIME_AM@@") | (__pm__ : "@@TIME_PM@@");
time_variants = time_basic | time_extra;
time = Optimize[
(period (" " | n.I[" "]))? time_variants
| time_variants ((" " | n.I[" "]) period)?]
;
export TIME = Optimize[time @ l.LEXICAL_MAP];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Rules for URLs and email addresses.
import 'util/byte.grm' as bytelib;
import 'en/verbalizer/lexical_map.grm' as l;
ins_space = "" : " ";
dot = "." : "@@URL_DOT_EXPRESSION@@";
at = "@" : "@@AT@@";
url_suffix =
(".com" : dot ins_space "com") |
(".gov" : dot ins_space "gov") |
(".edu" : dot ins_space "e d u") |
(".org" : dot ins_space "org") |
(".net" : dot ins_space "net")
;
letter_string = (bytelib.kAlnum)* bytelib.kAlnum;
letter_string_dot =
((letter_string ins_space dot ins_space)* letter_string)
;
# Rules for URLs.
export URL = Optimize[
((letter_string_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
# Rules for email addresses.
letter_by_letter = ((bytelib.kAlnum ins_space)* bytelib.kAlnum);
letter_by_letter_dot =
((letter_by_letter ins_space dot ins_space)*
letter_by_letter)
;
export EMAIL1 = Optimize[
((letter_by_letter) (ins_space)
(at) (ins_space)
(letter_by_letter_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
export EMAIL2 = Optimize[
((letter_by_letter) (ins_space)
(at) (ins_space)
(letter_string_dot) (ins_space)
(url_suffix)) @ l.LEXICAL_MAP
];
export EMAILS = Optimize[
EMAIL1 | EMAIL2
];
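# Illustrative: in EMAIL1 both sides of the "@" are spelled letter by
# letter, e.g. "ab" -> "a b" and "@" -> "@@AT@@"; the final suffix such
# as ".com" is read by url_suffix.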
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/util.grm' as util;
import 'en/verbalizer/extra_numbers.grm' as e;
import 'en/verbalizer/float.grm' as f;
import 'en/verbalizer/math.grm' as ma;
import 'en/verbalizer/miscellaneous.grm' as mi;
import 'en/verbalizer/money.grm' as mo;
import 'en/verbalizer/numbers.grm' as n;
import 'en/verbalizer/numbers_plus.grm' as np;
import 'en/verbalizer/spelled.grm' as s;
import 'en/verbalizer/spoken_punct.grm' as sp;
import 'en/verbalizer/time.grm' as t;
import 'en/verbalizer/urls.grm' as u;
export VERBALIZER = Optimize[RmWeight[
( e.MIXED_NUMBERS
| e.DIGITS
| f.FLOAT
| ma.ARITHMETIC
| mi.MISCELLANEOUS
| mo.MONEY
| n.CARDINAL_NUMBERS
| n.ORDINAL_NUMBERS
| np.NUMBERS_PLUS
| s.SPELLED
| sp.SPOKEN_PUNCT
| t.TIME
| u.URL) @ util.CLEAN_SPACES
]];
This directory contains data used in:
Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization.
Transactions of the Association for Computational Linguistics 4: 507-519.
* `minimal.txt`: A list of 30 curated numbers used as the "minimal" training
set.
* `random-trn.txt`: A list of 9000 randomly-generated numbers used as the
"medium" training set.
* `random-tst.txt`: A list of 1000 randomly-generated numbers used as the test
set.
Note that `random-trn.txt` and `random-tst.txt` are totally disjoint, but that
a small number of examples occur in both `minimal.txt` and `random-tst.txt`.
For information about the sampling procedure used to generate the random data
sets, see appendix A of the aforementioned paper.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
220
221
230
300
400
500
600
700
800
900
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1020
1021
1030
1200
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2020
2021
2030
2100
2200
5001
10000
12000
20000
21000
50001
100000
120000
200000
210000
500001
1000000
1001000
1200000
2000000
2100000
5000001
10000000
10001000
12000000
20000000
50000001
100000000
100001000
120000000
200000000
500000001
1000000000
1000001000
1200000000
2000000000
5000000001
10000000000
10000001000
12000000000
20000000000
50000000001
100000000000
100000001000
120000000000
200000000000
500000000001
# Russian covering grammar definitions
This directory defines a Russian text normalization covering grammar. The
primary entry-point is the FST `VERBALIZER`, defined in
`verbalizer/verbalizer.grm` and compiled in the FST archive
`verbalizer/verbalizer.far`.
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export kRussianLowerAlpha = Optimize[
"а" | "б" | "в" | "г" | "д" | "е" | "ё" | "ж" | "з" | "и" | "й" |
"к" | "л" | "м" | "н" | "о" | "п" | "р" | "с" | "т" | "у" | "ф" |
"х" | "ц" | "ч" | "ш" | "щ" | "ъ" | "ы" | "ь" | "э" | "ю" | "я" ];
export kRussianUpperAlpha = Optimize[
"А" | "Б" | "В" | "Г" | "Д" | "Е" | "Ё" | "Ж" | "З" | "И" | "Й" |
"К" | "Л" | "М" | "Н" | "О" | "П" | "Р" | "С" | "Т" | "У" | "Ф" |
"Х" | "Ц" | "Ч" | "Ш" | "Щ" | "Ъ" | "Ы" | "Ь" | "Э" | "Ю" | "Я" ];
export kRussianLowerAlphaStressed = Optimize[
"а́" | "е́" | "ё́" | "и́" | "о́" | "у́" | "ы́" | "э́" | "ю́" | "я́" ];
export kRussianUpperAlphaStressed = Optimize[
"А́" | "Е́" | "Ё́" | "И́" | "О́" | "У́" | "Ы́" | "Э́" | "Ю́" | "Я́" ];
export kRussianRewriteStress = Optimize[
("А́" : "А'") | ("Е́" : "Е'") | ("Ё́" : "Ё'") | ("И́" : "И'") |
("О́" : "О'") | ("У́" : "У'") | ("Ы́" : "Ы'") | ("Э́" : "Э'") |
("Ю́" : "Ю'") | ("Я́" : "Я'") |
("а́" : "а'") | ("е́" : "е'") | ("ё́" : "ё'") | ("и́" : "и'") |
("о́" : "о'") | ("у́" : "у'") | ("ы́" : "ы'") | ("э́" : "э'") |
("ю́" : "ю'") | ("я́" : "я'")
];
export kRussianRemoveStress = Optimize[
("А́" : "А") | ("Е́" : "Е") | ("Ё́" : "Ё") | ("И́" : "И") | ("О́" : "О") |
("У́" : "У") | ("Ы́" : "Ы") | ("Э́" : "Э") | ("Ю́" : "Ю") | ("Я́" : "Я") |
("а́" : "а") | ("е́" : "е") | ("ё́" : "ё") | ("и́" : "и") | ("о́" : "о") |
("у́" : "у") | ("ы́" : "ы") | ("э́" : "э") | ("ю́" : "ю") | ("я́" : "я")
];
# Pre-reform characters, just in case.
export kRussianPreReform = Optimize[
"ѣ" | "Ѣ" # http://en.wikipedia.org/wiki/Yat
];
export kCyrillicAlphaStressed = Optimize[
kRussianLowerAlphaStressed | kRussianUpperAlphaStressed
];
export kCyrillicAlpha = Optimize[
kRussianLowerAlpha | kRussianUpperAlpha | kRussianPreReform
];
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'util/byte.grm' as b;
import 'ru/verbalizer/numbers.grm' as n;
digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@");
export DIGITS = digit (n.I[" "] digit)*;
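# Illustrative: DIGITS reads a string digit by digit, each digit as a
# cardinal, with "0" also admitting the "@@OTHER_ZERO_VERBALIZATIONS@@" reading.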
# Various common factorizations
two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS;
three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS;
mixed =
(digit n.I[" "] two_digits)
| (two_digits n.I[" "] two_digits)
| (two_digits n.I[" "] three_digits)
| (two_digits n.I[" "] two_digits n.I[" "] two_digits)
;
export MIXED_NUMBERS = Optimize[mixed];
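# Illustrative: a four-digit string such as "1234" can be read as two
# two-digit cardinals (the "12" and "34" groups, separated by a space).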
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import 'ru/verbalizer/factorization.grm' as f;
import 'ru/verbalizer/lexical_map.grm' as l;
import 'ru/verbalizer/numbers.grm' as n;
fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED;
fractional_part_grouped = f.FRACTIONAL_PART_GROUPED;
fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED;
__fractional_part__ = fractional_part_unparsed;
__decimal_marker__ = ",";
export FLOAT = Optimize[
(n.CARDINAL_NUMBERS
(__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ")
__fractional_part__) @ l.LEXICAL_MAP]
;
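# Illustrative: in "3,14" the comma is read as " @@DECIMAL_DOT_EXPRESSION@@ ",
# with "3" as a cardinal and "14" handled by the fractional-part rule.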