PaddlePaddle / DeepSpeech · Commit f39de8d7
Unverified commit f39de8d7, authored on Apr 21, 2022 by YangZhou, committed by GitHub on Apr 21, 2022.
Merge pull request #1729 from zh794390558/ngram
[speechx] speedup ngram building
Parents: 0186f522, c938a450
Showing 8 changed files with 136816 additions and 34 deletions (+136816 -34).
- speechx/examples/ds2_ol/aishell/local/split_data.sh (+12 -6)
- speechx/examples/ds2_ol/aishell/run.sh (+22 -21)
- speechx/examples/ngram/zh/local/aishell_train_lms.sh (+13 -6)
- speechx/examples/ngram/zh/local/split_data.sh (+30 -0)
- speechx/examples/text_lm/README.md (+30 -1)
- speechx/examples/text_lm/local/data/chars.dic (+12638 -0)
- speechx/examples/text_lm/local/data/words.dic (+123691 -0)
- speechx/examples/text_lm/local/mmseg.py (+380 -0)
speechx/examples/ds2_ol/aishell/local/split_data.sh
```diff
 #!/usr/bin/env bash
+set -eo pipefail
 
 data=$1
-feat_scp=$2
-split_feat_name=$3
+scp=$2
+split_name=$3
 numsplit=$4
 
+# save in $data/split{n}
+# $scp to split
+#
+
-if ! [ "$numsplit" -gt 0 ]; then
+if [[ ! $numsplit -gt 0 ]]; then
     echo "Invalid num-split argument";
     exit 1;
 fi
 
 directories=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n; done)
-feat_split_scp=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n/${split_feat_name}; done)
+scp_splits=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n/${split_name}; done)
 
-echo $feat_split_scp
 # if this mkdir fails due to argument-list being too long, iterate.
 if ! mkdir -p $directories >&/dev/null; then
     for n in `seq $numsplit`; do
@@ -21,4 +26,5 @@ if ! mkdir -p $directories >&/dev/null; then
     done
 fi
 
-utils/split_scp.pl $feat_scp $feat_split_scp
+echo "utils/split_scp.pl $scp $scp_splits"
+utils/split_scp.pl $scp $scp_splits
```
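After this change the helper is a generic scp splitter rather than a feature-scp-specific one. A hypothetical invocation and the layout it produces (the paths below are examples, not taken from the repo):

```bash
# Hedged example: split data/text into 4 pieces for 4 parallel jobs.
./local/split_data.sh data data/text text 4
# Expected layout afterwards, one piece per job:
#   data/split4/1/text  data/split4/2/text  data/split4/3/text  data/split4/4/text
```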
speechx/examples/ds2_ol/aishell/run.sh
```diff
@@ -29,7 +29,7 @@ vocb_dir=$ckpt_dir/data/lang_char/
 mkdir -p exp
 exp=$PWD/exp
 
-if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     aishell_wav_scp=aishell_test.scp
     if [ ! -d $data/test ]; then
         pushd $data
@@ -42,11 +42,12 @@ if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
         paste $data/utt_id $data/wavlist > $data/$aishell_wav_scp
     fi
 
-    if [ ! -f $ckpt_dir/data/mean_std.json ]; then
+    if [ ! -d $ckpt_dir ]; then
         mkdir -p $ckpt_dir
-        wget -P $ckpt_dir -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
-        tar xzfv $model_dir/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz -C $ckpt_dir
+        pushd $ckpt_dir
+        wget -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
+        tar xzfv asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
+        popd
     fi
 
     lm=$data/zh_giga.no_cna_cmn.prune01244.klm
@@ -65,7 +66,7 @@ wer=./aishell_wer
 
 export GLOG_logtostderr=1
 
-if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
     # 3. gen linear feat
     cmvn=$data/cmvn.ark
     cmvn-json2kaldi --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn
@@ -80,7 +81,7 @@ if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
         --streaming_chunk=0.36
 fi
 
-if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     # recognizer
     utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wolm.log \
     ctc-prefix-beam-search-decoder-ol \
@@ -92,10 +93,10 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
         --result_wspecifier=ark,t:$data/split${nj}/JOB/result
 
     cat $data/split${nj}/*/result > $exp/${label_file}
-    utils/compute-wer.py --char=1 --v=1 $exp/${label_file} $text > $exp/${wer}
+    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file} > $exp/${wer}
 fi
 
-if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
     # decode with lm
     utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.lm.log \
     ctc-prefix-beam-search-decoder-ol \
@@ -108,21 +109,21 @@ if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then
         --result_wspecifier=ark,t:$data/split${nj}/JOB/result_lm
 
     cat $data/split${nj}/*/result_lm > $exp/${label_file}_lm
-    utils/compute-wer.py --char=1 --v=1 $exp/${label_file}_lm $text > $exp/${wer}_lm
+    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_lm > $exp/${wer}.lm
 fi
 
-wfst=$data/wfst/
-graph_dir=$wfst/aishell_graph
-mkdir -p $wfst
-if [ ! -f $wfst/aishell_graph.zip ]; then
-    pushd $wfst
-    wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip
-    unzip aishell_graph.zip
-    popd
-fi
-
-if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    wfst=$data/wfst/
+    mkdir -p $wfst
+    if [ ! -f $wfst/aishell_graph.zip ]; then
+        pushd $wfst
+        wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip
+        unzip aishell_graph.zip
+        popd
+    fi
+    graph_dir=$wfst/aishell_graph
+
     # TLG decoder
     utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wfst.log \
     wfst-decoder-ol \
@@ -136,5 +137,5 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
         --result_wspecifier=ark,t:$data/split${nj}/JOB/result_tlg
 
     cat $data/split${nj}/*/result_tlg > $exp/${label_file}_tlg
-    utils/compute-wer.py --char=1 --v=1 $exp/${label_file}_tlg $text > $exp/${wer}_tlg
+    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_tlg > $exp/${wer}.tlg
 fi
```
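The compute-wer.py changes above are easy to miss among the brace edits: after this commit the reference transcript ($text) is passed first and the decoded result second. A minimal sketch of the corrected call, with hypothetical paths standing in for $text and $exp/${label_file}:

```bash
# Hedged sketch (paths are placeholders, not from the repo):
# reference transcripts first, decoded hypotheses second.
ref=data/test/text
hyp=exp/aishell.result
utils/compute-wer.py --char=1 --v=1 $ref $hyp > exp/aishell.wer
```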
speechx/examples/ngram/zh/local/aishell_train_lms.sh
```diff
@@ -3,6 +3,7 @@
 # To be run from one directory above this script.
 
 . ./path.sh
+nj=40
 
 text=data/local/lm/text
 lexicon=data/local/dict/lexicon.txt
@@ -31,21 +32,27 @@ cleantext=$dir/text.no_oov
 # oov to <SPOKEN_NOISE>
 # lexicon line: word char0 ... charn
 # text line: utt word0 ... wordn -> line: <SPOKEN_NOISE> word0 ... wordn
-cat $text | awk -v lex=$lexicon 'BEGIN{while((getline<lex) >0){ seen[$1]=1; } }
-  {for(n=1; n<=NF;n++) { if (seen[$n]) { printf("%s ", $n); } else {printf("<SPOKEN_NOISE> ");} } printf("\n");}' \
-  > $cleantext || exit 1;
+text_dir=$(dirname $text)
+split_name=$(basename $text)
+./local/split_data.sh $text_dir $text $split_name $nj
+utils/run.pl JOB=1:$nj $text_dir/split${nj}/JOB/${split_name}.no_oov.log \
+  cat ${text_dir}/split${nj}/JOB/${split_name} \| \
+  awk -v lex=$lexicon 'BEGIN{while((getline<lex) >0){ seen[$1]=1; } }
+  {for(n=1; n<=NF;n++) { if (seen[$n]) { printf("%s ", $n); } else {printf("<SPOKEN_NOISE> ");} } printf("\n");}' \
+  \> ${text_dir}/split${nj}/JOB/${split_name}.no_oov || exit 1;
+cat ${text_dir}/split${nj}/*/${split_name}.no_oov > $cleantext
 
 # compute word counts, sort in descending order
 # line: count word
-cat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | sort | uniq -c | \
-  sort -nr > $dir/word.counts || exit 1;
+cat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | sort --parallel=`nproc` | uniq -c | \
+  sort --parallel=`nproc` -nr > $dir/word.counts || exit 1;
 
 # Get counts from acoustic training transcripts, and add one-count
 # for each word in the lexicon (but not silence, we don't want it
 # in the LM-- we'll add it optionally later).
 cat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | \
   cat - <(grep -w -v '!SIL' $lexicon | awk '{print $1}') | \
-  sort | uniq -c | sort -nr > $dir/unigram.counts || exit 1;
+  sort --parallel=`nproc` | uniq -c | sort --parallel=`nproc` -nr > $dir/unigram.counts || exit 1;
 
 # word with <s> </s>
 cat $dir/unigram.counts | awk '{print $2}' | cat - <(echo "<s>"; echo "</s>") > $dir/wordlist
```
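This hunk is where the speedup in the PR title comes from: instead of piping the whole corpus through a single awk process, the text is split into $nj pieces, the OOV mapping runs as $nj parallel jobs under utils/run.pl, the pieces are concatenated afterwards, and the big sorts get `--parallel=`nproc``. A stripped-down sketch of the same split / parallel-map / merge pattern, with a hypothetical input file and a trivial tr filter standing in for the awk program:

```bash
#!/usr/bin/env bash
# Hedged sketch of the split / parallel-map / merge pattern used above.
# data/big.txt and the tr-based filter are placeholders, not files from the repo.
nj=8
text=data/big.txt
text_dir=$(dirname $text)
split_name=$(basename $text)

# 1. split the input into $nj pieces: $text_dir/split8/1..8/big.txt
./local/split_data.sh $text_dir $text $split_name $nj

# 2. run the per-line filter on every piece in parallel
utils/run.pl JOB=1:$nj $text_dir/split${nj}/JOB/${split_name}.upper.log \
  cat $text_dir/split${nj}/JOB/${split_name} \| tr 'a-z' 'A-Z' \
  \> $text_dir/split${nj}/JOB/${split_name}.upper

# 3. merge the per-piece outputs back into a single file
cat $text_dir/split${nj}/*/${split_name}.upper > $text_dir/${split_name}.upper
```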
speechx/examples/ngram/zh/local/split_data.sh
new file mode 100755
```bash
#!/usr/bin/env bash
set -eo pipefail

data=$1
scp=$2
split_name=$3
numsplit=$4

# save in $data/split{n}
# $scp to split
#

if [[ ! $numsplit -gt 0 ]]; then
    echo "Invalid num-split argument";
    exit 1;
fi

directories=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n; done)
scp_splits=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n/${split_name}; done)

# if this mkdir fails due to argument-list being too long, iterate.
if ! mkdir -p $directories >&/dev/null; then
    for n in `seq $numsplit`; do
        mkdir -p $data/split${numsplit}/$n
    done
fi

echo "utils/split_scp.pl $scp $scp_splits"
utils/split_scp.pl $scp $scp_splits
```
speechx/examples/text_lm/README.md
````diff
 # Text PreProcess for building ngram LM
 
-Output `text` file like this:
+## Input
+
+```
+data/
+|-- text
+```
+
+Input file is kaldi-style, which has `utt` at first column:
+
+```
+Y0000000000_--5llN02F84_S00000 怎么样这些日子住得还习惯吧
+Y0000000000_--5llN02F84_S00002 挺好的
+Y0000000000_--5llN02F84_S00003 对了美静这段日子经常不和我们一起用餐
+Y0000000000_--5llN02F84_S00004 是不是对我回来有什么想法啊
+Y0000000000_--5llN02F84_S00005 哪有的事啊
+Y0000000000_--5llN02F84_S00006 她这两天挺累的身体也不太舒服
+Y0000000000_--5llN02F84_S00007 我让她多睡一会那就好如果要是觉得不方便
+Y0000000000_--5llN02F84_S00009 我就搬出去住
+Y0000000000_--5llN02F84_S00010 你看你这个人你就是疑心太重
+Y0000000000_--5llN02F84_S00011 你现在多好一切都井然有序的
+```
+
+## Output
+
+```
+data/
+`-- text.tn
+```
+
+Output file like this:
 
 ```
 BAC009S0002W0122 而 对 楼市 成交 抑制 作用 最 大 的 限 购
 ...
````
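Since the input text is kaldi-style, with the utterance id in the first column and the transcript in the rest of the line, a quick way to eyeball either part before building the LM might be (a sketch, not part of this commit):

```bash
# Peek at the kaldi-style text file: ids in column 1, transcript afterwards.
awk '{print $1}' data/text | head     # utterance ids only
cut -d' ' -f2- data/text | head       # transcripts only
```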
speechx/examples/text_lm/local/data/chars.dic
new file mode 100644, diff collapsed (12638 added lines)
speechx/examples/text_lm/local/data/words.dic
new file mode 100644, diff collapsed (123691 added lines)
speechx/examples/text_lm/local/mmseg.py
new file mode 100755
```python
#!/usr/bin/env python3
# modify from https://sites.google.com/site/homepageoffuyanwei/Home/remarksandexcellentdiscussion/page-2


class Word:
    def __init__(self, text='', freq=0):
        self.text = text
        self.freq = freq
        self.length = len(text)


class Chunk:
    def __init__(self, w1, w2=None, w3=None):
        self.words = []
        self.words.append(w1)
        if w2:
            self.words.append(w2)
        if w3:
            self.words.append(w3)

    # Total length of the chunk
    def totalWordLength(self):
        length = 0
        for word in self.words:
            length += len(word.text)
        return length

    # Average word length
    def averageWordLength(self):
        return float(self.totalWordLength()) / float(len(self.words))

    # Standard deviation of word lengths
    def standardDeviation(self):
        average = self.averageWordLength()
        sum = 0.0
        for word in self.words:
            tmp = (len(word.text) - average)
            sum += float(tmp) * float(tmp)
        return sum

    # Degree of morphemic freedom (sum of word frequencies)
    def wordFrequency(self):
        sum = 0
        for word in self.words:
            sum += word.freq
        return sum


class ComplexCompare:
    def takeHightest(self, chunks, comparator):
        i = 1
        for j in range(1, len(chunks)):
            rlt = comparator(chunks[j], chunks[0])
            if rlt > 0:
                i = 0
            if rlt >= 0:
                chunks[i], chunks[j] = chunks[j], chunks[i]
                i += 1
        return chunks[0:i]

    # The next four functions are the four filtering rules of the mmseg
    # algorithm, i.e. the core of the algorithm.
    def mmFilter(self, chunks):
        def comparator(a, b):
            return a.totalWordLength() - b.totalWordLength()
        return self.takeHightest(chunks, comparator)

    def lawlFilter(self, chunks):
        def comparator(a, b):
            return a.averageWordLength() - b.averageWordLength()
        return self.takeHightest(chunks, comparator)

    def svmlFilter(self, chunks):
        def comparator(a, b):
            return b.standardDeviation() - a.standardDeviation()
        return self.takeHightest(chunks, comparator)

    def logFreqFilter(self, chunks):
        def comparator(a, b):
            return a.wordFrequency() - b.wordFrequency()
        return self.takeHightest(chunks, comparator)


# Load the word dictionary and the character dictionary
dictWord = {}
maxWordLength = 0


def loadDictChars(filepath):
    global maxWordLength
    fsock = open(filepath)
    for line in fsock:
        freq, word = line.split()
        word = word.strip()
        dictWord[word] = (len(word), int(freq))
        maxWordLength = len(word) if maxWordLength < len(word) else maxWordLength
    fsock.close()


def loadDictWords(filepath):
    global maxWordLength
    fsock = open(filepath)
    for line in fsock.readlines():
        word = line.strip()
        dictWord[word] = (len(word), 0)
        maxWordLength = len(word) if maxWordLength < len(word) else maxWordLength
    fsock.close()


# Check whether the given word is in the dictionary dictWord
def getDictWord(word):
    result = dictWord.get(word)
    if result:
        return Word(word, result[1])
    return None


# Load the dictionaries
def run():
    from os.path import join, dirname
    loadDictChars(join(dirname(__file__), 'data', 'chars.dic'))
    loadDictWords(join(dirname(__file__), 'data', 'words.dic'))


class Analysis:
    def __init__(self, text):
        self.text = text
        self.cacheSize = 3
        self.pos = 0
        self.textLength = len(self.text)
        self.cache = []
        self.cacheIndex = 0
        self.complexCompare = ComplexCompare()

        # Small trick: keep a cache; not sure how much it actually helps
        for i in range(self.cacheSize):
            self.cache.append([-1, Word()])

        # Make sure the dictionaries are loaded only once
        if not dictWord:
            run()

    def __iter__(self):
        while True:
            token = self.getNextToken()
            if token == None:
                raise StopIteration
            yield token

    def getNextChar(self):
        return self.text[self.pos]

    # Whether the character is a Chinese character (Chinese punctuation excluded)
    def isChineseChar(self, charater):
        return 0x4e00 <= ord(charater) < 0x9fa6

    # Whether the character is an ASCII character
    def isASCIIChar(self, ch):
        import string
        if ch in string.whitespace:
            return False
        if ch in string.punctuation:
            return False
        return ch in string.printable

    # Get the next segmented token
    def getNextToken(self):
        while self.pos < self.textLength:
            if self.isChineseChar(self.getNextChar()):
                token = self.getChineseWords()
            else:
                token = self.getASCIIWords() + '/'
            if len(token) > 0:
                return token
        return None

    # Cut out a non-Chinese token
    def getASCIIWords(self):
        # Skip pre-word whitespaces and punctuations
        # Skip Chinese/English punctuation and whitespace
        while self.pos < self.textLength:
            ch = self.getNextChar()
            if self.isASCIIChar(ch) or self.isChineseChar(ch):
                break
            self.pos += 1
        # Start position of the ASCII word
        start = self.pos

        # Find the end position of the ASCII word
        while self.pos < self.textLength:
            ch = self.getNextChar()
            if not self.isASCIIChar(ch):
                break
            self.pos += 1
        end = self.pos

        # Skip chinese word whitespaces and punctuations
        # Skip Chinese/English punctuation and whitespace
        while self.pos < self.textLength:
            ch = self.getNextChar()
            if self.isASCIIChar(ch) or self.isChineseChar(ch):
                break
            self.pos += 1

        # Return the ASCII word
        return self.text[start:end]

    # Cut out Chinese words, picking the best segmentation with the four filters above
    def getChineseWords(self):
        chunks = self.createChunks()
        if len(chunks) > 1:
            chunks = self.complexCompare.mmFilter(chunks)
        if len(chunks) > 1:
            chunks = self.complexCompare.lawlFilter(chunks)
        if len(chunks) > 1:
            chunks = self.complexCompare.svmlFilter(chunks)
        if len(chunks) > 1:
            chunks = self.complexCompare.logFreqFilter(chunks)
        if len(chunks) == 0:
            return ''

        # Only one segmentation remains at this point
        word = chunks[0].words
        token = ""
        length = 0
        for x in word:
            if x.length != -1:
                token += x.text + "/"
                length += len(x.text)
        self.pos += length
        return token

    # Triple loop enumerating candidate segmentations; this could also be done recursively
    def createChunks(self):
        chunks = []
        originalPos = self.pos
        words1 = self.getMatchChineseWords()

        for word1 in words1:
            self.pos += len(word1.text)
            if self.pos < self.textLength:
                words2 = self.getMatchChineseWords()
                for word2 in words2:
                    self.pos += len(word2.text)
                    if self.pos < self.textLength:
                        words3 = self.getMatchChineseWords()
                        for word3 in words3:
                            # print(word3.length, word3.text)
                            if word3.length == -1:
                                chunk = Chunk(word1, word2)
                                # print("Ture")
                            else:
                                chunk = Chunk(word1, word2, word3)
                            chunks.append(chunk)
                    elif self.pos == self.textLength:
                        chunks.append(Chunk(word1, word2))
                    self.pos -= len(word2.text)
            elif self.pos == self.textLength:
                chunks.append(Chunk(word1))
            self.pos -= len(word1.text)

        self.pos = originalPos
        return chunks

    # Forward maximum matching against the dictionary to cut Chinese text
    def getMatchChineseWords(self):
        # use cache, check it
        for i in range(self.cacheSize):
            if self.cache[i][0] == self.pos:
                return self.cache[i][1]

        originalPos = self.pos
        words = []
        index = 0
        while self.pos < self.textLength:
            if index >= maxWordLength:
                break
            if not self.isChineseChar(self.getNextChar()):
                break
            self.pos += 1
            index += 1

            text = self.text[originalPos:self.pos]
            word = getDictWord(text)
            if word:
                words.append(word)

        self.pos = originalPos
        # If nothing matched, insert an 'X' placeholder whose length is marked as -1
        if not words:
            word = Word()
            word.length = -1
            word.text = 'X'
            words.append(word)

        self.cache[self.cacheIndex] = (self.pos, words)
        self.cacheIndex += 1
        if self.cacheIndex >= self.cacheSize:
            self.cacheIndex = 0
        return words


if __name__ == "__main__":

    def cuttest(text):
        #cut =  Analysis(text)
        tmp = ""
        try:
            for word in iter(Analysis(text)):
                tmp += word
        except Exception as e:
            pass
        print(tmp)
        print("================================")

    cuttest(u"研究生命来源")
    cuttest(u"南京市长江大桥欢迎您")
    cuttest(u"请把手抬高一点儿")
    cuttest(u"长春市长春节致词。")
    cuttest(u"长春市长春药店。")
    cuttest(u"我的和服务必在明天做好。")
    cuttest(u"我发现有很多人喜欢他。")
    cuttest(u"我喜欢看电视剧大长今。")
    cuttest(u"半夜给拎起来陪看欧洲杯糊着两眼半晌没搞明白谁和谁踢。")
    cuttest(u"李智伟高高兴兴以及王晓薇出去玩,后来智伟和晓薇又单独去玩了。")
    cuttest(u"一次性交出去很多钱。 ")
    cuttest(u"这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
    cuttest(u"我不喜欢日本和服。")
    cuttest(u"雷猴回归人间。")
    cuttest(u"工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest(u"我需要廉租房")
    cuttest(u"永和服装饰品有限公司")
    cuttest(u"我爱北京天安门")
    cuttest(u"abc")
    cuttest(u"隐马尔可夫")
    cuttest(u"雷猴是个好网站")
    cuttest(u"“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
    cuttest(u"草泥马和欺实马是今年的流行词汇")
    cuttest(u"伊藤洋华堂总府店")
    cuttest(u"中国科学院计算技术研究所")
    cuttest(u"罗密欧与朱丽叶")
    cuttest(u"我购买了道具和服装")
    cuttest(u"PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
    cuttest(u"湖北省石首市")
    cuttest(u"总经理完成了这件事情")
    cuttest(u"电脑修好了")
    cuttest(u"做好了这件事情就一了百了了")
    cuttest(u"人们审美的观点是不同的")
    cuttest(u"我们买了一个美的空调")
    cuttest(u"线程初始化时我们要注意")
    cuttest(u"一个分子是由好多原子组织成的")
    cuttest(u"祝你马到功成")
    cuttest(u"他掉进了无底洞里")
    cuttest(u"中国的首都是北京")
    cuttest(u"孙君意")
    cuttest(u"外交部发言人马朝旭")
    cuttest(u"领导人会议和第四届东亚峰会")
    cuttest(u"在过去的这五年")
    cuttest(u"还需要很长的路要走")
    cuttest(u"60周年首都阅兵")
    cuttest(u"你好人们审美的观点是不同的")
    cuttest(u"买水果然后来世博园")
    cuttest(u"买水果然后去世博园")
    cuttest(u"但是后来我才知道你是对的")
    cuttest(u"存在即合理")
    cuttest(u"的的的的的在的的的的就以和和和")
    cuttest(u"I love你,不以为耻,反以为rong")
    cuttest(u" ")
    cuttest(u"")
    cuttest(u"hello你好人们审美的观点是不同的")
    cuttest(u"很好但主要是基于网页形式")
    cuttest(u"hello你好人们审美的观点是不同的")
    cuttest(u"为什么我不能拥有想要的生活")
    cuttest(u"后来我才")
    cuttest(u"此次来中国是为了")
    cuttest(u"使用了它就可以解决一些问题")
    cuttest(u",使用了它就可以解决一些问题")
    cuttest(u"其实使用了它就可以解决一些问题")
    cuttest(u"好人使用了它就可以解决一些问题")
    cuttest(u"是因为和国家")
    cuttest(u"老年搜索还支持")
    cuttest(
        u"干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 "
    )
    cuttest("2022年12月30日是星期几?")
    cuttest("二零二二年十二月三十日是星期几?")
```
\ No newline at end of file
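mmseg.py is not wired into a CLI here; its __main__ block simply runs the cuttest() demos against the bundled chars.dic and words.dic. A quick smoke test after checkout could therefore be (a sketch, assuming the paths added in this commit):

```bash
# Runs the built-in cuttest() demos. chars.dic and words.dic are resolved
# relative to the script via dirname(__file__), so the working directory
# does not matter.
python3 speechx/examples/text_lm/local/mmseg.py
```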