曾经的那一瞬间 / Models
Commit 558bab5d
Authored by Chen Chen on Dec 09, 2019
Committed by A. Unique TensorFlower on Dec 09, 2019

Add sentence piece tokenizer in tokenization.py

PiperOrigin-RevId: 284624714

Parent: 9cae3c4f
Showing 1 changed file with 131 additions and 1 deletion (+131, -1)
official/nlp/bert/tokenization.py (+131, -1)
# coding=utf-8
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
...

@@ -29,6 +30,10 @@ import unicodedata

import six
import tensorflow as tf

import sentencepiece as spm

SPIECE_UNDERLINE = u"▁".encode("utf-8")


def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
  """Checks whether the casing config is consistent with the checkpoint name."""
...

@@ -366,7 +371,7 @@ class WordpieceTokenizer(object):

def _is_whitespace(char):
  """Checks whether `chars` is a whitespace character."""
-  # \t, \n, and \r are technically contorl characters but we treat them
+  # \t, \n, and \r are technically control characters but we treat them
  # as whitespace since they are generally considered as such.
  if char == " " or char == "\t" or char == "\n" or char == "\r":
    return True

...
@@ -402,3 +407,128 @@ def _is_punctuation(char):
  if cat.startswith("P"):
    return True
  return False


def preprocess_text(inputs, remove_space=True, lower=False):
  """Preprocesses data by removing extra spaces and normalizing it.

  This method is used together with the sentence piece tokenizer and is forked
  from:
  https://github.com/google-research/google-research/blob/master/albert/tokenization.py

  Args:
    inputs: The input text.
    remove_space: Whether to remove the extra spaces.
    lower: Whether to lowercase the text.

  Returns:
    The preprocessed text.
  """
  outputs = inputs
  if remove_space:
    outputs = " ".join(inputs.strip().split())
  if six.PY2 and isinstance(outputs, str):
    try:
      outputs = six.ensure_text(outputs, "utf-8")
    except UnicodeDecodeError:
      outputs = six.ensure_text(outputs, "latin-1")

  outputs = unicodedata.normalize("NFKD", outputs)
  outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
  if lower:
    outputs = outputs.lower()

  return outputs
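The preprocessing above depends on `six` only for a Python 2 decoding branch; under Python 3 the same logic can be sketched with the standard library alone. This is an illustrative re-implementation, not the module's code:

```python
import unicodedata


def preprocess_text(inputs, remove_space=True, lower=False):
    """Python 3-only sketch of the preprocessing in the diff above."""
    outputs = inputs
    if remove_space:
        # Collapse runs of whitespace into single spaces.
        outputs = " ".join(inputs.strip().split())
    # NFKD-decompose, then drop combining marks (this strips accents).
    outputs = unicodedata.normalize("NFKD", outputs)
    outputs = "".join(c for c in outputs if not unicodedata.combining(c))
    if lower:
        outputs = outputs.lower()
    return outputs


print(preprocess_text("  Héllo   Wörld ", lower=True))  # -> hello world
```

Note that NFKD plus `unicodedata.combining` removes diacritics ("é" becomes "e"), which is why the ALBERT fork applies it before segmentation.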


def encode_pieces(sp_model, text, sample=False):
  """Segments text into pieces.

  This method is used together with the sentence piece tokenizer and is forked
  from:
  https://github.com/google-research/google-research/blob/master/albert/tokenization.py

  Args:
    sp_model: A spm.SentencePieceProcessor object.
    text: The input text to be segmented.
    sample: Whether to randomly sample a segmentation output or return a
      deterministic one.

  Returns:
    A list of token pieces.
  """
  if not sample:
    pieces = sp_model.EncodeAsPieces(text)
  else:
    pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
  new_pieces = []
  for piece in pieces:
    piece = printable_text(piece)
    if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
      cur_pieces = sp_model.EncodeAsPieces(
          six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
      if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
        if len(cur_pieces[0]) == 1:
          cur_pieces = cur_pieces[1:]
        else:
          cur_pieces[0] = cur_pieces[0][1:]
      cur_pieces.append(piece[-1])
      new_pieces.extend(cur_pieces)
    else:
      new_pieces.append(piece)

  return new_pieces


def encode_ids(sp_model, text, sample=False):
  """Segments text and returns token ids.

  This method is used together with the sentence piece tokenizer and is forked
  from:
  https://github.com/google-research/google-research/blob/master/albert/tokenization.py

  Args:
    sp_model: A spm.SentencePieceProcessor object.
    text: The input text to be segmented.
    sample: Whether to randomly sample a segmentation output or return a
      deterministic one.

  Returns:
    A list of token ids.
  """
  pieces = encode_pieces(sp_model, text, sample=sample)
  ids = [sp_model.PieceToId(piece) for piece in pieces]
  return ids
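Without a trained model file it can still be useful to see how these helpers drive the SentencePiece interface. The sketch below uses a hand-written stand-in object; `FakeSpModel`, its whitespace segmentation, and the toy vocabulary are invented for illustration, while a real `spm.SentencePieceProcessor` would be loaded from a trained `.model` file:

```python
class FakeSpModel:
    """Stand-in for spm.SentencePieceProcessor (illustration only)."""

    def __init__(self, vocab):
        self.vocab = vocab  # piece -> id mapping

    def EncodeAsPieces(self, text):
        # Real SentencePiece learns subword units from data; this fake just
        # splits on whitespace and prepends the "▁" word-boundary marker.
        return ["▁" + word for word in text.split()]

    def PieceToId(self, piece):
        return self.vocab.get(piece, 0)  # 0 stands in for the <unk> id


sp = FakeSpModel({"▁hello": 7, "▁world": 8})
pieces = sp.EncodeAsPieces("hello world")  # ['▁hello', '▁world']
ids = [sp.PieceToId(p) for p in pieces]    # [7, 8]; unknown pieces map to 0
```

The piece-to-id loop at the end mirrors what `encode_ids` does after `encode_pieces` has cleaned up the segmentation.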


class FullSentencePieceTokenizer(object):
  """Runs end-to-end sentence piece tokenization.

  The interface of this class is intended to be kept the same as that of the
  `FullTokenizer` class above for easier usage.
  """

  def __init__(self, sp_model_file):
    """Inits FullSentencePieceTokenizer.

    Args:
      sp_model_file: The path to the sentence piece model file.
    """
    self._sp_model = spm.SentencePieceProcessor()
    self._sp_model.Load(sp_model_file)
    self.vocab = {
        self._sp_model.IdToPiece(i): i
        for i in six.moves.range(self._sp_model.GetPieceSize())
    }

  def tokenize(self, text):
    """Tokenizes text into pieces."""
    return encode_pieces(self._sp_model, text)

  def convert_tokens_to_ids(self, tokens):
    """Converts a list of tokens to a list of ids."""
    return [self._sp_model.PieceToId(printable_text(token)) for token in tokens]

  def convert_ids_to_tokens(self, ids):
    """Converts a list of ids to a list of tokens."""
    return [self._sp_model.IdToPiece(id_) for id_ in ids]
登录