magicwindyyd / mindspore (forked from MindSpore / mindspore, in sync with the fork source)
Commit da8e095b
Authored April 30, 2020 by xulei2020

add jieba c++ code

Parent: 93e7c97a
Showing 2 changed files with 12 additions and 9 deletions.

cmake/external_libs/cppjieba.cmake (+2 -2)
mindspore/dataset/transforms/text/c_transforms.py (+10 -7)
cmake/external_libs/cppjieba.cmake @ da8e095b

@@ -3,8 +3,8 @@ set(cppjieba_CFLAGS "-D_FORTIFY_SOURCE=2 -O2")
 mindspore_add_pkg(cppjieba
         VER 5.0.3
         HEAD_ONLY ./
-        URL https://codeload.github.com/yanyiwu/cppjieba/zip/v5.0.3
-        MD5 0dfef44bd32328c221f128b401e1a45c
+        URL https://codeload.github.com/yanyiwu/cppjieba/tar.gz/v5.0.3
+        MD5 b8b3f7a73032c9ce9daafa4f67196c8c
         PATCHES ${CMAKE_SOURCE_DIR}/third_party/patch/cppjieba/cppjieba.patch001)
 include_directories(${cppjieba_INC}include)
 include_directories(${cppjieba_INC}deps)
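The hunk above changes the archive format from zip to tar.gz and updates the MD5 value to match, since the two archives differ byte for byte. As a rough illustration of what such a checksum pin protects against, here is a minimal sketch of an MD5 comparison in Python. It is not how `mindspore_add_pkg` performs its check (that macro is not shown on this page); the helper names `md5_of_file` and `archive_matches` and the local file name in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path, expected_md5):
    """Compare a file's MD5 with the expected value, ignoring letter case."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage with the value pinned in the hunk above
# (assumes the tar.gz was saved locally under this made-up name):
# archive_matches("cppjieba-v5.0.3.tar.gz", "b8b3f7a73032c9ce9daafa4f67196c8c")
```

A mismatch would indicate that the downloaded bytes are not the ones the build file was written against, which is why the URL and the MD5 have to change together.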
mindspore/dataset/transforms/text/c_transforms.py @ da8e095b

@@ -33,9 +33,13 @@ class JiebaTokenizer(cde.JiebaTokenizerOp):
     Tokenize Chinese string into words based on dictionary.
     Args:
-        mode (Enum): [Default "MIX"], "MP" model will tokenize with MPSegment algorithm, "HMM" mode will
-            tokenize with Hiddel Markov Model Segment algorithm, "MIX" model will tokenize with a mix of MPSegment and
-            HMMSegment algorithm.
+        hmm_path (str): the dictionary file is used by HMMSegment algorithm,
+            the dictionary can be obtained on the official website of cppjieba.
+        mp_path(str): the dictionary file is used by MPSegment algorithm,
+            the dictionary can be obtained on the official website of cppjieba.
+        mode (Enum): [Default "MIX"], "MP" model will tokenize with MPSegment algorithm,
+            "HMM" mode will tokenize with Hiddel Markov Model Segment algorithm,
+            "MIX" model will tokenize with a mix of MPSegment and HMMSegment algorithm.
     """
     @check_jieba_init
     def __init__(self, hmm_path, mp_path, mode=JiebaMode.MIX):
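The docstring in this hunk names an "MP" mode backed by a dictionary file. Reading MP as maximum probability, the general idea can be sketched as a dynamic program that picks the split with the highest total log-probability under a word-frequency dictionary. This is only an illustration of that idea, not cppjieba's MPSegment code; the function name `mp_segment`, the toy frequencies, and the fallback score for unknown characters are all assumptions, and the HMM part is not shown.

```python
import math


def mp_segment(text, freqs):
    """Split text into the word sequence with the highest total log-probability.

    freqs maps word -> positive count. Characters not covered by any
    dictionary word fall back to single-character tokens with a low score.
    """
    total = sum(freqs.values())
    max_len = max((len(word) for word in freqs), default=1)
    # Assumed smoothing for characters that are missing from the dictionary.
    unknown_logp = math.log(1.0 / (total + 1))

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freqs:
                logp = math.log(freqs[word] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


toy_freqs = {"北京": 50, "大学": 40, "北京大学": 30, "欢迎": 20}
print(mp_segment("北京大学欢迎", toy_freqs))  # ['北京大学', '欢迎']
```

With these toy counts the single long word wins over two shorter ones because one factor of 30/140 is larger than the product of 50/140 and 40/140.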
@@ -52,9 +56,8 @@ class JiebaTokenizer(cde.JiebaTokenizerOp):
         Args:
             word(required, string): The word to be added to the JiebaTokenizer instance.
                 The added word will not be written into the built-in dictionary on disk.
-            freq(optional, int): The frequency of the word to be added,
-                The higher the frequency, the better change the word will be tokenized(default None,
-                use default frequency)
+            freq(optional, int): The frequency of the word to be added, The higher the frequency,
+                the better change the word will be tokenized(default None, use default frequency).
         """
         if freq is None:
             super().add_word(word, 0)
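The context lines of this hunk show a Python-side `None` default being translated to `0` before the call reaches the base class. A minimal sketch of that pattern follows, using a hypothetical recording class in place of the compiled `cde.JiebaTokenizerOp`. The `else` branch is not visible in the hunk, so forwarding the explicit value is an assumption, as is the reading that `0` tells the lower layer to use its default frequency.

```python
class RecordingTokenizerOp:
    """Hypothetical stand-in for the compiled base class; it only records calls."""

    def __init__(self):
        self.calls = []

    def add_word(self, word, freq):
        self.calls.append((word, freq))


class SketchTokenizer(RecordingTokenizerOp):
    """Illustrative wrapper, not the real JiebaTokenizer."""

    def add_word(self, word, freq=None):
        if freq is None:
            # As in the hunk: None on the Python side becomes 0 for the base class.
            super().add_word(word, 0)
        else:
            # Assumption: the branch elided from the hunk forwards the value as given.
            super().add_word(word, freq)


tokenizer = SketchTokenizer()
tokenizer.add_word("word1")
tokenizer.add_word("word2", 10)
print(tokenizer.calls)  # [('word1', 0), ('word2', 10)]
```

Keeping `None` as the public default lets callers omit the argument while the lower layer still receives a plain integer.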
@@ -67,7 +70,7 @@ class JiebaTokenizer(cde.JiebaTokenizerOp):
         Add user defined word to JiebaTokenizer's dictionary
         Args:
             user_dict(path/dict):Dictionary to be added, file path or Python dictionary,
-            Python Dict format is {word1:freq1, word2:freq2,...}
+            Python Dict format: {word1:freq1, word2:freq2,...}
             Jieba dictionary format : word(required), freq(optional), such as:
                 word1 freq1
                 word2
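The docstring in this hunk describes two accepted inputs: a Python dict of `{word: freq}` and a text file with one `word freq` entry per line, where `freq` may be omitted. The sketch below shows one way both forms could be normalized into a single mapping. It is not the library's own `add_dict` logic, which is not shown on this page; the helper names, whitespace-separated fields, and UTF-8 encoding are assumptions.

```python
def parse_jieba_dict_lines(lines):
    """Parse 'word [freq]' lines into {word: freq or None}."""
    entries = {}
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # freq is optional; None marks "no frequency given".
        freq = int(parts[1]) if len(parts) > 1 else None
        entries[word] = freq
    return entries


def normalize_user_dict(user_dict):
    """Accept either a Python dict or a path to a dictionary file."""
    if isinstance(user_dict, dict):
        return dict(user_dict)
    with open(user_dict, encoding="utf-8") as handle:
        return parse_jieba_dict_lines(handle)


print(parse_jieba_dict_lines(["word1 3", "word2"]))  # {'word1': 3, 'word2': None}
```

Entries without a frequency are kept as `None` here so that a caller can decide separately how to treat the missing value.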