PaddlePaddle / PaddleHub
Commit c915936f (unverified)

Upgrade embedding modules (#1230)

* do finetune text cls & ner via using embedding module

Authored by KP on Feb 24, 2021; committed via GitHub on Feb 24, 2021. Parent commit: af009a04
Showing 122 changed files with 1,826 additions and 1,685 deletions.
Every changed module directory under modules/text/embedding/ receives the same pair of edits: README.md (+22 −0) and module.py (+8 −28), except where noted.

modules/text/embedding/
- fasttext_crawl_target_word-word_dim300_en
- fasttext_wiki-news_target_word-word_dim300_en
- glove_twitter_target_word-word_dim100_en
- glove_twitter_target_word-word_dim200_en
- glove_twitter_target_word-word_dim25_en
- glove_twitter_target_word-word_dim50_en
- glove_wiki2014-gigaword_target_word-word_dim100_en
- glove_wiki2014-gigaword_target_word-word_dim200_en
- glove_wiki2014-gigaword_target_word-word_dim300_en
- glove_wiki2014-gigaword_target_word-word_dim50_en
- w2v_baidu_encyclopedia_context_word-character_char1-1_dim300
- w2v_baidu_encyclopedia_context_word-character_char1-2_dim300
- w2v_baidu_encyclopedia_context_word-character_char1-4_dim300
- w2v_baidu_encyclopedia_context_word-ngram_1-2_dim300
- w2v_baidu_encyclopedia_context_word-ngram_1-3_dim300
- w2v_baidu_encyclopedia_context_word-ngram_2-2_dim300
- w2v_baidu_encyclopedia_context_word-wordLR_dim300
- w2v_baidu_encyclopedia_context_word-wordPosition_dim300
- w2v_baidu_encyclopedia_context_word-word_dim300
- w2v_baidu_encyclopedia_target_bigram-char_dim300
- w2v_baidu_encyclopedia_target_word-character_char1-1_dim300
- w2v_baidu_encyclopedia_target_word-character_char1-2_dim300
- w2v_baidu_encyclopedia_target_word-character_char1-4_dim300
- w2v_baidu_encyclopedia_target_word-ngram_1-2_dim300
- w2v_baidu_encyclopedia_target_word-ngram_1-3_dim300
- w2v_baidu_encyclopedia_target_word-ngram_2-2_dim300
- w2v_baidu_encyclopedia_target_word-wordLR_dim300
- w2v_baidu_encyclopedia_target_word-wordPosition_dim300
- w2v_baidu_encyclopedia_target_word-word_dim300 (module.py: +4 −5)
- w2v_financial_target_bigram-char_dim300
- w2v_financial_target_word-bigram_dim300
- w2v_financial_target_word-char_dim300
- w2v_financial_target_word-word_dim300
- w2v_literature_target_bigram-char_dim300
- w2v_literature_target_word-bigram_dim300
- w2v_literature_target_word-char_dim300
- w2v_literature_target_word-word_dim300
- w2v_mixed-large_target_word-char_dim300
- w2v_mixed-large_target_word-word_dim300
- w2v_people_daily_target_bigram-char_dim300
- w2v_people_daily_target_word-bigram_dim300
- w2v_people_daily_target_word-char_dim300
- w2v_people_daily_target_word-word_dim300
- w2v_sikuquanshu_target_word-bigram_dim300
- w2v_sikuquanshu_target_word-word_dim300
- w2v_sogou_target_bigram-char_dim300
- w2v_sogou_target_word-bigram_dim300
- w2v_sogou_target_word-char_dim300
- w2v_sogou_target_word-word_dim300
- w2v_weibo_target_bigram-char_dim300
- w2v_weibo_target_word-bigram_dim300
- w2v_weibo_target_word-char_dim300
- w2v_weibo_target_word-word_dim300
- w2v_wiki_target_bigram-char_dim300
- w2v_wiki_target_word-bigram_dim300
- w2v_wiki_target_word-char_dim300
- w2v_wiki_target_word-word_dim300
- w2v_zhihu_target_bigram-char_dim300
- w2v_zhihu_target_word-bigram_dim300
- w2v_zhihu_target_word-char_dim300
- w2v_zhihu_target_word-word_dim300
modules/text/embedding/fasttext_crawl_target_word-word_dim300_en/README.md

````diff
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a, for which the dot product is computed.
 * `word_b`: word b, for which the dot product is computed.
 
+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model. Returns a JiebaTokenizer instance; currently only Chinese embedding models are supported.
+
+**Parameters**
+
+* `*args`: extra positional arguments.
+* `**kwargs`: extra keyword arguments.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py).
+For more API details and usage, see [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings).
+
 ## Code example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
 
   Initial release
+
+* 1.0.1
+  Support embedding-based text classification and sequence labeling fine-tune tasks
````
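The documentation added above can be exercised directly. Below is a minimal sketch of the two newly documented methods, assuming the module has already been installed (e.g. `hub install w2v_baidu_encyclopedia_target_word-word_dim300`); a Chinese embedding module is used because `get_tokenizer` currently only supports Chinese models:

```python
import paddlehub as hub

# Load an embedding module by name (downloads on first use).
embedding = hub.Module(name="w2v_baidu_encyclopedia_target_word-word_dim300")

# Path of the local vocabulary file shipped with the embedding.
print(embedding.get_vocab_path())

# A paddlenlp JiebaTokenizer instance built on that vocabulary.
tokenizer = embedding.get_tokenizer()
print(tokenizer.cut("今天天气真好"))  # jieba word segmentation
```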
modules/text/embedding/fasttext_crawl_target_word-word_dim300_en/module.py

```diff
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule
 
 
 @moduleinfo(
     name="fasttext_crawl_target_word-word_dim300_en",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "fasttext.crawl.target.word-word.dim300.en"
+
     def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="fasttext.crawl.target.word-word.dim300.en", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(
-                    f'The types of text pair must be (str, str), but got'
-                    f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
```
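The removed `calc_similarity` endpoint is not lost: the new `meta=EmbeddingModule` hook suggests the shared behavior, word-pair similarity serving included, is hoisted into `paddlehub.module.nlp_module.EmbeddingModule`, so each module now only declares its `embedding_name` instead of duplicating the method 61 times. A client-side sketch against `hub serving start -m fasttext_crawl_target_word-word_dim300_en`, following PaddleHub's usual serving conventions (the default port and the payload shape here are assumptions, not confirmed by this diff):

```python
import json

import requests

# Word pairs to score; the served endpoint is named after the module.
data = {"data": [["happy", "glad"], ["king", "queen"]]}
url = "http://127.0.0.1:8866/predict/fasttext_crawl_target_word-word_dim300_en"
headers = {"Content-Type": "application/json"}

r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())  # one cosine similarity per word pair
```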
modules/text/embedding/fasttext_wiki-news_target_word-word_dim300_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/fasttext_wiki-news_target_word-word_dim300_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"fasttext_wiki-news_target_word-word_dim300_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"fasttext.wiki-news.target.word-word.dim300.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"fasttext.wiki-news.target.word-word.dim300.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_twitter_target_word-word_dim100_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_twitter_target_word-word_dim100_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_twitter_target_word-word_dim100_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.twitter.target.word-word.dim100.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.twitter.target.word-word.dim100.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_twitter_target_word-word_dim200_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_twitter_target_word-word_dim200_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_twitter_target_word-word_dim200_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.twitter.target.word-word.dim200.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.twitter.target.word-word.dim200.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_twitter_target_word-word_dim25_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_twitter_target_word-word_dim25_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_twitter_target_word-word_dim25_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.twitter.target.word-word.dim25.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.twitter.target.word-word.dim25.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_twitter_target_word-word_dim50_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_twitter_target_word-word_dim50_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_twitter_target_word-word_dim50_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.twitter.target.word-word.dim50.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.twitter.target.word-word.dim50.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim100_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim100_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_wiki2014-gigaword_target_word-word_dim100_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim100.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim100.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim200_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim200_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_wiki2014-gigaword_target_word-word_dim200_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim200.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim200.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim300_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim300_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_wiki2014-gigaword_target_word-word_dim300_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim300.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim300.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim50_en/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/glove_wiki2014-gigaword_target_word-word_dim50_en/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"glove_wiki2014-gigaword_target_word-word_dim50_en"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim50.en"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"glove.wiki2014-gigaword.target.word-word.dim50.en"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-1_dim300/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-1_dim300/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"w2v_baidu_encyclopedia_context_word-character_char1-1_dim300"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-1.dim300"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-1.dim300"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-2_dim300/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-2_dim300/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"w2v_baidu_encyclopedia_context_word-character_char1-2_dim300"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-2.dim300"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-2.dim300"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-4_dim300/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/w2v_baidu_encyclopedia_context_word-character_char1-4_dim300/module.py
浏览文件 @
c915936f
...
...
@@ -12,44 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from
typing
import
List
from
paddlenlp.embeddings
import
TokenEmbedding
from
paddlehub.module.module
import
moduleinfo
,
serving
from
paddlehub.module.module
import
moduleinfo
from
paddlehub.module.nlp_module
import
EmbeddingModule
@
moduleinfo
(
name
=
"w2v_baidu_encyclopedia_context_word-character_char1-4_dim300"
,
version
=
"1.0.
0
"
,
version
=
"1.0.
1
"
,
summary
=
""
,
author
=
"paddlepaddle"
,
author_email
=
""
,
type
=
"nlp/semantic_model"
)
type
=
"nlp/semantic_model"
,
meta
=
EmbeddingModule
)
class
Embedding
(
TokenEmbedding
):
"""
Embedding model
"""
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-4.dim300"
,
*
args
,
**
kwargs
)
@
serving
def
calc_similarity
(
self
,
data
:
List
[
List
[
str
]]):
"""
Calculate similarities of giving word pairs.
"""
results
=
[]
for
word_pair
in
data
:
if
len
(
word_pair
)
!=
2
:
raise
RuntimeError
(
f
'The input must have two words, but got
{
len
(
word_pair
)
}
. Please check your inputs.'
)
if
not
isinstance
(
word_pair
[
0
],
str
)
or
not
isinstance
(
word_pair
[
1
],
str
):
raise
RuntimeError
(
f
'The types of text pair must be (str, str), but got'
f
' (
{
type
(
word_pair
[
0
]).
__name__
}
,
{
type
(
word_pair
[
1
]).
__name__
}
). Please check your inputs.'
)
embedding_name
=
"w2v.baidu_encyclopedia.context.word-character.char1-4.dim300"
for
word
in
word_pair
:
if
self
.
get_idx_from_word
(
word
)
==
\
self
.
get_idx_from_word
(
self
.
vocab
.
unk_token
):
raise
RuntimeError
(
f
'Word "
{
word
}
" is not in vocab. Please check your inputs.'
)
results
.
append
(
str
(
self
.
cosine_sim
(
*
word_pair
)))
return
results
def
__init__
(
self
,
*
args
,
**
kwargs
):
super
(
Embedding
,
self
).
__init__
(
embedding_name
=
self
.
embedding_name
,
*
args
,
**
kwargs
)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_1-2_dim300/README.md
浏览文件 @
c915936f
...
...
@@ -56,6 +56,25 @@ def dot(
*
`word_a`
: 需要计算内积的单词a。
*
`word_b`
: 需要计算内积的单词b。
```
python
def
get_vocab_path
()
```
获取本地词表文件的路径信息。
```
python
def
get_tokenizer
(
*
args
,
**
kwargs
)
```
获取当前模型的tokenizer,返回一个JiebaTokenizer的实例,当前只支持中文embedding模型。
**参数**
*
`*args`
: 额外传递的列表形式的参数。
*
`**kwargs`
: 额外传递的字典形式的参数。
关于额外参数的详情,可查看
[
paddlenlp.data.tokenizer.JiebaTokenizer
](
https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py
)
更多api详情和用法可参考
[
paddlenlp.embeddings
](
https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings
)
## 代码示例
...
...
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
初始发布
*
1.0.1
支持基于embedding的文本分类和序列标注finetune任务
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_1-2_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-ngram_1-2_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_1-3_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_1-3_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-ngram_1-3_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_2-2_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_context_word-ngram_2-2_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-ngram_2-2_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-wordLR_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_context_word-wordLR_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-wordLR_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-wordLR.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-wordLR.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-wordPosition_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_context_word-wordPosition_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-wordPosition_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-wordPosition.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-wordPosition.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_context_word-word_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_context_word-word_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_context_word-word_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.context.word-word.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.context.word-word.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_bigram-char_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_bigram-char_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_bigram-char_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.bigram-char.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.bigram-char.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-1_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-1_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-character_char1-1_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-character.char1-1.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-character.char1-1.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-2_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-2_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-character_char1-2_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-character.char1-2.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-character.char1-2.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-4_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-character_char1-4_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-character_char1-4_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-character.char1-4.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-character.char1-4.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_1-2_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_1-2_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-ngram_1-2_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_1-3_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_1-3_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-ngram_1-3_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_2-2_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-ngram_2-2_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-ngram_2-2_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-wordLR_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-wordLR_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-wordLR_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-wordLR.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-wordLR.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-wordPosition_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_baidu_encyclopedia_target_word-wordPosition_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-wordPosition_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.baidu_encyclopedia.target.word-wordPosition.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.baidu_encyclopedia.target.word-wordPosition.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_baidu_encyclopedia_target_word-word_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
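The 1.0.1 changelog entry above is the point of this commit. As a rough, hypothetical illustration only (a hand-rolled bag-of-words classifier, not this repo's official finetune API; `get_tokenizer`, `get_idx_from_word`, and calling the module as a `paddle.nn.Embedding` are the grounded pieces), an embedding module could back a small text classifier like this:

```python
import paddle
import paddlehub as hub

embedder = hub.Module(name="w2v_baidu_encyclopedia_target_word-word_dim300")
tokenizer = embedder.get_tokenizer()

class BoWClassifier(paddle.nn.Layer):
    """Mean-pooled word vectors followed by a linear layer (illustrative only)."""

    def __init__(self, embedder, num_classes=2):
        super().__init__()
        self.embedder = embedder  # TokenEmbedding subclasses paddle.nn.Embedding
        self.fc = paddle.nn.Linear(embedder.weight.shape[1], num_classes)

    def forward(self, token_ids):
        embedded = self.embedder(token_ids)     # [batch, seq_len, emb_dim]
        pooled = paddle.mean(embedded, axis=1)  # bag-of-words pooling
        return self.fc(pooled)

words = tokenizer.cut("这家餐厅的菜很好吃")
ids = paddle.to_tensor([[embedder.get_idx_from_word(w) for w in words]])
logits = BoWClassifier(embedder)(ids)
```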
modules/text/embedding/w2v_baidu_encyclopedia_target_word-word_dim300/module.py
@@ -12,15 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_baidu_encyclopedia_target_word-word_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
@@ -30,7 +29,7 @@ class Embedding(TokenEmbedding):
     """
     Embedding model
     """
-    embedding_name = 'w2v.baidu_encyclopedia.target.word-word.dim300'
+    embedding_name = "w2v.baidu_encyclopedia.target.word-word.dim300"

     def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
\ No newline at end of file
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_financial_target_bigram-char_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_financial_target_bigram-char_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_financial_target_bigram-char_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.financial.target.bigram-char.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.financial.target.bigram-char.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_financial_target_word-bigram_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_financial_target_word-bigram_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_financial_target_word-bigram_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.financial.target.word-bigram.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.financial.target.word-bigram.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_financial_target_word-char_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_financial_target_word-char_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_financial_target_word-char_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.financial.target.word-char.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.financial.target.word-char.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_financial_target_word-word_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_financial_target_word-word_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import List
-
 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_financial_target_word-word_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
+    embedding_name = "w2v.financial.target.word-word.dim300"

-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.financial.target.word-word.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of giving word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
modules/text/embedding/w2v_literature_target_bigram-char_dim300/README.md
@@ -56,6 +56,25 @@ def dot(
 * `word_a`: word a used to compute the dot product.
 * `word_b`: word b used to compute the dot product.

+```python
+def get_vocab_path()
+```
+
+Gets the path of the local vocabulary file.
+
+```python
+def get_tokenizer(*args, **kwargs)
+```
+
+Gets the tokenizer of the current model; returns a JiebaTokenizer instance. Only Chinese embedding models are supported at present.
+
+**Parameters**
+* `*args`: extra positional (list-form) arguments to pass through.
+* `**kwargs`: extra keyword (dict-form) arguments to pass through.
+
+For details on the extra arguments, see [paddlenlp.data.tokenizer.JiebaTokenizer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/paddlenlp/data/tokenizer.py)
+
 For more API details and usage, refer to [paddlenlp.embeddings](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/paddlenlp/embeddings)

 ## Code Example
@@ -125,3 +144,6 @@ paddlehub >= 2.0.0
   Initial release
+
+* 1.0.1
+
+  Support for embedding-based text classification and sequence labeling finetune tasks
modules/text/embedding/w2v_literature_target_bigram-char_dim300/module.py
@@ -12,44 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import List

 from paddlenlp.embeddings import TokenEmbedding
-from paddlehub.module.module import moduleinfo, serving
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import EmbeddingModule


 @moduleinfo(
     name="w2v_literature_target_bigram-char_dim300",
-    version="1.0.0",
+    version="1.0.1",
     summary="",
     author="paddlepaddle",
     author_email="",
-    type="nlp/semantic_model")
+    type="nlp/semantic_model",
+    meta=EmbeddingModule)
 class Embedding(TokenEmbedding):
     """
     Embedding model
     """
-    def __init__(self, *args, **kwargs):
-        super(Embedding, self).__init__(embedding_name="w2v.literature.target.bigram-char.dim300", *args, **kwargs)
-
-    @serving
-    def calc_similarity(self, data: List[List[str]]):
-        """
-        Calculate similarities of given word pairs.
-        """
-        results = []
-        for word_pair in data:
-            if len(word_pair) != 2:
-                raise RuntimeError(f'The input must have two words, but got {len(word_pair)}. Please check your inputs.')
-            if not isinstance(word_pair[0], str) or not isinstance(word_pair[1], str):
-                raise RuntimeError(f'The types of text pair must be (str, str), but got'
-                                   f' ({type(word_pair[0]).__name__}, {type(word_pair[1]).__name__}). Please check your inputs.')
-            for word in word_pair:
-                if self.get_idx_from_word(word) == \
-                        self.get_idx_from_word(self.vocab.unk_token):
-                    raise RuntimeError(f'Word "{word}" is not in vocab. Please check your inputs.')
-            results.append(str(self.cosine_sim(*word_pair)))
-        return results
+    embedding_name = "w2v.literature.target.bigram-char.dim300"
+
+    def __init__(self, *args, **kwargs):
+        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
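Read end to end, the upgraded file is short. The following reconstruction is assembled purely from the added lines above; it is not copied from the repository, and blank-line placement is guessed:

```python
from typing import List

from paddlenlp.embeddings import TokenEmbedding
from paddlehub.module.module import moduleinfo
from paddlehub.module.nlp_module import EmbeddingModule


@moduleinfo(
    name="w2v_literature_target_bigram-char_dim300",
    version="1.0.1",
    summary="",
    author="paddlepaddle",
    author_email="",
    type="nlp/semantic_model",
    meta=EmbeddingModule)
class Embedding(TokenEmbedding):
    """
    Embedding model
    """
    # The embedding name is now a class attribute, presumably so the
    # EmbeddingModule meta class can read it without instantiating the module.
    embedding_name = "w2v.literature.target.bigram-char.dim300"

    def __init__(self, *args, **kwargs):
        super(Embedding, self).__init__(embedding_name=self.embedding_name, *args, **kwargs)
```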
modules/text/embedding/w2v_literature_target_word-bigram_dim300/README.md
Same README change as above: hunk @@ -56,6 +56,25 @@ adds the documentation for `get_vocab_path()` and `get_tokenizer(*args, **kwargs)`, and hunk @@ -125,3 +144,6 @@ adds the 1.0.1 changelog entry (embedding-based text classification and sequence labeling finetune support).
modules/text/embedding/w2v_literature_target_word-bigram_dim300/module.py
Same module.py change as above (imports, version bump to 1.0.1, meta=EmbeddingModule, class-level embedding_name), here with name="w2v_literature_target_word-bigram_dim300" and embedding_name = "w2v.literature.target.word-bigram.dim300".
modules/text/embedding/w2v_literature_target_word-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_literature_target_word-char_dim300/module.py
Same module.py change as above, here with name="w2v_literature_target_word-char_dim300" and embedding_name = "w2v.literature.target.word-char.dim300".
modules/text/embedding/w2v_literature_target_word-word_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry for embedding-based text classification and sequence labeling finetuning).
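The 1.0.1 changelog entry refers to finetuning on top of these embeddings. A minimal sketch of the idea, assuming only that the loaded module behaves as the paddle.nn.Embedding subclass (TokenEmbedding) shown in the diffs; the classifier head, pooling, and class names here are illustrative, not PaddleHub's actual finetune API:

```python
import paddle
import paddlehub as hub

# Any of the w2v_* modules upgraded in this commit should work here.
embedder = hub.Module(name="w2v_literature_target_word-word_dim300")


class BoWClassifier(paddle.nn.Layer):
    """Illustrative bag-of-words text classifier over the pretrained embedding."""

    def __init__(self, embedder, num_classes=2):
        super().__init__()
        self.embedder = embedder
        # TokenEmbedding is an nn.Embedding subclass: weight is [vocab, dim].
        self.fc = paddle.nn.Linear(embedder.weight.shape[1], num_classes)

    def forward(self, token_ids):          # token_ids: [batch, seq_len]
        emb = self.embedder(token_ids)     # [batch, seq_len, dim]
        pooled = paddle.mean(emb, axis=1)  # average pooling over tokens
        return self.fc(pooled)             # [batch, num_classes]


model = BoWClassifier(embedder)
```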
modules/text/embedding/w2v_literature_target_word-word_dim300/module.py
Same module.py change as above, here with name="w2v_literature_target_word-word_dim300" and embedding_name = "w2v.literature.target.word-word.dim300".
modules/text/embedding/w2v_mixed-large_target_word-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_mixed-large_target_word-char_dim300/module.py
Same module.py change as above, here with name="w2v_mixed-large_target_word-char_dim300" and embedding_name = "w2v.mixed-large.target.word-char.dim300".
modules/text/embedding/w2v_mixed-large_target_word-word_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_mixed-large_target_word-word_dim300/module.py
Same module.py change as above, here with name="w2v_mixed-large_target_word-word_dim300" and embedding_name = "w2v.mixed-large.target.word-word.dim300".
modules/text/embedding/w2v_people_daily_target_bigram-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_people_daily_target_bigram-char_dim300/module.py
Same module.py change as above, here with name="w2v_people_daily_target_bigram-char_dim300" and embedding_name = "w2v.people_daily.target.bigram-char.dim300".
modules/text/embedding/w2v_people_daily_target_word-bigram_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_people_daily_target_word-bigram_dim300/module.py
Same module.py change as above, here with name="w2v_people_daily_target_word-bigram_dim300" and embedding_name = "w2v.people_daily.target.word-bigram.dim300".
modules/text/embedding/w2v_people_daily_target_word-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_people_daily_target_word-char_dim300/module.py
Same module.py change as above, here with name="w2v_people_daily_target_word-char_dim300" and embedding_name = "w2v.people_daily.target.word-char.dim300".
modules/text/embedding/w2v_people_daily_target_word-word_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_people_daily_target_word-word_dim300/module.py
Same module.py change as above, here with name="w2v_people_daily_target_word-word_dim300" and embedding_name = "w2v.people_daily.target.word-word.dim300".
modules/text/embedding/w2v_sikuquanshu_target_word-bigram_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sikuquanshu_target_word-bigram_dim300/module.py
Same module.py change as above, here with name="w2v_sikuquanshu_target_word-bigram_dim300" and embedding_name = "w2v.sikuquanshu.target.word-bigram.dim300".
modules/text/embedding/w2v_sikuquanshu_target_word-word_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sikuquanshu_target_word-word_dim300/module.py
Same module.py change as above, here with name="w2v_sikuquanshu_target_word-word_dim300" and embedding_name = "w2v.sikuquanshu.target.word-word.dim300".
modules/text/embedding/w2v_sogou_target_bigram-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sogou_target_bigram-char_dim300/module.py
Same module.py change as above, here with name="w2v_sogou_target_bigram-char_dim300" and embedding_name = "w2v.sogou.target.bigram-char.dim300".
modules/text/embedding/w2v_sogou_target_word-bigram_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sogou_target_word-bigram_dim300/module.py
Same module.py change as above, here with name="w2v_sogou_target_word-bigram_dim300" and embedding_name = "w2v.sogou.target.word-bigram.dim300".
modules/text/embedding/w2v_sogou_target_word-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sogou_target_word-char_dim300/module.py
Same module.py change as above, here with name="w2v_sogou_target_word-char_dim300" and embedding_name = "w2v.sogou.target.word-char.dim300".
modules/text/embedding/w2v_sogou_target_word-word_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_sogou_target_word-word_dim300/module.py
Same module.py change as above, here with name="w2v_sogou_target_word-word_dim300" and embedding_name = "w2v.sogou.target.word-word.dim300".
modules/text/embedding/w2v_weibo_target_bigram-char_dim300/README.md
Same README change as above (get_vocab_path/get_tokenizer documentation plus the 1.0.1 changelog entry).
modules/text/embedding/w2v_weibo_target_bigram-char_dim300/module.py
Same module.py change as above, here with name="w2v_weibo_target_bigram-char_dim300" and embedding_name = "w2v.weibo.target.bigram-char.dim300".
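The `@serving` decorator removed in these diffs exposed `calc_similarity` over HTTP through PaddleHub Serving. A sketch of a 1.0.0-style request, assuming the standard PaddleHub serving route and default port; payload and words are illustrative:

```python
import requests

# Start the server first, e.g.: hub serving start -m w2v_weibo_target_bigram-char_dim300
# @serving methods are routed at /predict/<module name>; JSON keys map to parameters.
url = "http://127.0.0.1:8866/predict/w2v_weibo_target_bigram-char_dim300"
payload = {"data": [["微博", "博客"]]}  # word pairs, per calc_similarity's signature

response = requests.post(url, json=payload)
print(response.json())  # similarities returned as strings
```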
The page view collapses the diffs for the remaining files, each presumably carrying the same upgrade:

modules/text/embedding/w2v_weibo_target_word-bigram_dim300/README.md
modules/text/embedding/w2v_weibo_target_word-bigram_dim300/module.py
modules/text/embedding/w2v_weibo_target_word-char_dim300/README.md
modules/text/embedding/w2v_weibo_target_word-char_dim300/module.py
modules/text/embedding/w2v_weibo_target_word-word_dim300/README.md
modules/text/embedding/w2v_weibo_target_word-word_dim300/module.py
modules/text/embedding/w2v_wiki_target_bigram-char_dim300/README.md
modules/text/embedding/w2v_wiki_target_bigram-char_dim300/module.py
modules/text/embedding/w2v_wiki_target_word-bigram_dim300/README.md
modules/text/embedding/w2v_wiki_target_word-bigram_dim300/module.py
modules/text/embedding/w2v_wiki_target_word-char_dim300/README.md
modules/text/embedding/w2v_wiki_target_word-char_dim300/module.py
modules/text/embedding/w2v_wiki_target_word-word_dim300/README.md
modules/text/embedding/w2v_wiki_target_word-word_dim300/module.py
modules/text/embedding/w2v_zhihu_target_bigram-char_dim300/README.md
modules/text/embedding/w2v_zhihu_target_bigram-char_dim300/module.py
modules/text/embedding/w2v_zhihu_target_word-bigram_dim300/README.md
modules/text/embedding/w2v_zhihu_target_word-bigram_dim300/module.py
modules/text/embedding/w2v_zhihu_target_word-char_dim300/README.md
modules/text/embedding/w2v_zhihu_target_word-char_dim300/module.py
modules/text/embedding/w2v_zhihu_target_word-word_dim300/README.md
modules/text/embedding/w2v_zhihu_target_word-word_dim300/module.py