sfewfsaf / Synonyms
Commit b23e1c3b
Authored on Oct 25, 2018 by Hai Liang Wang

Compute edit distance with stopwords removed (计算编辑距离,去停用词)

Parent: 99799847

Showing 4 changed files with 31 additions and 5 deletions (+31 −5)
CHANGELOG.md          +6  −0
Requirements.txt      +1  −1
setup.py              +1  −1
synonyms/synonyms.py  +23 −3
CHANGELOG.md

```diff
+# 3.10
+
+* Remove stopwords when computing the edit distance (计算编辑距离时去停用词)
 # 3.9
 * fix bug
 # 3.8
 * Get the vector of a segmented sentence, composed as bag-of-words (获得一个分词后句子的向量,向量以BoW方式组成)
 ...
```
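The 3.8 changelog entry above mentions composing a sentence vector from its word vectors in bag-of-words fashion. A minimal sketch of that idea, using hypothetical low-dimensional word vectors and a simple sum (the library's real embeddings, dimensionality, and weighting may differ):

```python
# Hypothetical 3-dimensional word vectors; the real library uses
# pretrained embeddings of much higher dimensionality.
word_vectors = {
    "天气": [1.0, 2.0, 0.0],
    "很好": [0.0, 1.0, 3.0],
}

def bow_sentence_vector(words, vectors, dim=3):
    # Sum the vectors of known words; out-of-vocabulary words
    # contribute nothing in this sketch.
    vec = [0.0] * dim
    for w in words:
        for i, v in enumerate(vectors.get(w, [])):
            vec[i] += v
    return vec
```

For example, `bow_sentence_vector(["天气", "很好"], word_vectors)` sums the two component vectors, while a sentence of only unknown words yields the zero vector.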
Requirements.txt

```diff
-synonyms>=3.6
\ No newline at end of file
+synonyms>=3.10
\ No newline at end of file
```
setup.py

```diff
@@ -13,7 +13,7 @@ Welcome
 setup(
     name='synonyms',
-    version='3.8.0',
+    version='3.10.0',
     description='Chinese Synonyms for Natural Language Processing and Understanding',
     long_description=LONGDOC,
     author='Hai Liang Wang, Hu Ying Xi',
```
synonyms/synonyms.py

```diff
@@ -78,7 +78,7 @@ tokenizer settings
 '''
 tokenizer_dict = os.path.join(curdir, 'data', 'vocab.txt')
 if "SYNONYMS_WORDSEG_DICT" in ENVIRON:
-    if os.exist(ENVIRON["SYNONYMS_WORDSEG_DICT"]):
+    if os.path.exists(ENVIRON["SYNONYMS_WORDSEG_DICT"]):
         print("info: set wordseg dict with %s" % tokenizer_dict)
         tokenizer_dict = ENVIRON["SYNONYMS_WORDSEG_DICT"]
     else:
         print("warning: can not find dict at [%s]" % tokenizer_dict)
```
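The one-line fix above replaces `os.exist`, which does not exist in Python's standard library and would raise `AttributeError` when reached, with the correct `os.path.exists`. A standalone sketch of the corrected lookup logic (the function name and arguments here are illustrative, not part of the library's API):

```python
import os

def resolve_tokenizer_dict(environ, default_dict):
    # Mirror the patched logic: prefer the path given in
    # SYNONYMS_WORDSEG_DICT when it points at an existing file,
    # otherwise warn and fall back to the bundled dictionary.
    custom = environ.get("SYNONYMS_WORDSEG_DICT")
    if custom is not None:
        if os.path.exists(custom):
            return custom
        print("warning: can not find dict at [%s]" % custom)
    return default_dict
```

Note that the diff leaves the `info:` message printing the old `tokenizer_dict` value before reassignment, a pre-existing quirk the commit does not touch.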
```diff
@@ -303,23 +303,43 @@ def nearby(word):
         _cache_nearby[w] = (words, scores)
     return words, scores

-def compare(s1, s2, seg=True, ignore=False):
+def compare(s1, s2, seg=True, ignore=False, stopwords=False):
     '''
     compare similarity
     s1 : sentence1
     s2 : sentence2
     seg : True : The original sentences need jieba.cut
           False : The original sentences have been cut.
     ignore: True: ignore OOV words
             False: get vector randomly for OOV words
     '''
     if s1 == s2: return 1.0
+    s1_words = []
+    s2_words = []
     if seg:
         s1 = [x for x in jieba.cut(s1)]
         s2 = [x for x in jieba.cut(s2)]
     else:
         s1 = s1.split()
         s2 = s2.split()
+    # check stopwords
+    if not stopwords:
+        global _stopwords
+        for x in s1:
+            if not x in _stopwords:
+                s1_words.append(x)
+        for x in s2:
+            if not x in _stopwords:
+                s2_words.append(x)
+    else:
+        s1_words = s1
+        s2_words = s2
     assert len(s1) > 0 and len(s2) > 0, "The length of s1 and s2 should > 0."
-    return _similarity_distance(s1, s2, ignore)
+    return _similarity_distance(s1_words, s2_words, ignore)

 def display(word):
     print("'%s'近义词:" % word)
```
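The new `stopwords` parameter defaults to `False`, which (somewhat counter-intuitively) means stopwords are filtered out before `_similarity_distance` is called; passing `stopwords=True` keeps them. The filtering step can be sketched in isolation — the stopword set below is hypothetical, and the clearer `keep_stopwords` name is this sketch's own, while the library reads its real `_stopwords` from its data files:

```python
# Hypothetical stopword set for illustration only.
_stopwords = {"的", "了", "很"}

def filter_stopwords(tokens, keep_stopwords=False):
    # Mirrors the new branch in compare(): drop stopword tokens
    # unless the caller asks to keep them.
    if keep_stopwords:
        return list(tokens)
    return [x for x in tokens if x not in _stopwords]
```

So `filter_stopwords(["今天", "的", "天气"])` yields `["今天", "天气"]`, which is what `compare` now passes on as `s1_words`/`s2_words` by default.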