Commit a1af98a5 authored by: Hai Liang Wang

Support customized dict and model

Parent 792750c5
......@@ -11,7 +11,7 @@ Chinese Synonyms for Natural Language Processing and Understanding.
* [Install](https://github.com/huyingxi/Synonyms#welcome)
* [Usage](https://github.com/huyingxi/Synonyms#usage)
* [Demo](https://github.com/huyingxi/Synonyms#demo)
* [Quick Get Start](https://github.com/huyingxi/Synonyms#quick-get-start)
* [Valuation](https://github.com/huyingxi/Synonyms#valuation)
* [Benchmark](https://github.com/huyingxi/Synonyms#benchmark)
* [Statement](https://github.com/huyingxi/Synonyms#statement)
......@@ -24,7 +24,9 @@ Chinese Synonyms for Natural Language Processing and Understanding.
```
pip install -U synonyms
```
Compatible with py2 and py3; the current stable release is [v2.x](https://github.com/huyingxi/Synonyms/releases). **Meanwhile, Node.js users can now use [node-synonyms](https://www.npmjs.com/package/node-synonyms).**
Compatible with py2 and py3; the current stable release is [v2.x](https://github.com/huyingxi/Synonyms/releases).
**Node.js users can now use [node-synonyms](https://www.npmjs.com/package/node-synonyms).**
```
npm install node-synonyms
......@@ -32,12 +34,17 @@ npm install node-synonyms
![](./assets/3.gif)
## Samples
![](assets/2.png)
The configuration and API notes in this document target the Python package; for the Node version, see the [project](https://www.npmjs.com/package/node-synonyms).
## Usage
The word segmentation dictionary and the word2vec vector file can be configured through environment variables.

| Environment Variable | Description |
| --- | --- |
| *SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN* | Word vector file trained with word2vec, in binary format. |
| *SYNONYMS_WORDSEG_DICT* | [**Main dictionary**](https://github.com/fxsjy/jieba#%E5%BB%B6%E8%BF%9F%E5%8A%A0%E8%BD%BD%E6%9C%BA%E5%88%B6) for Chinese word segmentation; for the file format and usage, see the [reference](https://github.com/fxsjy/jieba#%E8%BD%BD%E5%85%A5%E8%AF%8D%E5%85%B8). |
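A minimal sketch of setting these variables from Python before the package is imported; both file paths below are placeholders, not files shipped with the project:

```python
import os

# Placeholder paths: point these at your own trained word2vec binary
# and your own jieba-format dictionary file.
os.environ["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"] = "/path/to/words.vector"
os.environ["SYNONYMS_WORDSEG_DICT"] = "/path/to/user_dict.txt"

# Set the variables BEFORE `import synonyms`: the package reads
# os.environ once at import time.
```

Setting the variables in the shell (`export SYNONYMS_WORDSEG_DICT=...`) before launching Python works the same way.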
### synonyms#nearby
```
import synonyms
......@@ -96,7 +103,11 @@ synonyms.nearby(人脸) = [
![](assets/1.png)
## Demo
## Samples
![](assets/2.png)
## Quick Get Start
```
$ pip install -r Requirements.txt
$ python demo.py
......
......@@ -13,7 +13,7 @@ Welcome
setup(
name='synonyms',
version='2.5',
version='2.6',
description='Chinese Synonyms for Natural Language Processing and Understanding',
long_description=LONGDOC,
author='Hai Liang Wang, Hu Ying Xi',
......
......@@ -41,6 +41,9 @@ if sys.version_info[0] < 3:
else:
    PLT = 3
# Get Environment variables
ENVIRON = os.environ.copy()
import json
import gzip
import shutil
......@@ -66,6 +69,17 @@ lambda fns
_similarity_smooth = lambda x, y, z: (x * y) + z
_sim_molecule = lambda x: np.sum(x, axis=0) # numerator
'''
tokenizer settings
'''
if "SYNONYMS_WORDSEG_DICT" in ENVIRON:
    tokenizer_dict = ENVIRON["SYNONYMS_WORDSEG_DICT"]
    if os.path.exists(tokenizer_dict):
        jieba.set_dictionary(tokenizer_dict)
        print("info: set wordseg dict with %s" % tokenizer_dict)
    else:
        print("warning: can not find dict at [%s]" % tokenizer_dict)
'''
nearby
'''
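The environment-variable guard introduced in this commit can be exercised standalone. In this sketch, `configure_tokenizer` is a hypothetical helper, and a stub callback stands in for `jieba.set_dictionary` so the code runs without jieba installed:

```python
import os

def configure_tokenizer(environ, set_dictionary):
    # Mirror the guard: only override the default dictionary when the
    # variable is set AND the file actually exists on disk.
    if "SYNONYMS_WORDSEG_DICT" not in environ:
        return "info: using default dict"
    tokenizer_dict = environ["SYNONYMS_WORDSEG_DICT"]
    if os.path.exists(tokenizer_dict):
        set_dictionary(tokenizer_dict)
        return "info: set wordseg dict with %s" % tokenizer_dict
    return "warning: can not find dict at [%s]" % tokenizer_dict

print(configure_tokenizer({}, lambda p: None))
# info: using default dict
print(configure_tokenizer({"SYNONYMS_WORDSEG_DICT": "/no/such/file"}, lambda p: None))
# warning: can not find dict at [/no/such/file]
```

Checking existence before calling `set_dictionary` keeps a typo in the variable from silently breaking segmentation at import time.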
......@@ -133,13 +147,15 @@ def _segment_words(sen):
# vectors
_f_model = os.path.join(curdir, 'data', 'words.vector')
if "SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN" in ENVIRON:
_f_model = ENVIRON["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"]
def _load_w2v(model_file=_f_model, binary=True):
    '''
    load word2vec model
    '''
    if not os.path.exists(model_file):
        print("os.path : ", os.path)
        raise Exception("Model file does not exist.")
        raise Exception("Model file [%s] does not exist." % model_file)
    return KeyedVectors.load_word2vec_format(
        model_file, binary=binary, unicode_errors='ignore')
print(">> Synonyms on loading vectors ...")
......
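The patched exception above embeds the offending path in the message. The behavior can be sketched without gensim; `load_w2v_path` is a hypothetical stand-in that performs only the existence check, not the actual `KeyedVectors` load:

```python
import os

def load_w2v_path(model_file):
    # Fail fast and name the missing file, as the patched message does.
    if not os.path.exists(model_file):
        raise Exception("Model file [%s] does not exist." % model_file)
    return model_file

try:
    load_w2v_path("/missing/words.vector")
except Exception as e:
    print(e)  # Model file [/missing/words.vector] does not exist.
```

Including the resolved path makes misconfigured `SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN` values obvious from the traceback alone.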