Text similarity results do not match expectations
Created by: CUITCHE
Using the official "Quick Start" code, I obtained vectors for several groups of words:
import numpy as np
import paddle.fluid.dygraph as D
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
D.guard().__enter__() # activate paddle `dygraph` mode
model = ErnieModel.from_pretrained('ernie-1.0') # Try to get pretrained model from server, make sure you have network connection
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = D.to_variable(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
I then computed cosine similarity from these word vectors, but did not get the values I expected:
def cos_sim(vector_a, vector_b):
    """Compute the cosine similarity of two vectors, rescaled to [0, 1]."""
    vector_a = np.mat(vector_a)
    vector_b = np.mat(vector_b)
    num = float(vector_a * vector_b.T)  # dot product (1x1 matrix -> float)
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    sim = 0.5 + 0.5 * cos  # map raw cosine from [-1, 1] to [0, 1]
    return sim

def calc_cos_sim(ws):
    res = list()
    for w in ws:
        ids, _ = tokenizer.encode(w)
        ids = D.to_variable(np.expand_dims(ids, 0))  # insert extra `batch` dimension
        pooled, encoded = model(ids)
        res.append(list(pooled.numpy()[0]))  # keep the pooled vector of each word
    return cos_sim(res[0], res[1])

if __name__ == '__main__':
    # activate paddle `dygraph` mode
    with D.guard():
        model = ErnieModel.from_pretrained('ernie-1.0')
        tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
        pairs = [('法师', '律师'), ('医院', '法院'), ('展架', '展柜')]
        for p in pairs:
            print(f'{p} {calc_cos_sim(p)}')
Output:
A. ('法师', '律师') 0.9383215354753538
B. ('医院', '法院') 0.978857960588891
C. ('展架', '展柜') 0.9633232497133388
In fact, group C should be the closest to 1, while groups A and B should be farther from 1.
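To make the scores above easier to read, here is a minimal standard-library sketch of the same formula. The helper names `cosine`, `rescale`, and `unscale` are hypothetical and not part of ERNIE or the original script. It shows that the `0.5 + 0.5 * cos` mapping halves the distance from 1, so the raw cosines behind the reported scores are roughly 0.877, 0.958, and 0.927.

```python
import math

def cosine(a, b):
    """Raw cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rescale(cos):
    """Same mapping as in cos_sim: [-1, 1] -> [0, 1]."""
    return 0.5 + 0.5 * cos

def unscale(sim):
    """Invert rescale() to recover the raw cosine from a reported score."""
    return 2.0 * sim - 1.0

# Toy vectors (hypothetical): orthogonal vectors have raw cosine 0.
print(rescale(cosine([1.0, 0.0], [0.0, 1.0])))  # prints 0.5

# Scores reported in the post, converted back to raw cosines.
reported = {
    ('法师', '律师'): 0.9383215354753538,
    ('医院', '法院'): 0.978857960588891,
    ('展架', '展柜'): 0.9633232497133388,
}
for pair, sim in reported.items():
    print(pair, unscale(sim))
```

Comparing raw cosines does not change the ordering of the pairs, but it makes the gaps between them twice as large and therefore easier to see.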
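Additionally, the computed values are not exactly the same on every run; they fluctuate slightly. For example, running the script again gives:
A. ('法师', '律师') 0.956882570176395
B. ('医院', '法院') 0.9377995660564593
C. ('展架', '展柜') 0.9481575568652009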
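The two runs can be compared directly with a short sketch. The helper names `ranking` and `max_drift` are hypothetical; the numbers are copied from the two outputs in this post. It suggests the fluctuation matters: the ordering of the pairs flips between runs, and the largest run-to-run change is about as big as the spread between pairs within a single run.

```python
# Scores copied from the two runs reported in this post.
run_1 = {
    ('法师', '律师'): 0.9383215354753538,
    ('医院', '法院'): 0.978857960588891,
    ('展架', '展柜'): 0.9633232497133388,
}
run_2 = {
    ('法师', '律师'): 0.956882570176395,
    ('医院', '法院'): 0.9377995660564593,
    ('展架', '展柜'): 0.9481575568652009,
}

def ranking(scores):
    """Pairs ordered from most similar to least similar."""
    return sorted(scores, key=scores.get, reverse=True)

def max_drift(a, b):
    """Largest absolute change of any pair's score between two runs."""
    return max(abs(a[k] - b[k]) for k in a)

def spread(scores):
    """Difference between the highest and lowest score within one run."""
    return max(scores.values()) - min(scores.values())

print(ranking(run_1))           # B, C, A
print(ranking(run_2))           # A, C, B
print(max_drift(run_1, run_2))  # about 0.041
print(spread(run_1))            # about 0.041
```

With noise of this size, the difference between any two pairs in a single run cannot be distinguished from run-to-run variation.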