In the demo that fits a logistic regression model for text classification, the input data should not be of type sparse_binary_vector
Created by: backyes
The lr (bow) model used for text classification:
- Data type definition:
settings.input_types = [
    # The first input is a sparse_binary_vector,
    # which means each dimension of the vector is either 0 or 1. It is the
    # bag-of-words (BOW) representation of the texts.
    sparse_binary_vector(len(dictionary)),
    # The second input is an integer. It represents the category id of the
    # sample. 2 means there are two labels in the dataset.
    # (1 for positive and 0 for negative)
    integer_value(2)
]
- Data processing:
def process(settings, file_name):
    # Open the input data file.
    with open(file_name, 'r') as f:
        # Read each line.
        for line in f:
            # Each line contains the label and text of the comment, separated by \t.
            label, comment = line.strip().split('\t')
            # Split the words into a list.
            words = comment.split()
            # Convert the words into a list of ids by looking them up in word_dict.
            word_vector = [settings.word_dict.get(w, UNK_IDX) for w in words]
            # Return the features for the current comment. The first is a list
            # of ids representing a 0-1 binary sparse vector of the text,
            # the second is the integer id of the label.
            yield word_vector, int(label)
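
To make the discussion concrete, here is a minimal standalone sketch of what the lookup above yields for two sample lines. It does not use PaddlePaddle; the toy `word_dict`, the value of `UNK_IDX`, and the helper name `process_lines` are assumptions made up for illustration.

```python
# Hypothetical toy dictionary and unknown-word id, for illustration only.
UNK_IDX = 0
word_dict = {"<unk>": 0, "good": 1, "movie": 2, "bad": 3}


def process_lines(lines, word_dict):
    """Standalone version of the lookup in process(), reading from a list."""
    for line in lines:
        # Each line contains the label and the comment text, separated by \t.
        label, comment = line.strip().split('\t')
        words = comment.split()
        # Words missing from the dictionary fall back to UNK_IDX.
        word_vector = [word_dict.get(w, UNK_IDX) for w in words]
        yield word_vector, int(label)


samples = list(process_lines(["1\tgood movie\n", "0\tbad boring movie\n"], word_dict))
print(samples)  # [([1, 2], 1), ([3, 0, 2], 0)]
```

Each yielded `word_vector` is a list of dictionary indices whose length equals the number of words in the comment.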
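Judging from the dataprovider fragment above, the input data type of the bow (lr) model should be dense_vector rather than sparse_binary_vector: each id in the returned word_vector is the word's index in the dictionary, so every entry of word_vector is a non-zero value. Using dense_vector therefore seems more reasonable; otherwise it can easily mislead users about what sparse means.
Note: this question came up while tracking down a sparse-related bug. (The usage of sparse is rather obscure to users; our guide)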
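
For reviewers, a small plain-Python sketch of the two readings of `word_vector` that are in question. It calls no PaddlePaddle API; `ids_to_binary_bow` is a hypothetical helper, and the sample ids and dictionary size are made up. The expansion follows the reading in the code comment, where each id marks a dimension that is set to 1.

```python
def ids_to_binary_bow(word_ids, dict_size):
    """Expand a list of word ids into an explicit 0-1 vector of length dict_size."""
    vec = [0] * dict_size
    for i in word_ids:
        vec[i] = 1  # a repeated id still produces a single 1
    return vec


word_ids = [3, 0, 2, 3]  # what process() would yield for a four-word comment
dict_size = 5            # stands in for len(dictionary)

# Reading 1: the ids are indices of the non-zero dimensions.
print(ids_to_binary_bow(word_ids, dict_size))  # [1, 0, 1, 1, 0]

# Reading 2: the ids are the values themselves, one entry per word.
print(word_ids)  # [3, 0, 2, 3]
```

Which of these two readings the declared input type should convey to users is the question raised here.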
Related: https://github.com/PaddlePaddle/Paddle/issues/841 https://github.com/PaddlePaddle/Paddle/issues/847
@pengli09 @lcy-seso Please help review whether the suggestion above is correct.