In the demo that fits text classification with a logistic regression model, the input data should not be of type sparse_binary_vector (#851) · Issue · PaddlePaddle / Paddle


Created by: backyes

The lr (bow) model used for text classification:

Data type definition:

    settings.input_types = [
        # The first input is a sparse_binary_vector,
        # which means each dimension of the vector is either 0 or 1. It is the
        # bag-of-words (BOW) representation of the texts.
        sparse_binary_vector(len(dictionary)),
        # The second input is an integer. It represents the category id of the
        # sample. 2 means there are two labels in the dataset.
        # (1 for positive and 0 for negative)
        integer_value(2)
    ]

Data processing:

    def process(settings, file_name):
        # Open the input data file.
        with open(file_name, 'r') as f:
            # Read each line.
            for line in f:
                # Each line contains the label and text of the comment, separated by \t.
                label, comment = line.strip().split('\t')

                # Split the words into a list.
                words = comment.split()

                # Convert the words into a list of ids by looking them up in word_dict.
                word_vector = [settings.word_dict.get(w, UNK_IDX) for w in words]

                # Return the features for the current comment. The first is a list
                # of ids representing a 0-1 binary sparse vector of the text,
                # the second is the integer id of the label.
                yield word_vector, int(label)
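
To make the discussion concrete, here is a minimal sketch of what the loop body above produces for one input line. It is plain Python with no Paddle calls; `line_to_sample`, the tiny `word_dict`, and the value `UNK_IDX = 0` are assumptions made for illustration and are not taken from the demo.

```python
# Hypothetical stand-ins: the real demo builds word_dict from a dictionary file.
UNK_IDX = 0  # assumed id for out-of-vocabulary words
word_dict = {"<unk>": 0, "good": 1, "movie": 2, "bad": 3}


def line_to_sample(line, word_dict):
    """Mirror the body of process() for a single 'label<TAB>text' line."""
    label, comment = line.strip().split('\t')
    word_ids = [word_dict.get(w, UNK_IDX) for w in comment.split()]
    return word_ids, int(label)


sample = line_to_sample("1\tgood good movie tonight\n", word_dict)
print(sample)  # ([1, 1, 2, 0], 1)
```

In this sketch the yielded list has one entry per word of the comment, so its length follows the text rather than `len(dictionary)`, and ids can repeat.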

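Judging from the dataprovider snippets above, the input data type of the bow (lr) model should be dense_vector rather than sparse_binary_vector. Each id in the returned word_vector is the word's index in the dictionary, so every entry of word_vector is a non-zero value. Using dense_vector therefore seems more reasonable; otherwise it can easily mislead users about what sparse means.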
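
For reviewers, a minimal sketch of the two possible readings of the same id list, in plain Python with no Paddle calls. The helper `as_binary_vector` and the sample ids are hypothetical. Under the reading given in the code comments, the ids name the positions that are 1 in a vector of length `len(dictionary)`; under the reading suggested here, the list entries would be taken as the values themselves.

```python
def as_binary_vector(word_ids, dim):
    """Read word_ids as the positions of the 1s in a 0/1 vector of length dim."""
    vec = [0] * dim
    for i in word_ids:
        vec[i] = 1
    return vec


word_ids = [1, 1, 2, 0]  # illustrative ids for a four-word comment
dim = 4                  # stands in for len(dictionary)

binary = as_binary_vector(word_ids, dim)
print(binary)            # [1, 1, 1, 0]
```

Note that the expanded vector always has `dim` entries, while the id list read literally has one entry per word in the comment, so the two readings only coincide in length by accident.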

Note: this question came up while tracking down a sparse-related bug. (Users find the usage of sparse rather obscure; our guide)

@pengli09 @lcy-seso please help review whether the suggestion above is correct.
