Newline characters cause an error

Created by: 60999

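Thank you for reporting a PaddleHub usage problem and for contributing to PaddleHub! When filing your issue, please also provide the following information:

Version and environment information
1) PaddleHub and PaddlePaddle versions: PaddleHub 1.6.2, PaddlePaddle 1.7.2
2) System environment (please describe the system type): aistudio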

Code:

    import pandas as pd

    df = pd.read_csv('/home/aistudio/data/train.csv', dtype=str, verbose=True, skip_blank_lines=True, sep=',', delim_whitespace=False, nrows=10)
    print(df)
    out = pd.DataFrame(data=df)
    out.to_csv('/home/aistudio/data/train.tsv', sep='\t', index=False, encoding='utf-8')

    df = pd.read_csv('/home/aistudio/data/train.tsv', dtype=str, verbose=True, skip_blank_lines=True, sep='\t', delim_whitespace=False, nrows=10)
    print(df)

The generated file is in the attachment. Code output:

    Tokenization took: 0.05 ms
    Type conversion took: 0.17 ms
    Parser memory cleanup took: 0.00 ms
                                                    text label
    0  I just read a VERY Very similar story as mine ...     6
    1  on or about XXXX/XXXX/2015 a tree fell on our ...     1
    2  I have had a Macy 's credit card since XX/XX/X...     2
    3  Please see my Complaint # XXXX of XXXX XXXX. T...     2
    4  On XXXX/XXXX/2015 I visited the XXXX XXXX stor...     4
    5  We originally took out a mortgage loan in XXXX...     1
    6  Thru a company called XXXX XXXX XXXX, they neg...     3
    7  On XXXX/XXXX/2015, Experian sent me an email a...     4
    8  I have been dealing with Bank of America for a...     1
    9  Currently we have a mortgage with a mortgage c...     1
    Tokenization took: 0.06 ms
    Type conversion took: 0.16 ms
    Parser memory cleanup took: 0.00 ms
                                                    text label
    0  I just read a VERY Very similar story as mine ...     6
    1  on or about XXXX/XXXX/2015 a tree fell on our ...     1
    2  I have had a Macy 's credit card since XX/XX/X...     2
    3  Please see my Complaint # XXXX of XXXX XXXX. T...     2
    4  On XXXX/XXXX/2015 I visited the XXXX XXXX stor...     4
    5  We originally took out a mortgage loan in XXXX...     1
    6  Thru a company called XXXX XXXX XXXX, they neg...     3
    7  On XXXX/XXXX/2015, Experian sent me an email a...     4
    8  I have been dealing with Bank of America for a...     1
    9  Currently we have a mortgage with a mortgage c...     1

So reading and writing with pandas work fine. Next I ran:

    import pandas as pd
    from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset
    class ThuNews(BaseNLPDataset):
        def __init__(self):
            # Dataset location
            self.dataset_dir = "/home/aistudio/data/"
            super(ThuNews, self).__init__(
                base_path=self.dataset_dir,
                train_file="train.tsv",
                dev_file="train.tsv",
                test_file="train.tsv",
                train_file_with_header=True,
                dev_file_with_header=True,
                test_file_with_header=True,
                predict_file_with_header=True)
            # Set of dataset categories
            # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])

    dataset = ThuNews()
    for e in dataset.get_train_examples()[:3]:
        print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))

├─train.tsv training set

├─dev.tsv validation set

├─test.tsv test set

Prediction data is stored in predict.tsv, whose format is similar to train.tsv; just remove the label column.

Error output:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-7-9f5158d6e063> in <module>
         17         # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])
         18
    ---> 19 dataset = ThuNews()
         20 for e in dataset.get_train_examples()[:3]:
         21     print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))

    <ipython-input-7-9f5158d6e063> in __init__(self)
         13             dev_file_with_header=True,
         14             test_file_with_header=True,
    ---> 15             predict_file_with_header=True)
         16         # Set of dataset categories
         17         # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/base_nlp_dataset.py in __init__(self, base_path, train_file, dev_file, test_file, predict_file, label_file, label_list, train_file_with_header, dev_file_with_header, test_file_with_header, predict_file_with_header)
         49             dev_file_with_header=dev_file_with_header,
         50             test_file_with_header=test_file_with_header,
    ---> 51             predict_file_with_header=predict_file_with_header)
         52
         53     def _read_file(self, input_file, phase=None):

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/dataset.py in __init__(self, base_path, train_file, dev_file, test_file, predict_file, label_file, label_list, train_file_with_header, dev_file_with_header, test_file_with_header, predict_file_with_header)
         92
         93         if train_file:
    ---> 94             self._load_train_examples()
         95         if dev_file:
         96             self._load_dev_examples()

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/dataset.py in _load_train_examples(self)
        148     def _load_train_examples(self):
        149         self.train_path = os.path.join(self.base_path, self.train_file)
    --> 150         self.train_examples = self._read_file(self.train_path, phase="train")
        151
        152     def _load_dev_examples(self):

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/base_nlp_dataset.py in _read_file(self, input_file, phase)
         69                 elif ncol == 2:
         70                     example = InputExample(
    ---> 71                         guid=i, text_a=line[0], label=line[1])
         72                 elif ncol == 3:
         73                     example = InputExample(

    IndexError: list index out of range
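One way to locate the offending rows before constructing the dataset is to split each line on tabs and list the rows that have fewer than two fields. The sketch below is only an illustration: `find_short_rows` is a hypothetical helper, the sample string is made-up data, and disabling quote handling with `csv.QUOTE_NONE` is an assumption about how the loader tokenizes the file, not something confirmed from the PaddleHub source here.

```python
import csv
import io


def find_short_rows(lines, expected_cols=2):
    """Return (row_number, fields) for every row with fewer than expected_cols fields.

    Rows are split on tabs with quote handling disabled. That is an
    assumption about the failing loader, not its confirmed behavior.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(i, row) for i, row in enumerate(reader) if len(row) < expected_cols]


# Made-up sample: the second record has a newline inside its quoted text.
sample = 'text\tlabel\nfirst complaint\t6\n"second line one.\nline two"\t1\n'
bad_rows = find_short_rows(io.StringIO(sample))
print(bad_rows)
```

For the real file, an open handle such as `open('/home/aistudio/data/train.tsv', encoding='utf-8', newline='')` can be passed instead of the `StringIO` object. Any row reported here has no second field, so an access like `line[1]` on it would raise the `IndexError` shown in the traceback.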

Attachment: train.tsv.zip

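My guess is that the long text in the text column of some rows contains newline characters and ". " (a period followed by a space), and that the program treats these as row terminators. The text column is actually wrapped in double quotes (""), and a newline inside double quotes should not be treated as the end of a row. After I manually deleted the newlines inside the quoted text, the file was read correctly.

train.tsv.txt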
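The manual cleanup described above could also be scripted. The following is a minimal sketch using only the standard `csv` module rather than pandas: it reads the original CSV with quote handling enabled, so embedded newlines stay inside their field, replaces newlines and tabs in every field with a space, and writes one record per physical line. `csv_to_flat_tsv` is a hypothetical name and the sample data is invented. Whether a quote-free TSV is what the loader expects is an assumption.

```python
import csv
import io
import re

# Characters that would break a "one record per line, tab-separated" layout.
_BREAKS = re.compile(r"[\r\n\t]+")


def csv_to_flat_tsv(src, dst):
    """Copy CSV records from src to dst as TSV with no newlines inside fields."""
    reader = csv.reader(src)  # honors double quotes, so quoted newlines stay in the field
    writer = csv.writer(dst, delimiter="\t", quoting=csv.QUOTE_NONE,
                        quotechar=None, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        writer.writerow([_BREAKS.sub(" ", field) for field in row])


# Made-up sample: the first record has a newline inside its quoted text.
raw_csv = 'text,label\n"line one.\nline two",1\nplain text,6\n'
out = io.StringIO()
csv_to_flat_tsv(io.StringIO(raw_csv), out)
flat_tsv = out.getvalue()
print(flat_tsv)
```

With real files, both handles would be opened with `newline=''`, as the `csv` documentation recommends, for example reading `/home/aistudio/data/train.csv` and writing `/home/aistudio/data/train.tsv`. In pandas, a comparable cleanup would be to replace `[\r\n\t]+` in the text column with a space via `Series.str.replace(..., regex=True)` before calling `to_csv`.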

PaddlePaddle / PaddleHub, synced successfully about 1 year ago
