PaddlePaddle / PaddleHub
Opened May 10, 2020 by saxon_zh (@saxon_zh, Guest)

Newline characters cause an error

Created by: 60999

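You are welcome to report problems with PaddleHub, and thank you for contributing to PaddleHub! When describing your problem, please also provide the following information:

  • Version and environment information: 1) PaddleHub and PaddlePaddle versions: PaddleHub 1.6.2, PaddlePaddle 1.7.2 2) System environment (please describe the system type): aistudio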

Code:

    import pandas as pd

    df = pd.read_csv('/home/aistudio/data/train.csv', dtype=str, verbose=True,
                     skip_blank_lines=True, sep=',', delim_whitespace=False, nrows=10)
    print(df)
    out = pd.DataFrame(data=df)
    out.to_csv('/home/aistudio/data/train.tsv', sep='\t', index=False, encoding='utf-8')

    df = pd.read_csv('/home/aistudio/data/train.tsv', dtype=str, verbose=True,
                     skip_blank_lines=True, sep='\t', delim_whitespace=False, nrows=10)
    print(df)

The generated file is attached. Code output:

    Tokenization took: 0.05 ms
    Type conversion took: 0.17 ms
    Parser memory cleanup took: 0.00 ms
                                                    text label
    0  I just read a VERY Very similar story as mine ...     6
    1  on or about XXXX/XXXX/2015 a tree fell on our ...     1
    2  I have had a Macy 's credit card since XX/XX/X...     2
    3  Please see my Complaint # XXXX of XXXX XXXX. T...     2
    4  On XXXX/XXXX/2015 I visited the XXXX XXXX stor...     4
    5  We originally took out a mortgage loan in XXXX...     1
    6  Thru a company called XXXX XXXX XXXX, they neg...     3
    7  On XXXX/XXXX/2015, Experian sent me an email a...     4
    8  I have been dealing with Bank of America for a...     1
    9  Currently we have a mortgage with a mortgage c...     1

    Tokenization took: 0.06 ms
    Type conversion took: 0.16 ms
    Parser memory cleanup took: 0.00 ms
                                                    text label
    0  I just read a VERY Very similar story as mine ...     6
    1  on or about XXXX/XXXX/2015 a tree fell on our ...     1
    2  I have had a Macy 's credit card since XX/XX/X...     2
    3  Please see my Complaint # XXXX of XXXX XXXX. T...     2
    4  On XXXX/XXXX/2015 I visited the XXXX XXXX stor...     4
    5  We originally took out a mortgage loan in XXXX...     1
    6  Thru a company called XXXX XXXX XXXX, they neg...     3
    7  On XXXX/XXXX/2015, Experian sent me an email a...     4
    8  I have been dealing with Bank of America for a...     1
    9  Currently we have a mortgage with a mortgage c...     1

As shown, reading and writing with pandas both work without problems. Next I ran:

    import pandas as pd
    from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset
    class ThuNews(BaseNLPDataset):
        def __init__(self):
            # Dataset location
            self.dataset_dir = "/home/aistudio/data/"
            super(ThuNews, self).__init__(
                base_path=self.dataset_dir,
                train_file="train.tsv",
                dev_file="train.tsv",
                test_file="train.tsv",
                train_file_with_header=True,
                dev_file_with_header=True,
                test_file_with_header=True,
                predict_file_with_header=True)
            # Set of dataset categories
            # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])

    dataset = ThuNews()
    for e in dataset.get_train_examples()[:3]:
        print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))

  ├─train.tsv training set

  ├─dev.tsv validation set

  ├─test.tsv test set

Prediction data is stored in predict.tsv. Its format is similar to train.tsv; just drop the label column.

Error output:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-7-9f5158d6e063> in <module>
         17         # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])
         18 
    ---> 19 dataset = ThuNews()
         20 for e in dataset.get_train_examples()[:3]:
         21     print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))

    <ipython-input-7-9f5158d6e063> in __init__(self)
         13             dev_file_with_header=True,
         14             test_file_with_header=True,
    ---> 15             predict_file_with_header=True)
         16         # Set of dataset categories
         17         # label_list=['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚'])

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/base_nlp_dataset.py in __init__(self, base_path, train_file, dev_file, test_file, predict_file, label_file, label_list, train_file_with_header, dev_file_with_header, test_file_with_header, predict_file_with_header)
         49             dev_file_with_header=dev_file_with_header,
         50             test_file_with_header=test_file_with_header,
    ---> 51             predict_file_with_header=predict_file_with_header)
         52 
         53     def _read_file(self, input_file, phase=None):

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/dataset.py in __init__(self, base_path, train_file, dev_file, test_file, predict_file, label_file, label_list, train_file_with_header, dev_file_with_header, test_file_with_header, predict_file_with_header)
         92 
         93         if train_file:
    ---> 94             self._load_train_examples()
         95         if dev_file:
         96             self._load_dev_examples()

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/dataset.py in _load_train_examples(self)
        148     def _load_train_examples(self):
        149         self.train_path = os.path.join(self.base_path, self.train_file)
    --> 150         self.train_examples = self._read_file(self.train_path, phase="train")
        151 
        152     def _load_dev_examples(self):

    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlehub/dataset/base_nlp_dataset.py in _read_file(self, input_file, phase)
         69                 elif ncol == 2:
         70                     example = InputExample(
    ---> 71                         guid=i, text_a=line[0], label=line[1])
         72                 elif ncol == 3:
         73                     example = InputExample(

    IndexError: list index out of range
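The `IndexError` at `label=line[1]` suggests that at least one row was parsed with fewer than two fields. One way to locate such rows is to count the tab-separated fields on every physical line. The sketch below does that under the assumption that the loader treats each physical line as one record, which this report does not confirm. The function name `find_short_rows` and the sample text are hypothetical and only serve as illustration.

```python
def find_short_rows(tsv_text, expected_cols=2):
    """Return (line_number, fields) for physical lines with too few columns."""
    problems = []
    for lineno, line in enumerate(tsv_text.splitlines(), start=1):
        fields = line.split("\t")
        if len(fields) < expected_cols:
            problems.append((lineno, fields))
    return problems


# Hypothetical sample: the second record has a newline inside its quoted text.
sample = 'text\tlabel\n"first line.\nsecond line"\t6\nplain text\t1\n'

for lineno, fields in find_short_rows(sample):
    print(lineno, fields)
```

For a real file, the text could be obtained with `open(path, encoding="utf-8").read()` and passed to the same function.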

Attachment: train.tsv.zip

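My guess is that the long text in the text column of some rows contains newline characters and ". " (a period followed by a space), and the program treats these as row terminators. The text column is actually wrapped in double quotes (""), and a newline inside double quotes should not be treated as the end of a row. After I manually deleted the newlines inside the quoted text, the file was read correctly. train.tsv.txt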
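The manual cleanup described above could also be scripted. The following is a minimal sketch, not PaddleHub code: it reads the TSV with Python's quote-aware `csv` module, collapses whitespace runs (including embedded newlines) inside each field into a single space, and writes one record per physical line. The helper name `flatten_tsv` and the sample text are hypothetical.

```python
import csv
import io
import re


def flatten_tsv(src_text):
    """Rewrite TSV text so that every record occupies one physical line."""
    # csv.reader honors double quotes, so a quoted newline stays in its field.
    reader = csv.reader(io.StringIO(src_text, newline=""), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Collapse any run of whitespace (newlines, tabs, spaces) to one space.
        writer.writerow([re.sub(r"\s+", " ", field).strip() for field in row])
    return out.getvalue()


# Hypothetical sample: one record with a newline inside its quoted text.
sample = 'text\tlabel\n"first line.\nsecond line"\t6\n'
flat = flatten_tsv(sample)
print(flat)
```

After this step every physical line of the sample splits into exactly two tab-separated fields. Note that the writer still wraps a field in double quotes if the field itself contains a double quote, so such rows may need separate handling if the downstream reader ignores quoting.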

Reference: paddlepaddle/PaddleHub#581