Simplify Quick Start !112

!112 Closed Sep 24, 2016, created by saxon_zh@saxon_zh

Created by: wangkuiyi

Fixes https://github.com/baidu/Paddle/issues/111

Motivation

This PR exists because it took me more than 12 hours to run preprocess.sh in a VM on my MacBook Pro. I checked with Yi Yang, who can run it in a few minutes on his powerful CPU & GPU desktop. But the purpose of a Quick Start is to be quick to start, so that potential clients can realistically feel the convenience Paddle brings. Hence this PR.

Comparison

The time cost is primarily due to the fact that the current approach downloads the full Amazon Reviews dataset, which is ~500MB gzipped and ~1.5GB unzipped. Processing the whole dataset also takes considerable time. So this PR's primary target is to download only part of the dataset. Compared with the existing approach,

  1. this PR uses a ~100-line Python script, process_data.py, to replace data/get_data.sh, preprocess.py, and preprocess.sh, which add up to ~300 lines of code;
  2. after a short discussion with @emailweixu , we decided to use space-delimited word segmentation in place of the Moses word segmenter, so there is no need to download the Moses segmenter;
  3. process_data.py can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which frees it from reading the whole dataset (see the sketch after this list);
  4. the new script doesn't use shuf, which exists on Linux but not on Mac OS X, so the new script works on both Linux and Mac OS X.
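
For illustration, here is a minimal Python 3 sketch of that streaming idea; the names open_stream and read_instances are hypothetical, not necessarily what process_data.py defines:

import gzip
import json
from urllib.request import urlopen

def open_stream(source):
    # Open either the HTTP URL or a local .json.gz copy as a gzip stream.
    if source.startswith("http://") or source.startswith("https://"):
        raw = urlopen(source)          # file-like HTTP response
    else:
        raw = open(source, "rb")
    return gzip.GzipFile(fileobj=raw)  # decompress on the fly, no temp file

def read_instances(source, n):
    # Yield at most n parsed reviews (one JSON object per line), so the
    # full ~1.5GB dataset never has to be downloaded or scanned.
    for count, line in enumerate(open_stream(source)):
        if count >= n:
            break
        yield json.loads(line)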

Usage

If we get this PR merged, the initialization steps described in the Quick Start guide would change from

cd demo/quick_start
./data/get_data.sh
./preprocess.sh

to

cd demo/quick_start
python ./process_data.py

Details

The ./process_data.py command above reads directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz, consuming JSON objects until it can generate {train,test,pred}.txt, which together add up to 100 instances, the default dataset size.
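
To make the default concrete, here is a minimal sketch of the splitting step, reusing read_instances from the sketch above. The 80/10/10 split ratio and the reviewText field are assumptions for illustration; the PR does not specify either:

URL = ("http://snap.stanford.edu/data/amazon/productGraph/"
       "categoryFiles/reviews_Electronics_5.json.gz")

n = 100                                   # default total dataset size
instances = list(read_instances(URL, n))  # stops after n JSON objects

n_train, n_test = int(n * 0.8), int(n * 0.1)  # assumed ratios
splits = {
    "train.txt": instances[:n_train],
    "test.txt": instances[n_train:n_train + n_test],
    "pred.txt": instances[n_train + n_test:],
}
for name, part in splits.items():
    with open(name, "w") as f:
        for review in part:
            f.write(review["reviewText"] + "\n")  # assumed field name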

If we are going to generate a bigger dataset, say 1000 instances in total, we can run

python ./process_data.py -n 1000

Or, if we have already downloaded the reviews_Electronics_5.json.gz file, we can run

python ./process_data.py ~/Download/reviews_Electronics_5.json.gz

An additional command-line parameter, -t, caps the size of the dictionary. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can do

python ./process_data.py -n 1000 -t 1999 ~/Download/reviews_Electronics_5.json.gz
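
The dictionary cap can be pictured as keeping only the most frequent words. A minimal sketch, assuming the space-delimited segmentation decided above; build_dict is a hypothetical helper, not necessarily what process_data.py defines:

from collections import Counter

def build_dict(texts, cap):
    # Count space-delimited tokens and keep only the `cap` most frequent,
    # so the dictionary never grows beyond the -t limit (e.g. 1999).
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    return {word: i for i, (word, _) in enumerate(counter.most_common(cap))}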
Reference: paddlepaddle/Paddle!112
Source branch: github/fork/wangkuiyi/simplify_quick_start