weixin_51232023 / models (forked from PaddlePaddle / models)
Commit 44786e9c (unverified)
Authored Feb 02, 2019 by qingqing01; committed by GitHub on Feb 02, 2019
Revert "Add build_raw_data.py" (#1739)
Parent: ef06ad53
Showing 1 changed file with 0 additions and 62 deletions
fluid/PaddleNLP/text_classification/async_executor/data_generator/build_raw_data.py (deleted, mode 100644 → 0)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Build raw data
"""
from __future__ import print_function
import sys
import os
import random
import re
data_type = sys.argv[1]

if not (data_type == "train" or data_type == "test"):
    print("python %s [test/train]" % sys.argv[0], file=sys.stderr)
    sys.exit(-1)

pos_folder = "aclImdb/" + data_type + "/pos/"
neg_folder = "aclImdb/" + data_type + "/neg/"

pos_train_list = [(pos_folder + x, "1") for x in os.listdir(pos_folder)]
neg_train_list = [(neg_folder + x, "0") for x in os.listdir(neg_folder)]

all_train_list = pos_train_list + neg_train_list
random.shuffle(all_train_list)


def load_dict(dictfile):
    """
    Load word id dict
    """
    vocab = {}
    wid = 0
    with open(dictfile) as f:
        for line in f:
            vocab[line.strip()] = str(wid)
            wid += 1
    return vocab


vocab = load_dict("aclImdb/imdb.vocab")
unk_id = str(len(vocab))
print("vocab size: ", len(vocab), file=sys.stderr)
pattern = re.compile(r'(;|,|\.|\?|!|\s|\(|\))')

for fitem in all_train_list:
    label = str(fitem[1])
    fname = fitem[0]
    with open(fname) as f:
        sent = f.readline().lower().replace("<br />", " ").strip()
        out_s = "%s | %s" % (sent, label)
        print(out_s, file=sys.stdout)
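
For reference, the per-file transformation in the final loop of the removed script can be isolated as a small function. This is only a sketch: the name `format_review` is hypothetical and does not appear in the original file. It mirrors the lowercasing, `<br />` replacement, stripping, and `"%s | %s"` output format, but takes the text as an argument instead of reading the first line of a file.

```python
from __future__ import print_function


def format_review(text, label):
    """Normalize one review line and append its label.

    Hypothetical helper (not part of the removed file): applies the same
    steps as the script's final loop to a string passed in directly.
    """
    # Lowercase, turn HTML line breaks into spaces, trim surrounding whitespace.
    sent = text.lower().replace("<br />", " ").strip()
    # Same "sentence | label" layout the script printed to stdout.
    return "%s | %s" % (sent, label)


if __name__ == "__main__":
    print(format_review("Great movie!<br />Loved it.\n", "1"))
```

Keeping the normalization in a pure function like this makes the output format checkable without the aclImdb directory on disk.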