!980 add example: aclImdb data to mindrecord

Merge pull request !980 from guozhijian/add_example_for_aclImdb

93429aba · mindspore-ci-bot · Gitee · 859bf8be · 0ef5ff1b · 93429aba
8 changed files
--- a/example/cv_to_mindrecord/README.md
+++ b/example/cv_to_mindrecord/README.md
+## CV dataset to MindRecord
--- a/example/nlp_to_mindrecord/aclImdb/README.md
+++ b/example/nlp_to_mindrecord/aclImdb/README.md
+# Guideline to Transfer Large Movie Review Dataset - aclImdb to MindRecord
+
+<!-- TOC -->
+
+- [What does the example do](#what-does-the-example-do)
+- [How to use the example to generate MindRecord](#how-to-use-the-example-to-generate-mindrecord)
+    - [Download aclImdb dataset and unzip](#download-aclimdb-dataset-and-unzip)
+    - [Generate MindRecord](#generate-mindrecord)
+    - [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
+
+
+<!-- /TOC -->
+
+## What does the example do
+
+This example reads the aclImdb dataset and writes it to MindRecord files. It converts the data as-is, without any preprocessing. You can modify the example or use it as a template for your own conversion.
+
+1.  run.sh: entry script for generating MindRecord.
+    - gen_mindrecord.py: read the aclImdb data and transfer it to mindrecord.
+2.  run_read.sh: entry script for creating a MindDataset from MindRecord.
+    - create_dataset.py: use MindDataset to read the MindRecord files and build a dataset.
+
+## How to use the example to generate MindRecord
+
+Download the aclImdb dataset, transfer it to mindrecord, then use MindDataset to read the mindrecord.
+
+### Download aclImdb dataset and unzip
+
+1. Download the dataset archive.
+    > [aclImdb dataset download address](http://ai.stanford.edu/~amaas/data/sentiment/) **-> Large Movie Review Dataset v1.0**
+
+2. Extract the archive to the directory example/nlp_to_mindrecord/aclImdb/data.
+    ```
+    tar -zxvf aclImdb_v1.tar.gz -C {your-mindspore}/example/nlp_to_mindrecord/aclImdb/data/
+    ```
+
+### Generate MindRecord
+
+1. Run the run.sh script.
+    ```bash
+    bash run.sh
+    ```
+
+2. The output looks like this:
+    ```
+    ...
+    >> begin generate mindrecord by train data
+    ...
+    [INFO] ME(20928,python):2020-05-07-23:02:40.066.546 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 256 records successfully.
+    >> transformed 24320 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:40.078.344 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 256 records successfully.
+    >> transformed 24576 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:40.090.237 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 256 records successfully.
+    >> transformed 24832 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:40.098.785 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 168 records successfully.
+    >> transformed 25000 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:40.098.957 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:214] Commit] Write metadata successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:40.099.302 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:45] Build] Init header from mindrecord file for index successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:40.122.271 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:586] DatabaseWriter] Init index db for shard: 0 successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:40.932.360 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:535] ExecuteTransaction] Insert 24596 rows to index db.
+    [INFO] ME(20928,python):2020-05-07-23:02:40.953.177 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:535] ExecuteTransaction] Insert 404 rows to index db.
+    [INFO] ME(20928,python):2020-05-07-23:02:40.963.400 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:606] DatabaseWriter] Generate index db for shard: 0 successfully.
+    [INFO] ME(20928:139630558652224,MainProcess):2020-05-07-23:02:40.964.973 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/aclImdb_train.mindrecord'], and the list of index files are: ['output/aclImdb_train.mindrecord.db']
+    >> begin generate mindrecord by test data
+    ...
+    >> transformed 24576 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:42.120.007 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 256 records successfully.
+    >> transformed 24832 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:42.128.862 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 168 records successfully.
+    >> transformed 25000 record...
+    [INFO] ME(20928,python):2020-05-07-23:02:42.129.024 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:214] Commit] Write metadata successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:42.129.362 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:45] Build] Init header from mindrecord file for index successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:42.151.237 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:586] DatabaseWriter] Init index db for shard: 0 successfully.
+    [INFO] ME(20928,python):2020-05-07-23:02:42.935.496 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:535] ExecuteTransaction] Insert 25000 rows to index db.
+    [INFO] ME(20928,python):2020-05-07-23:02:42.949.319 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:606] DatabaseWriter] Generate index db for shard: 0 successfully.
+    [INFO] ME(20928:139630558652224,MainProcess):2020-05-07-23:02:42.951.794 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/aclImdb_test.mindrecord'], and the list of index files are: ['output/aclImdb_test.mindrecord.db']
+    ```
+
+3. Check the generated mindrecord files:
+    ```
+    $ ls output/
+    aclImdb_test.mindrecord  aclImdb_test.mindrecord.db  aclImdb_train.mindrecord  aclImdb_train.mindrecord.db  README.md
+    ```
+
+### Create MindDataset By MindRecord
+
+1. Run the run_read.sh script.
+    ```bash
+    bash run_read.sh
+    ```
+
+2. The output looks like this:
+    > Caution: the "review" field is a string, but it is displayed as a uint8 array in the output.
+    ```
+    ...
+    example 2056: {'label': array(1, dtype=int32), 'score': array(4, dtype=int32), 'id': array(5871, dtype=int32), 'review': array([ 70, 111, 114, ..., 111, 110,  46], dtype=uint8)}
+    example 2057: {'label': array(1, dtype=int32), 'score': array(1, dtype=int32), 'id': array(6092, dtype=int32), 'review': array([ 83, 111, 109, ..., 115, 101,  46], dtype=uint8)}
+    example 2058: {'label': array(1, dtype=int32), 'score': array(4, dtype=int32), 'id': array(1357, dtype=int32), 'review': array([ 42, 109,  97, ...,  58,  32,  67], dtype=uint8)}
+    ...
+    ```
+    - id : the id "3219" comes from the review file name, e.g. **3219**_10.txt.
+    - label : indicates whether the review is positive or negative (positive: 0, negative: 1).
+    - score : the score "10" comes from the review file name, e.g. 3219_**10**.txt.
+    - review : the content of the review file.
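The field descriptions above can be sketched in plain Python. This is a minimal illustration and not part of the example scripts: `parse_review_filename` and `decode_review` are hypothetical helper names, and decoding the bytes as UTF-8 is an assumption about how the string field was serialized.

```python
def parse_review_filename(file_name):
    """Split a name like '3219_10.txt' into (id, score)."""
    stem = file_name.rsplit(".", 1)[0]      # '3219_10'
    review_id, score = stem.split("_", 1)   # '3219', '10'
    return int(review_id), int(score)


def decode_review(values):
    """Turn a sequence of byte values (0-255) back into text.

    Assumption: the bytes are UTF-8 encoded. Invalid sequences are
    replaced instead of raising an error.
    """
    return bytes(values).decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(parse_review_filename("3219_10.txt"))  # (3219, 10)
    print(decode_review([70, 111, 114]))         # For
```

The values 70, 111, 114 are the first three bytes shown for example 2056 above. For a NumPy uint8 array such as `item['review']`, passing `item['review'].tobytes()` to `.decode(...)` should give the same result, assuming the same encoding.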
--- a/example/nlp_to_mindrecord/aclImdb/create_dataset.py
+++ b/example/nlp_to_mindrecord/aclImdb/create_dataset.py
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""create MindDataset by MindRecord"""
+import mindspore.dataset as ds
+
+def create_dataset(data_file):
+    """create MindDataset"""
+    num_readers = 4
+    data_set = ds.MindDataset(dataset_file=data_file,
+                              num_parallel_workers=num_readers,
+                              shuffle=True)
+    index = 0
+    for item in data_set.create_dict_iterator():
+        print("example {}: {}".format(index, item))
+        index += 1
+        if index % 1000 == 0:
+            print(">> read rows: {}".format(index))
+    print(">> total rows: {}".format(index))
+
+if __name__ == '__main__':
+    create_dataset('output/aclImdb_train.mindrecord')
+    create_dataset('output/aclImdb_test.mindrecord')
--- a/example/nlp_to_mindrecord/aclImdb/data/README.md
+++ b/example/nlp_to_mindrecord/aclImdb/data/README.md
+## The input dataset
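gen_mindrecord.py reads from `data/aclImdb/<train|test>/<pos|neg>`, so it can help to confirm that layout after extracting the archive. The check below is a hedged sketch: `missing_input_dirs` is a hypothetical helper that is not part of this change, and it only tests that the directories exist, not what they contain.

```python
import os


def missing_input_dirs(root="data/aclImdb"):
    """Return the expected sub-directories that are absent under root."""
    expected = [os.path.join(root, split, polarity)
                for split in ("train", "test")
                for polarity in ("pos", "neg")]
    return [path for path in expected if not os.path.isdir(path)]


if __name__ == "__main__":
    missing = missing_input_dirs()
    if missing:
        print("missing directories:", missing)
    else:
        print("input layout looks complete")
```

Running it from the aclImdb example directory before `bash run.sh` would surface a wrong extraction path earlier than the `IOError` raised in gen_mindrecord.py.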
--- a/example/nlp_to_mindrecord/aclImdb/gen_mindrecord.py
+++ b/example/nlp_to_mindrecord/aclImdb/gen_mindrecord.py
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""get data from aclImdb and write the data to mindrecord file"""
+import os
+from mindspore.mindrecord import FileWriter
+
+ACLIMDB_DIR = "data/aclImdb"
+
+MINDRECORD_FILE_NAME_TRAIN = "output/aclImdb_train.mindrecord"
+MINDRECORD_FILE_NAME_TEST = "output/aclImdb_test.mindrecord"
+
+def get_data_as_dict(data_dir):
+    """get data from dir like aclImdb/train"""
+    dir_list = [os.path.join(data_dir, "pos"),
+                os.path.join(data_dir, "neg")]
+
+    for index, exact_dir in enumerate(dir_list):
+        if not os.path.exists(exact_dir):
+            raise IOError("dir {} does not exist".format(exact_dir))
+
+        for item in os.listdir(exact_dir):
+            data = {}
+            data["label"] = int(index)    # indicate pos: 0, neg: 1
+
+            # file name like 4372_2.txt, we will get id: 4372, score: 2
+            id_score = item.split("_", 1)
+            score = id_score[1].split(".", 1)
+            data["id"] = int(id_score[0])
+            data["score"] = int(score[0])
+
+            # the context manager closes the file even if read() raises
+            with open(os.path.join(exact_dir, item), "r") as review_file:
+                review = review_file.read()
+            data["review"] = str(review)
+            yield data
+
+def gen_mindrecord(data_type):
+    """gen mindrecord according to the exact schema"""
+    if data_type == "train":
+        fw = FileWriter(MINDRECORD_FILE_NAME_TRAIN)
+    else:
+        fw = FileWriter(MINDRECORD_FILE_NAME_TEST)
+
+    schema = {"id": {"type": "int32"},
+              "label": {"type": "int32"},
+              "score": {"type": "int32"},
+              "review": {"type": "string"}}
+    fw.add_schema(schema, "aclImdb dataset")
+    fw.add_index(["id", "label", "score"])
+
+    get_data_iter = get_data_as_dict(os.path.join(ACLIMDB_DIR, data_type))
+
+    batch_size = 256
+    transform_count = 0
+    while True:
+        data_list = []
+        try:
+            for _ in range(batch_size):
+                data_list.append(next(get_data_iter))
+                transform_count += 1
+            fw.write_raw_data(data_list)
+            print(">> transformed {} record...".format(transform_count))
+        except StopIteration:
+            if data_list:
+                fw.write_raw_data(data_list)
+                print(">> transformed {} record...".format(transform_count))
+            break
+
+    fw.commit()
+
+def main():
+    # generate mindrecord for train
+    print(">> begin generate mindrecord by train data")
+    gen_mindrecord("train")
+
+    # generate mindrecord for test
+    print(">> begin generate mindrecord by test data")
+    gen_mindrecord("test")
+
+if __name__ == "__main__":
+    main()
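The write loop in `gen_mindrecord` groups records into batches of 256 and flushes the remainder when the generator is exhausted. The same batching pattern can be sketched on its own with `itertools.islice`. This is an alternative illustration, not the code in this change, and `iter_batches` is a hypothetical name.

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items until the input is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # 25000 records in batches of 256: 97 full batches plus one of 168
    sizes = [len(batch) for batch in iter_batches(range(25000), 256)]
    print(len(sizes), sizes[-1])  # 98 168
```

The final batch of 168 matches the "Write 168 records successfully." line in the README log, since 25000 = 97 * 256 + 168. With such a helper, the write loop could reduce to one `fw.write_raw_data(batch)` call per batch, without the explicit `StopIteration` handling.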
--- a/example/nlp_to_mindrecord/aclImdb/output/README.md
+++ b/example/nlp_to_mindrecord/aclImdb/output/README.md
+## Output the mindrecord
--- a/example/nlp_to_mindrecord/aclImdb/run.sh
+++ b/example/nlp_to_mindrecord/aclImdb/run.sh
+#!/bin/bash
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+rm -f output/aclImdb_train.mindrecord*
+rm -f output/aclImdb_test.mindrecord*
+
+python gen_mindrecord.py
--- a/example/nlp_to_mindrecord/aclImdb/run_read.sh
+++ b/example/nlp_to_mindrecord/aclImdb/run_read.sh
+#!/bin/bash
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+python create_dataset.py