# BERT-pytorch

### Introduction:

The model can be trained with `train_demo.py`.    
  There are two demo datasets: a JSON file of poems and a small conversation dataset I created myself.    
  However, I do not recommend training on these demos; a proper, full-sized dataset is preferable.   
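The poem demo is a chinese-poetry JSON file. As an assumption about its schema (a list of records whose `content` field holds the poem lines; the real files come from the upstream dataset), loading one reduces to something like this minimal sketch:

```python
import json

# Tiny stand-in for a chinese-poetry JSON file; the record layout
# (a list of dicts with a "content" list of lines) is an assumption
# based on the upstream dataset, not this repository's code.
sample = '[{"title": "demo", "content": ["line one", "line two"]}]'
records = json.loads(sample)

# Flatten every record's content lines into one sentence list.
sentences = [line for rec in records for line in rec["content"]]
```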

The fine-tuning method can be found in `Bert_finetune.py`; fine-tuning BERT mainly covers two examples.   
  The first is word-level classification, found in ["bert_for_word_classify.py"](https://codechina.csdn.net/captainAAAjohn/BERT-pytorch/-/blob/main/bert_for_word_classify.py)    
  The second is sentence-level classification, found in ["bert_for_sentence_classify.py"](https://codechina.csdn.net/captainAAAjohn/BERT-pytorch/-/blob/main/bert_for_sentence_classify.py)   

Next, I will enrich the language-generation and conversation features.    

# How to use

## Preparation
### On Ubuntu, open a terminal and run the following commands:

    sudo apt-get install ipython3
    sudo apt-get install python3-pip
    sudo apt-get install git
    git clone https://github.com/stevezhangz/BERT-pytorch.git
    cd BERT-pytorch
    pip install -r requirements.txt
   
### On Windows, if you use a Python IDE such as PyCharm, open its built-in terminal and run the commands below. Note that Git is not a pip package; install it from git-scm.com first.

    pip install ipython
    git clone https://github.com/stevezhangz/BERT-pytorch.git
    cd BERT-pytorch
    pip install -r requirements.txt

If you use Anaconda3, open the "Anaconda Prompt" and run the following commands:

    conda install pip
    conda install git
    conda install ipython
    git clone https://github.com/stevezhangz/BERT-pytorch.git
    cd BERT-pytorch
    pip install -r requirements.txt 
    
A training demo is provided (you can select the poem or conversation dataset in the source code).    
Run `train_demo.py` to start training:
  
    ipython3 train_demo.py

### Beyond that, here is how to run it on your own dataset

  - First, use `general_transform_text2list` in data_process.py to convert a txt or json file into a list of sentences, i.e. `[s1, s2, s3, s4, ...]`
  - Then use `generate_vocab_normalway` in data_process.py to turn that list into `sentences, id_sentence, idx2word, word2idx, vocab_size`
  - Next, use `creat_batch` in data_process.py to build training batches from `sentences, id_sentence, idx2word, word2idx, vocab_size`
  - Finally, use PyTorch's `DataLoader` to load the data.
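To make the vocabulary step concrete, here is a dependency-free sketch of roughly what such helpers produce (the special tokens and their ids are illustrative assumptions, not the repository's exact implementation):

```python
# Toy corpus standing in for the sentence list produced in the first step.
sentences = ["hello world", "hello bert"]

# Collect the unique words (sorted for determinism).
words = sorted({w for s in sentences for w in s.split()})

# Reserve ids for special tokens first; the exact set and order here
# is an assumption for illustration.
word2idx = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3}
for w in words:
    word2idx[w] = len(word2idx)

idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(word2idx)

# Map each sentence to its token-id sequence.
id_sentence = [[word2idx[w] for w in s.split()] for s in sentences]
```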

### For example:

    np.random.seed(random_seed)
    # json2list = general_transform_text2list("data/demo.txt", type="txt")
    json2list = general_transform_text2list("data/chinese-poetry/chuci/chuci.json", type="json", args=['content'])
    data = json2list.getdata()
    list2token = generate_vocab_normalway(data, map_dir="words_info.json")
    sentences, token_list, idx2word, word2idx, vocab_size = list2token.transform()
    batch = creat_batch(batch_size, max_pred, maxlen, word2idx, idx2word, token_list, 0.15)
    loader = Data.DataLoader(Text_file(batch), batch_size, True)
    model = Bert(n_layers=n_layers,
                 vocab_size=vocab_size,
                 emb_size=d_model,
                 max_len=maxlen,
                 seg_size=n_segments,
                 dff=d_ff,
                 dk=d_k,
                 dv=d_v,
                 n_head=n_heads,
                 n_class=2,
                 drop=drop)

    if use_gpu:
        with torch.cuda.device(device) as device:
            model.to(device)
            criterion = nn.CrossEntropyLoss()
            optimizer = optim.Adadelta(model.parameters(), lr=lr)
            model.Train(epoches=epoches,
                        train_data_loader=loader,
                        optimizer=optimizer,
                        criterion=criterion,
                        save_dir=weight_dir,
                        save_freq=100,
                        load_dir="checkpoint/checkpoint_199.pth",
                        use_gpu=use_gpu,
                        device=device)
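The last argument to `creat_batch` (0.15) is the standard BERT masked-language-model ratio. A minimal standalone sketch of that masking step, with illustrative names that are not the repository's implementation:

```python
import random

random.seed(0)

def mask_tokens(token_ids, mask_id, ratio=0.15):
    """Replace ~ratio of positions with [MASK]; keep originals for the MLM loss."""
    ids = list(token_ids)
    n_mask = max(1, int(round(len(ids) * ratio)))
    positions = random.sample(range(len(ids)), n_mask)
    targets = []
    for pos in positions:
        targets.append((pos, ids[pos]))  # (position, original id) to predict later
        ids[pos] = mask_id
    return ids, targets

# 10 toy token ids; with ratio=0.15 this masks 2 positions.
masked, targets = mask_tokens([5, 9, 12, 7, 3, 8, 11, 6, 10, 4], mask_id=103)
```

(Full BERT additionally replaces some chosen positions with random tokens or leaves them unchanged; this sketch shows only the core masking.)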


# How to config
Modify hyperparameters directly in `Config.cfg`.
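If `Config.cfg` follows the usual INI layout, it can be read with the standard library's `configparser`; the section and key names below are hypothetical, so check them against the actual file:

```python
import configparser

# Hypothetical Config.cfg contents; the real section/key names
# depend on the repository's file.
cfg_text = """
[model]
n_layers = 6
d_model = 768
maxlen = 512

[train]
lr = 0.001
epoches = 200
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)  # use cfg.read("Config.cfg") for the real file

n_layers = cfg.getint("model", "n_layers")
lr = cfg.getfloat("train", "lr")
```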

# About fine-tuning
To verify that the trained BERT has actually learned something from the training data, it must be fine-tuned on a dataset different from the original one. We provide two examples: the first is sentence classification (the class labels themselves carry no particular meaning, since BERT pre-training is self-supervised and uses no per-sentence classification labels); the second is word prediction within a specific sentence.
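Sentence classification in this setting works by attaching a linear head to BERT's pooled `[CLS]` representation. A dependency-free sketch of that head (vector sizes and weights are toy values, not the repository's):

```python
def classify(cls_vector, weights, bias):
    """Linear head over the pooled [CLS] vector: argmax of W @ h + b."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy 4-dim [CLS] vector and a 2-class head.
h = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.3, 0.2, 0.1]]
b = [0.0, 0.1]
pred = classify(h, W, b)
```

During fine-tuning only this head (and optionally the encoder) is trained with a cross-entropy loss on the labeled sentences.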
Next, we will enrich the language-generation and conversation features.

- sentence classification:

      ipython3 bert_for_sentence_classify.py

- word prediction:

      ipython3 bert_for_word_classify.py


# Pretrain
Due to time constraints, I have not been able to train the model fully myself. You are welcome to train it and contribute pretrained weights to this project.

# About me
E-mail: stevezhangz@163.com

# Acknowledgement
Thanks to the open-source [poem dataset](https://github.com/chinese-poetry/chinese-poetry), and to the [project named nlp-tutorial](https://codechina.csdn.net/mirrors/wmathor/nlp-tutorial/-/tree/master/5-2.BERT), a small amount of whose code inspired this project.