# BERT-pytorch

A PyTorch reimplementation of BERT, without pre-trained weights.  
The model can now be trained with the masked-token (word2idx) objective.  
I am currently focusing on conversation training.
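
For context, masked-token training hides a fraction of the input ids and asks the model to recover them. A minimal sketch of the idea (illustrative only, not this repo's exact code; `mask_id=3` is an arbitrary placeholder):

    import random
    import torch

    def mask_tokens(token_ids, mask_id, mask_ratio=0.15):
        ids = token_ids.clone()
        n_mask = max(1, int(len(ids) * mask_ratio))   # BERT masks ~15% of tokens
        positions = random.sample(range(len(ids)), n_mask)
        targets = ids[positions].clone()              # the ids the model must recover
        ids[positions] = mask_id                      # overwrite them with the [MASK] id
        return ids, torch.tensor(positions), targets

    # the model is fed masked_ids; the loss compares its predictions at
    # `positions` against `targets`
    masked_ids, positions, targets = mask_tokens(torch.tensor([5, 17, 42, 8, 99]), mask_id=3)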

# How to use

Bash commands (preparation):

    sudo apt-get install ipython3
    sudo apt-get install python3-pip
    sudo apt-get install git
    git clone https://github.com/stevezhangz/BERT-pytorch.git
    cd BERT-pytorch
    pip install -r requirements.txt
    
A training demo is provided (you can select poem or conversation data in the source code).
Run train_demo.py to train:
  
    ipython3 train_demo.py

Beyond the demo, here is how to run it on your own dataset:

  - First, use `general_transform_text2list` in data_process.py to transform a txt or json file into a list of sentences, i.e. `[s1, s2, s3, s4, ...]`.
  - Then use `generate_vocab_normalway` in data_process.py to transform that list into `sentences, id_sentence, idx2word, word2idx, vocab_size`.
  - Last but not least, use `creat_batch` in data_process.py to turn `sentences, id_sentence, idx2word, word2idx, vocab_size` into batches.
  - Finally, use a PyTorch `DataLoader` to load the data, as in the example below.

For example:

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torch.utils.data as Data
    from data_process import general_transform_text2list, generate_vocab_normalway, creat_batch
    # Bert and Text_file are the model and dataset classes defined in this repo;
    # hyperparameters (batch_size, maxlen, lr, ...) are read from Config.cfg

    # read the json corpus into a list of sentences
    json2list = general_transform_text2list("data/chinese-poetry/chuci/chuci.json", type="json", args=['content'])
    data = json2list.getdata()
    # build the vocabulary and map sentences to token ids
    list2token = generate_vocab_normalway(data, map_dir="words_info.json")
    sentences, token_list, idx2word, word2idx, vocab_size = list2token.transform()
    # create masked-LM / next-sentence batches
    batch = creat_batch(batch_size, max_pred, maxlen, vocab_size, word2idx, token_list, sentences)
    input_ids, segment_ids, masked_tokens, masked_pos, isNext = zip(*batch)
    input_ids, segment_ids, masked_tokens, masked_pos, isNext = \
        torch.LongTensor(input_ids), torch.LongTensor(segment_ids), torch.LongTensor(masked_tokens), \
        torch.LongTensor(masked_pos), torch.LongTensor(isNext)
    loader = Data.DataLoader(Text_file(input_ids, segment_ids, masked_tokens, masked_pos, isNext), batch_size, True)
    model = Bert(n_layers=n_layers,
                 vocab_size=vocab_size,
                 emb_size=d_model,
                 max_len=maxlen,
                 seg_size=n_segments,
                 dff=d_ff,
                 dk=d_k,
                 dv=d_v,
                 n_head=n_heads,
                 n_class=2)
    if use_gpu:
        with torch.cuda.device(device):  # the context manager returns None, so don't rebind `device`
            model = model.to(device)
            criterion = nn.CrossEntropyLoss()
            optimizer = optim.Adadelta(model.parameters(), lr=lr)
            model.Train(epoches=epoches,
                        train_data_loader=loader,
                        optimizer=optimizer,
                        criterion=criterion,
                        save_dir=weight_dir,
                        save_freq=100,
                        load_dir="checkpoint/checkpoint_199.pth",
                        use_gpu=use_gpu,
                        device=device)
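
If your corpus is a plain txt file rather than json, the step list above suggests the same entry point applies. A hedged sketch, assuming `general_transform_text2list` accepts `type="txt"` (check data_process.py for the exact signature; the path is a placeholder):

    # hypothetical path and arguments; txt input may not need `args`
    txt2list = general_transform_text2list("data/my_corpus.txt", type="txt")
    data = txt2list.getdata()
    # from here the pipeline is identical to the json example above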


# How to configure
Modify hyperparameters directly in `Config.cfg`.
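
For reference, a plain cfg file like this can be read with Python's standard `configparser`. A minimal sketch, with illustrative section and key names only (match them to the ones actually used in the repo's Config.cfg):

    import configparser

    cfg = configparser.ConfigParser()
    cfg.read("Config.cfg")
    # illustrative keys; the real names live in Config.cfg
    n_layers = cfg.getint("model", "n_layers")
    d_model = cfg.getint("model", "d_model")
    lr = cfg.getfloat("train", "lr")
    epoches = cfg.getint("train", "epoches")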

# Pretrain
Due to time constraints, I haven't been able to train the model myself. You are welcome to train it and contribute pre-trained weights to this project.

# About me
author = {
  E-mail: stevezhangz@163.com
}

# Acknowledgement
Thanks to the open-source [poem dataset](https://github.com/chinese-poetry/chinese-poetry) and to this [project](https://codechina.csdn.net/mirrors/wmathor/nlp-tutorial/-/tree/master/5-2.BERT), which inspired some of the code.