The official vocab.txt raises UnicodeDecodeError
Created by: treper
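I want to use ernie_encoder.py to extract sentence embeddings. I downloaded the files linked in the README, but reading vocab.txt fails with an encoding error. Is the file wrong?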
~/workspace/nlp/ernie/ERNIE/ERNIE]$ python -u ernie_encoder.py --use_cuda false --batch_size 32 --output_dir ~/workspace/nlp/ernie/test --init_pretraining_params ~/workspace/nlp/ernie/params --data_set ~/workspace/nlp/ernie/sm.tsv --vocab_path ~/workspace/nlp/ernie/vocab.txt --max_seq_len 128 --ernie_config_path ~/workspace/nlp/ernie/ernie_config.json
----------- Configuration Arguments -----------
batch_size: 32
data_set: ~/workspace/nlp/ernie/sm.tsv
do_lower_case: True
ernie_config_path: ~/workspace/nlp/ernie/ernie_config.json
init_pretraining_params: ~/workspace/nlp/ernie/params
max_seq_len: 128
output_dir: ~/workspace/nlp/ernie/test
use_cuda: False
vocab_path: ~/workspace/nlp/ernie/vocab.txt
------------------------------------------------
attention_probs_dropout_prob: 0.1
hidden_act: relu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
max_position_embeddings: 513
num_attention_heads: 12
num_hidden_layers: 12
type_vocab_size: 2
vocab_size: 18000
------------------------------------------------
Traceback (most recent call last):
  File "ernie_encoder.py", line 182, in <module>
    main(args)
  File "ernie_encoder.py", line 108, in main
    do_lower_case=args.do_lower_case)
  File "~/workspace/nlp/ernie/ERNIE/ERNIE/reader/task_reader.py", line 35, in __init__
    vocab_file=vocab_path, do_lower_case=do_lower_case)
  File "~/workspace/nlp/ernie/ERNIE/ERNIE/tokenization.py", line 113, in __init__
    self.vocab = load_vocab(vocab_file)
  File "~/workspace/nlp/ernie/ERNIE/ERNIE/tokenization.py", line 73, in load_vocab
    for num, line in enumerate(fin):
  File "~/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 33: ordinal not in range(128)
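
The traceback ends in encodings/ascii.py, which suggests the file was opened in text mode without an explicit encoding, so Python 3.6 fell back to the locale's default (ASCII here) and stopped at the first non-ASCII byte (0xef). The file itself may well be fine. Below is a minimal sketch of that failure and one possible workaround. It assumes vocab.txt is UTF-8 with one token per line, optionally followed by a tab and an id. `load_vocab_utf8` is a hypothetical stand-in, not the actual `load_vocab` from tokenization.py, and the sample tokens are made up for illustration.

```python
import io
import os
import tempfile


def load_vocab_utf8(vocab_file):
    """Hypothetical loader: one token per line, optionally followed by a tab and an id."""
    vocab = {}
    # An explicit encoding makes the result independent of the shell locale.
    with io.open(vocab_file, encoding="utf-8") as fin:
        for num, line in enumerate(fin):
            items = line.rstrip("\n").split("\t")
            vocab[items[0]] = int(items[1]) if len(items) == 2 else num
    return vocab


# Made-up stand-in for vocab.txt with one non-ASCII token
# (full-width "!", U+FF01, whose UTF-8 bytes start with 0xef).
sample = u"[PAD]\t0\n[UNK]\t1\n\uff01\t2\n"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Forcing ASCII mimics what a plain open() does under a C/POSIX locale.
try:
    with io.open(path, encoding="ascii") as fin:
        fin.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

vocab = load_vocab_utf8(path)
os.remove(path)
print(ascii_failed, vocab)
```

If the cause is the same here, passing an explicit `encoding` where tokenization.py opens the vocab file would be one fix. Another option that leaves the code untouched is to run the script under a UTF-8 locale (for example `LC_ALL=en_US.UTF-8`, provided that locale is installed on the machine).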