bert / Commit 332a6872

Authored November 23, 2018 by Jacob Devlin

Adding new multilingual model

Parent: 1cd50d7a
3 changed files with 38 additions and 15 deletions:

README.md        +18  -1
multilingual.md  +19  -13
tokenization.py  +1   -1
README.md
# BERT

**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai + Mongolian \*\*\*\*\***

We uploaded a new multilingual model which does *not* perform any normalization
on the input (no lower casing, accent stripping, or Unicode normalization), and
additionally includes Thai and Mongolian.

**It is recommended to use this version for developing multilingual models,
especially on languages with non-Latin alphabets.**

This does not require any code changes, and can be downloaded here:

*   **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

**\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\***

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is
...
...
@@ -207,7 +222,9 @@ The links to the models are here (right-click, 'Save link as...' on the name):

     12-layer, 768-hidden, 12-heads, 110M parameters
 *   **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters
     (Not available yet. Needs to be re-generated).
-*   **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
+*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
     102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
 *   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
     Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
...
...
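For concreteness, the normalization that the new cased model skips (and that the original uncased pipeline applies) roughly amounts to lower-casing plus NFD decomposition and dropping combining marks. The following is a minimal illustrative sketch, not part of this commit:

```python
# Illustrative sketch (not from this commit): roughly the normalization the
# original uncased multilingual pipeline applies and the new cased model skips.
import unicodedata

def uncased_normalize(text):
    text = text.lower()                        # lower casing
    text = unicodedata.normalize("NFD", text)  # Unicode normalization
    # Drop combining marks (category "Mn"), i.e. accent stripping.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncased_normalize(u"Über schön"))  # -> "uber schon"
```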
multilingual.md
...
...
@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release

 more single-language models, but we may release `BERT-Large` versions of these
 two in the future:

-*   **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
+*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
     102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
 *   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
     Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
     parameters

+**The `Multilingual Cased (New)` model also fixes normalization issues in many
+languages, so it is recommended in languages with non-Latin alphabets (and is
+often better for most languages with Latin alphabets). When using this model,
+make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
+scripts.**

 See the [list of languages](#list-of-languages) that the Multilingual model
 supports. The Multilingual model does include Chinese (and English), but if your
 fine-tuning data is Chinese-only, then the Chinese model will likely produce
...
...
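To illustrate the `--do_lower_case=false` point in the hunk above: the flag controls the repository's `BasicTokenizer`, and with the cased model it must be off so that case and accents survive preprocessing. A quick sketch, assuming `tokenization.py` from this repository is importable (the example sentence and printed outputs are illustrative):

```python
# Sketch: the effect of do_lower_case on the repository's BasicTokenizer.
# Assumes tokenization.py from this repo is on the Python path.
import tokenization

text = u"Schönes Wetter in Köln"

cased = tokenization.BasicTokenizer(do_lower_case=False)
uncased = tokenization.BasicTokenizer(do_lower_case=True)

print(cased.tokenize(text))    # ['Schönes', 'Wetter', 'in', 'Köln']
print(uncased.tokenize(text))  # ['schones', 'wetter', 'in', 'koln']
```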
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:
<!-- mdformat off(no table) -->
-| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
-| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
-| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
-| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
-| BERT - Translate Train          | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7     |
-| BERT - Translate Test           | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
-| BERT - Zero Shot                | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
+| System                            | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
+| --------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
+| XNLI Baseline - Translate Train   | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
+| XNLI Baseline - Translate Test    | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
+| BERT - Translate Train Cased      | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6     |
+| BERT - Translate Train Uncased    | 81.4     | 74.2     | 77.3     | 75.2     | 70.5     | 61.7     |
+| BERT - Translate Test Uncased     | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
+| BERT - Zero Shot Uncased          | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
<!-- mdformat on -->
...
...
@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:

 *   Western Punjabi
 *   Yoruba

-The only language which we unfortunately had to exclude was Thai, since it is
-the only language (other than Chinese) that does not use whitespace to delimit
-words, and it has too many characters per word to use character-based
-tokenization. Our WordPiece algorithm is quadratic with respect to the size of
-the input token, so very long character strings do not work with it.
+The **Multilingual Cased (New)** release additionally contains **Thai** and
+**Mongolian**, which were not included in the original release.
tokenization.py
...
...
@@ -249,7 +249,7 @@ class BasicTokenizer(object):

 class WordpieceTokenizer(object):
   """Runs WordPiece tokenization."""

-  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
     self.vocab = vocab
     self.unk_token = unk_token
     self.max_input_chars_per_word = max_input_chars_per_word
...
...
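The one-line change above raises the per-token character cap from 100 to 200, which matters for languages like Thai where whitespace-delimited chunks can be very long. The cap exists because greedy longest-match-first WordPiece is quadratic in the token's character length, as the multilingual.md text removed by this commit noted. Below is a simplified sketch of that algorithm (condensed from the real `WordpieceTokenizer.tokenize`; the toy vocabulary is made up):

```python
# Simplified sketch of greedy longest-match-first WordPiece segmentation,
# showing why max_input_chars_per_word exists: the nested loops below are
# quadratic in the token's character length, so very long unsegmented
# strings must be rejected up front.
def wordpiece(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    if len(token) > max_input_chars_per_word:
        return [unk_token]  # too long to segment affordably
    pieces, start = [], 0
    while start < len(token):
        # Try the longest remaining substring first, shrinking on a miss.
        end, cur = len(token), None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub  # continuation-piece convention
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no vocabulary piece matched
        pieces.append(cur)
        start = end
    return pieces

# Toy example: segments into ['un', '##aff', '##able'].
print(wordpiece("unaffable", {"un", "##aff", "##able"}))
```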