Commit 332a6872 authored by Jacob Devlin

Adding new multilingual model

Parent 1cd50d7a
# BERT
**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai +
Mongolian \*\*\*\*\***
We uploaded a new multilingual model which does *not* perform any normalization
on the input (no lower casing, accent stripping, or Unicode normalization), and
additionally includes Thai and Mongolian.
**It is recommended to use this version for developing multilingual models,
especially for languages with non-Latin alphabets.**
This does not require any code changes, and can be downloaded here:
* **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
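
To make the "no normalization" point concrete, here is a minimal sketch (assuming it is run from the repository root so that `tokenization.py` is importable) contrasting the uncased and cased preprocessing paths:

```python
# Sketch only: BasicTokenizer lower-cases and strips accents when
# do_lower_case=True; with do_lower_case=False the text passes through
# unchanged, which is the behavior the new Cased model expects.
import tokenization

uncased = tokenization.BasicTokenizer(do_lower_case=True)
cased = tokenization.BasicTokenizer(do_lower_case=False)

text = u"Hôtel Zürich"
print(uncased.tokenize(text))  # ['hotel', 'zurich'] -- normalized
print(cased.tokenize(text))    # ['Hôtel', 'Zürich'] -- left as-is
```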
**\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\***
We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is
@@ -207,7 +222,9 @@ The links to the models are here (right-click, 'Save link as...' on the name):
12-layer, 768-hidden, 12-heads, 110M parameters
* **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters
(Not available yet. Needs to be re-generated).
* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release
more single-language models, but we may release `BERT-Large` versions of these
two in the future:
* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
parameters
**The `Multilingual Cased (New)` model also fixes normalization issues in many
languages, so it is recommended for languages with non-Latin alphabets (and is
often better for most languages with Latin alphabets). When using this model,
make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
scripts.**
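
As a concrete illustration of what the flag controls, here is a small sketch of building the tokenizer the way the fine-tuning scripts (e.g. `run_classifier.py`) construct it, but with lower casing disabled for the Cased model. The vocab path is an assumption based on unpacking the `multi_cased_L-12_H-768_A-12` zip linked above into the working directory:

```python
# Sketch: FullTokenizer with lower casing disabled, the in-code equivalent
# of passing --do_lower_case=false on the command line.
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # assumed unpack location
    do_lower_case=False)

print(tokenizer.tokenize(u"München ist schön"))
```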
See the [list of languages](#list-of-languages) that the Multilingual model
supports. The Multilingual model does include Chinese (and English), but if your
fine-tuning data is Chinese-only, then the Chinese model will likely produce
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:
<!-- mdformat off(no table) -->
| System | English | Chinese | Spanish | German | Arabic | Urdu |
| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train         | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7     |
| BERT - Translate Test | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |
| System | English | Chinese | Spanish | German | Arabic | Urdu |
| --------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train Cased | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6 |
| BERT - Translate Train Uncased | 81.4 | 74.2 | 77.3 | 75.2 | 70.5 | 61.7 |
| BERT - Translate Test Uncased | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot Uncased | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |
<!-- mdformat on -->
@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:
* Western Punjabi
* Yoruba
The only language which we unfortunately had to exclude was Thai, since it is
the only language (other than Chinese) that does not use whitespace to delimit
words, and it has too many characters per word to use character-based
tokenization. Our WordPiece algorithm is quadratic with respect to the length
of the input token, so very long character strings do not work with it.
The **Multilingual Cased (New)** release additionally contains **Thai** and
**Mongolian**, which were not included in the original release.
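
For readers wondering where the quadratic behavior comes from: WordPiece does greedy longest-match-first segmentation, so for a single whitespace-delimited token of n characters the matching loop can probe on the order of n²/2 substrings. A simplified, self-contained sketch of that loop with a toy vocabulary (not the repo's exact implementation):

```python
def greedy_wordpiece(word, vocab):
    """Simplified longest-match-first WordPiece over a single word.

    For a word of n characters the inner loop can try up to n*(n+1)/2
    substrings, which is why very long unsegmented strings (e.g. a whole
    Thai sentence treated as one 'word') are costly.
    """
    pieces, start, probes = [], 0, 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:  # shrink the candidate until it is in the vocab
            probes += 1
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                match = cand
                break
            end -= 1
        if match is None:
            return ["[UNK]"], probes
        pieces.append(match)
        start = end
    return pieces, probes

vocab = {"un", "##aff", "##able"}
print(greedy_wordpiece("unaffable", vocab))  # (['un', '##aff', '##able'], 14)
```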
@@ -249,7 +249,7 @@ class BasicTokenizer(object):
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
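
The parameter change above raises the per-word cap that guards against exactly that quadratic cost: any whitespace-delimited chunk longer than `max_input_chars_per_word` is emitted as the unknown token instead of being segmented. A hedged usage sketch with a toy vocabulary (real models load the released `vocab.txt`; run from the repo root so `tokenization.py` is importable):

```python
import tokenization

# Toy vocabulary purely for illustration; real models use vocab.txt.
vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
wp = tokenization.WordpieceTokenizer(vocab=vocab)

print(wp.tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wp.tokenize("x" * 300))    # ['[UNK]'] -- exceeds max_input_chars_per_word
```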