Merge pull request #192 from AdamBear/master

support stream chat

Merge pull request #192 from AdamBear/master
support stream chat
5709f7bb · Zhengxiao Du · GitHub · 52aa3261 · ee763423 · 5709f7bb
6 changed file
--- a/README.md
+++ b/README.md
@@ -5,12 +5,12 @@
 ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于 [General Language Model (GLM)](https://github.com/THUDM/GLM) 架构，具有 62 亿参数。结合模型量化技术，用户可以在消费级的显卡上进行本地部署（INT4 量化级别下最低只需 6GB 显存）。
 ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练，辅以监督微调、反馈自助、人类反馈强化学习等技术的加持，62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。更多信息请参考我们的[博客](https://chatglm.cn/blog)。

-不过，由于ChatGLM-6B的规模较小，目前已知其具有相当多的[**局限性**](#局限性)，如事实性/数学逻辑错误，可能生成有害/有偏见内容，较弱的上下文能力，自我认知混乱，以及对英文指示生成与中文指示完全矛盾的内容。请大家在使用前了解这些问题，以免产生误解。
+不过，由于 ChatGLM-6B 的规模较小，目前已知其具有相当多的[**局限性**](#局限性)，如事实性/数学逻辑错误，可能生成有害/有偏见内容，较弱的上下文能力，自我认知混乱，以及对英文指示生成与中文指示完全矛盾的内容。请大家在使用前了解这些问题，以免产生误解。更大的基于1300亿参数[GLM-130B](https://github.com/THUDM/GLM-130B)的ChatGLM正在内测开发中。

 *Read this in [English](README_en.md).*

 ## 更新信息
-**[2023/03/19]** 增加流式输出接口`stream_chat`，已更新到网页版和命令行demo。修复输出中的中文标点
+**[2023/03/19]** 增加流式输出接口 `stream_chat`，已更新到网页版和命令行 Demo。修复输出中的中文标点。增加量化后的模型 [ChatGLM-6B-INT4](https://huggingface.co/THUDM/chatglm-6b-int4)

 ## 使用方式

@@ -34,6 +34,7 @@ ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进
 >>> from transformers import AutoTokenizer, AutoModel
 >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
 >>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
+>>> model = model.eval()
 >>> response, history = model.chat(tokenizer, "你好", history=[])
 >>> print(response)
 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
@@ -50,7 +51,7 @@ ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进

 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
 ```
-完整的模型实现可以在 [Hugging Face Hub](https://huggingface.co/THUDM/chatglm-6b) 上查看。如果你从Hugging Face Hub上下载checkpoint的速度较慢，也可以从[这里](https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/)手动下载。
+完整的模型实现可以在 [Hugging Face Hub](https://huggingface.co/THUDM/chatglm-6b) 上查看。如果你从 Hugging Face Hub 上下载checkpoint的速度较慢，也可以从[这里](https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/)手动下载。

 ### Demo

@@ -63,7 +64,7 @@ cd ChatGLM-6B

 #### 网页版 Demo

-![web-demo](resources/web-demo.png)
+![web-demo](resources/web-demo.gif)

 首先安装 Gradio：`pip install gradio`，然后运行仓库中的 [web_demo.py](web_demo.py)： 

@@ -71,9 +72,9 @@ cd ChatGLM-6B
 python web_demo.py
 ```

-程序会运行一个 Web Server，并输出地址。在浏览器中打开输出的地址即可使用。
+程序会运行一个 Web Server，并输出地址。在浏览器中打开输出的地址即可使用。最新版 Demo 实现了打字机效果，速度体验大大提升。注意，由于国内 Gradio 的网络访问较为缓慢，启用 `demo.queue().launch(share=True, inbrowser=True)` 时所有网络会经过 Gradio 服务器转发，导致打字机体验大幅下降，现在默认启动方式已经改为 `share=False`，如有需要公网访问的需求，可以重新修改为 `share=True` 启动。

-感谢[@AdamBear](https://github.com/AdamBear) 实现了基于Streamlit的网页版demo，运行方式见[#117](https://github.com/THUDM/ChatGLM-6B/pull/117).
+感谢[@AdamBear](https://github.com/AdamBear) 实现了基于 Streamlit 的网页版 Demo，运行方式见[#117](https://github.com/THUDM/ChatGLM-6B/pull/117).

 #### 命令行 Demo

@@ -100,24 +101,27 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).ha

 模型量化会带来一定的性能损失，经过测试，ChatGLM-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。使用 [GPT-Q](https://arxiv.org/abs/2210.17323) 等量化方案可以进一步压缩量化精度/提升相同量化精度下的模型性能，欢迎大家提出对应的 Pull Request。

-### CPU部署
-如果你没有GPU硬件的话，也可以在CPU上进行推理。使用方法如下
+**[2023/03/19]** 量化过程需要在内存中首先加载 FP16 格式的模型，消耗大概 13GB 的内存。如果你的内存不足的话，可以直接加载量化后的模型，仅需大概 5.2GB 的内存：
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
+```
+
+### CPU 部署
+如果你没有 GPU 硬件的话，也可以在 CPU 上进行推理，但是推理速度会更慢。使用方法如下（需要大概 32GB 内存）
 ```python
 model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
 ```
-CPU上推理速度可能会比较慢。

-以上方法需要32G内存。如果你只有16G内存，可以尝试
+**[2023/03/19]** 如果你的内存不足，可以直接加载量化后的模型：
 ```python
-model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()
+model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True).float()
 ```
-需保证空闲内存接近16G，并且推理速度会很慢。

 如果遇到了报错 `Could not find module 'nvcuda.dll'` 或者 `RuntimeError: Unknown platform: darwin` (MacOS) 的话请参考这个[Issue](https://github.com/THUDM/ChatGLM-6B/issues/6#issuecomment-1470060041).

-## ChatGLM-6B示例
+## ChatGLM-6B 示例

-以下是一些使用`web_demo.py`得到的示例截图。更多ChatGLM-6B的可能，等待你来探索发现！
+以下是一些使用 `web_demo.py` 得到的示例截图。更多 ChatGLM-6B 的可能，等待你来探索发现！

 <details><summary><b>自我认知</b></summary>

@@ -173,9 +177,9 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bf

 ## 局限性

-由于ChatGLM-6B的小规模，其能力仍然有许多局限性。以下是我们目前发现的一些问题：
+由于 ChatGLM-6B 的小规模，其能力仍然有许多局限性。以下是我们目前发现的一些问题：

- 模型容量较小：6B的小容量，决定了其相对较弱的模型记忆和语言能力。在面对许多事实性知识任务时，ChatGLM-6B可能会生成不正确的信息；它也不擅长逻辑类问题（如数学、编程）的解答。
+- 模型容量较小：6B 的小容量，决定了其相对较弱的模型记忆和语言能力。在面对许多事实性知识任务时，ChatGLM-6B 可能会生成不正确的信息；它也不擅长逻辑类问题（如数学、编程）的解答。
    <details><summary><b>点击查看例子</b></summary>
    
    ![](limitations/factual_error.png)
@@ -184,7 +188,7 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bf
    
    </details>
  
- 产生有害说明或有偏见的内容：ChatGLM-6B只是一个初步与人类意图对齐的语言模型，可能会生成有害、有偏见的内容。（内容可能具有冒犯性，此处不展示）
+- 产生有害说明或有偏见的内容：ChatGLM-6B 只是一个初步与人类意图对齐的语言模型，可能会生成有害、有偏见的内容。（内容可能具有冒犯性，此处不展示）

 - 英文能力不足：ChatGLM-6B 训练时使用的指示/回答大部分都是中文的，仅有极小一部分英文内容。因此，如果输入英文指示，回复的质量远不如中文，甚至与中文指示下的内容矛盾，并且出现中英夹杂的情况。


--- a/README_en.md
+++ b/README_en.md
@@ -7,7 +7,7 @@ ChatGLM-6B is an open bilingual language model based on [General Language Model
 ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dialogue. The model is trained for about 1T tokens of Chinese and English corpus, supplemented by supervised fine-tuning, feedback bootstrap, and reinforcement learning wit human feedback. With only about 6.2 billion parameters, the model is able to generate answers that are in line with human preference.

 ## Update
-**[2023/03/19]** Add streaming output function `stream_chat`, already applied in web and CLI demo. Fix Chinese punctuations in output.
+**[2023/03/19]** Add streaming output function `stream_chat`, already applied in web and CLI demo. Fix Chinese punctuations in output. Add quantized model [ChatGLM-6B-INT4](https://huggingface.co/THUDM/chatglm-6b-int4). 

 ## Getting Started

@@ -31,6 +31,7 @@ Generate dialogue with the following code
 >>> from transformers import AutoTokenizer, AutoModel
 >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
 >>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
+>>> model = model.eval()
 >>> response, history = model.chat(tokenizer, "你好", history=[])
 >>> print(response)
 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
@@ -98,24 +99,24 @@ After 2 to 3 rounds of dialogue, the GPU memory usage is about 10GB under 8-bit

 Model quantization brings a certain performance decline. After testing, ChatGLM-6B can still perform natural and smooth generation under 4-bit quantization. using [GPT-Q](https://arxiv.org/abs/2210.17323) etc. The quantization scheme can further compress the quantization accuracy/improve the model performance under the same quantization accuracy. You are welcome to submit corresponding Pull Requests.

+**[2023/03/19]** The quantization costs about 13GB of CPU memory to load the FP16 model. If your CPU memory is limited, you can directly load the quantized model, which costs only 5.2GB CPU memory: 
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
+```
+
 ### CPU Deployment

-If your computer is not equipped with GPU, you can also conduct inference on CPU:
+If your computer is not equipped with GPU, you can also conduct inference on CPU, but the inference speed is slow (and taking about 32GB of memory):

 ```python
 model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
 ```

-The inference speed will be relatively slow on CPU.
-
-The above method requires 32GB of memory. If you only have 16GB of memory, you can try:
-
+**[2023/03/19]** If your CPU memory is limited, you can directly load the quantized model:
 ```python
-model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()
+model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float()
 ```

-It is necessary to ensure that there is nearly 16GB of free memory, and the inference speed will be very slow.
-
 **For Mac users**: if your encounter the error `RuntimeError: Unknown platform: darwin`, please refer to this [Issue](https://github.com/THUDM/ChatGLM-6B/issues/6#issuecomment-1470060041). 

 ## ChatGLM-6B Examples

--- a/requirements.txt
+++ b/requirements.txt
 protobuf>=3.19.5,<3.20.1
-transformers>=4.26.1
+transformers==4.26.1
 icetk
 cpm_kernels
 torch>=1.10

--- a/resources/web-demo.gif
+++ b/resources/web-demo.gif
--- a/web_demo.py
+++ b/web_demo.py
@@ -42,4 +42,4 @@ with gr.Blocks() as demo:
            temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
            button = gr.Button("Generate")
    button.click(predict, [txt, max_length, top_p, temperature, state], [state] + text_boxes)
-demo.queue().launch(share=True, inbrowser=True)
\ No newline at end of file
+demo.queue().launch(share=False, inbrowser=True)
--- a/web_demo2.py
+++ b/web_demo2.py
@@ -25,20 +25,31 @@ def predict(input, history=None):
    tokenizer, model = get_model()
    if history is None:
        history = []
-    response, history = model.chat(tokenizer, input, history)

-    for i, (query, response) in enumerate(history):
-        message(query, avatar_style="big-smile", key=str(i) + "_user")
-        message(response, avatar_style="bottts", key=str(i))
+    with container:
+        if len(history) > 0:
+            for i, (query, response) in enumerate(history):
+                message(query, avatar_style="big-smile", key=str(i) + "_user")
+                message(response, avatar_style="bottts", key=str(i))
+
+        message(input, avatar_style="big-smile", key=str(len(history)) + "_user")
+        st.write("AI正在回复:")
+        with st.empty():
+            for response, history in model.stream_chat(tokenizer, input, history):
+                query, response = history[-1]
+                st.write(response)

    return history


+container = st.container()
+
 # create a prompt text for the text generation
 prompt_text = st.text_area(label="用户命令输入",
            height = 100,
            placeholder="请在这儿输入您的命令")

+
 if 'state' not in st.session_state:
    st.session_state['state'] = []

@@ -47,4 +58,4 @@ if st.button("发送", key="predict"):
        # text generation
        st.session_state["state"] = predict(prompt_text, st.session_state["state"])

-    st.balloons()
\ No newline at end of file
+    st.session_state["state"]
\ No newline at end of file