Commit d1de8fe8 authored by liuyuan

docs: optimization k8s deploy

Parent 90f46849
......@@ -10,17 +10,17 @@ As a time series database for Cloud Native architecture design, TDengine support
To meet [high availability](https://docs.taosdata.com/tdinternal/high-availability/) requirements, the cluster needs to satisfy the following:
- 3 or more dnodes: multiple vnodes in the same vgroup of TDengine are not allowed to reside on the same dnode, so if you create a database with 3 replicas, the number of dnodes must be greater than or equal to 3.
- 3 mnodes: the mnode is responsible for managing the entire cluster, and TDengine defaults to a single mnode. If the dnode hosting that mnode goes offline, the entire cluster becomes unavailable (see the sketch after this list for creating the additional mnodes).
- 3 database replicas: the replica configuration of TDengine is at the database level, so 3 replicas are enough for a 3-dnode cluster, where any single dnode can go offline without affecting normal use of the cluster. **If 2 dnodes are offline, the cluster becomes unavailable because RAFT cannot complete the leader election.** (Enterprise Edition: in a disaster recovery scenario, if the data files on any single node are damaged, the node can be recovered by relaunching the dnode.)
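For illustration, once the three dnodes from the StatefulSet later in this guide have joined the cluster, the extra mnodes and a three-replica database can be created from inside any pod with the taos CLI. This is only a sketch: the dnode ids (2 and 3), the namespace, and the database name `test` are assumptions that should be confirmed with `show dnodes` first.

```Bash
# Add two more mnodes so that the cluster runs three in total
# (dnode ids are assumed; check them with "show dnodes" first).
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "create mnode on dnode 2"
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "create mnode on dnode 3"

# Create a database with three replicas (the database name is illustrative).
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "create database test replica 3"
```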
## Prerequisites
Before deploying TDengine on Kubernetes, perform the following:
- This document applies to Kubernetes v1.19 and above.
- This document uses the kubectl tool for installation and deployment; please install the corresponding software in advance.
- Kubernetes has been installed and deployed, can be accessed and used normally, and can access or update the necessary container registries or other services.
You can download the configuration files in this document from [GitHub](https://github.com/taosdata/TDengine-Operator/tree/3.0/src/tdengine).
......@@ -52,7 +52,7 @@ spec:
Following the Kubernetes guidance on the various workload types, we will use a StatefulSet as the deployment resource type for TDengine. Create the file `tdengine.yaml`, in which replicas sets the number of cluster nodes to 3. The node time zone is China (Asia/Shanghai), and each node is allocated 5 GB of standard storage (refer to [Storage Classes](https://kubernetes.io/docs/concepts/storage/storage-classes/) to configure a storage class). You can adjust these values to fit your actual situation.
Please pay special attention to the startupProbe configuration. After a dnode's Pod has been down for a period of time and is then restarted, the newly launched dnode is temporarily unavailable. If the startupProbe window is too small, Kubernetes will consider the Pod abnormal and try to restart it; the dnode's Pod will then restart repeatedly and never return to a normal state. Refer to [Configure Liveness, Readiness and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
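If the probe window turns out to be too small for your environment, one way to enlarge it on an already running deployment is a strategic-merge patch, sketched below. The container name `tdengine` and the TCP check on port 6041 are assumptions; adjust them to match your `tdengine.yaml`.

```Bash
# Enlarge the startup window to roughly 10 minutes (60 checks x 10 s).
# Assumes both the StatefulSet and its container are named "tdengine".
kubectl -n tdengine-test patch statefulset tdengine --patch '
spec:
  template:
    spec:
      containers:
        - name: tdengine
          startupProbe:
            tcpSocket:
              port: 6041
            failureThreshold: 60
            periodSeconds: 10
'
```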
```YAML
---
......@@ -149,7 +149,7 @@ spec:
## Use kubectl to deploy TDengine
First create the corresponding namespace, and then execute the following commands in sequence:
```Bash
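# The namespace must exist before the manifests are applied; a minimal sketch,
# assuming the namespace name tdengine-test used throughout this guide:
kubectl create namespace tdengine-test
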
kubectl apply -f taosd-service.yaml -n tdengine-test
......@@ -230,7 +230,7 @@ Query OK, 3 row(s) in set (0.003108s)
## Enable port forwarding
The kubectl port-forwarding feature allows applications to access a TDengine cluster running in a Kubernetes environment.
```bash
kubectl port-forward -n tdengine-test tdengine-0 6041:6041 &
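
# Once the forward is active, the REST API can be checked through the local port
# (root/taosdata are the default credentials assumed by this guide):
curl -u root:taosdata -d "show databases" 127.0.0.1:6041/rest/sql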
......@@ -325,7 +325,7 @@ Query OK, 2 row(s) in set (0.001489s)
### Test fault tolerance
The dnode where the mnode leader is located goes offline: dnode1.
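One way to reproduce this scenario for testing is to delete the pod that hosts dnode1. In this walkthrough that pod is assumed to be `tdengine-0`; confirm the mapping with `show dnodes` and `show mnodes` before trying it.

```Bash
# Simulate the failure by deleting the pod assumed to host the mnode leader;
# the StatefulSet recreates it and the remaining mnodes elect a new leader.
kubectl delete pod tdengine-0 -n tdengine-test
```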
```Bash
kubectl get pod -l app=tdengine -n tdengine-test -o wide
......@@ -389,7 +389,7 @@ taos> select *from test1.t1
Query OK, 4 row(s) in set (0.001994s)
```
Similarly, if a dnode hosting a non-leader mnode goes offline, reads and writes continue to work normally, so we do not show that case here.
## Scaling Out Your Cluster
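Scaling out is done by increasing the replica count of the StatefulSet; a minimal sketch is shown below. The target of 4 replicas matches the four-node example that follows, and the new dnode still has to be confirmed with `show dnodes` once its pod reports ready.

```Bash
# Grow the StatefulSet from 3 to 4 replicas; Kubernetes creates tdengine-3,
# which joins the TDengine cluster as a new dnode once the pod is ready.
kubectl scale statefulsets tdengine --replicas=4 -n tdengine-test
```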
......@@ -425,7 +425,7 @@ The dnode list of the expanded four-node TDengine cluster:
```Plain
taos> show dnodes
     id      |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |       reboot_time       |              note              |          active_code           |         c_active_code          |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
......@@ -469,7 +469,7 @@ tdengine-1 1/1 Running 1 (7h9m ago) 7h23m 10.244.0.59 node84 <
tdengine-2 1/1 Running 0 5h45m 10.244.1.224 node85 <none> <none>
```
After the pod is deleted, the PVC must be deleted manually; otherwise the old data will be reused at the next scale-out and the new pod will be unable to join the cluster normally.
```Bash
kubectl delete pvc taosdata-tdengine-3 -n tdengine-test
......@@ -491,7 +491,7 @@ tdengine-3 1/1 Running 0 20s 10.244.2.77 node86 <
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"
taos> show dnodes
     id      |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |       reboot_time       |              note              |          active_code           |         c_active_code          |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
......@@ -504,7 +504,7 @@ Query OK, 4 row(s) in set (0.003881s)
> **When deleting a PVC, pay attention to the PV's persistentVolumeReclaimPolicy. It is recommended to change it to Delete, so that when the PVC is deleted the PV is cleaned up automatically, together with the underlying CSI storage resources. If the policy is not configured to clean up the PV automatically when the PVC is deleted, then after you delete the PVC and clean up the PV manually, the CSI storage resources behind the PV may not be released.**
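A minimal sketch of switching the reclaim policy on an existing PV; `<pv-name>` is a placeholder for the PV bound to the PVC you are about to delete.

```Bash
# Find the PV bound to the PVC, then switch its reclaim policy to Delete.
kubectl get pv
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
```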
To remove the TDengine cluster completely, you need to clean up the statefulset, svc, configmap, and pvc respectively.
```Bash
kubectl delete statefulset -l app=tdengine -n tdengine-test
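
# The remaining objects are removed the same way. The app=tdengine label on the
# Service and on the PVCs is an assumption based on the manifests in this guide;
# adjust the selectors if your labels differ.
kubectl delete svc -l app=tdengine -n tdengine-test
kubectl delete pvc -l app=tdengine -n tdengine-test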
......@@ -534,10 +534,10 @@ Query OK, 4 row(s) in set (0.003862s)
## Finally
Regarding the high availability and high reliability of TDengine in a Kubernetes environment, protection against hardware damage and disaster recovery is provided at two levels:
1. The disaster recovery capability of the underlying distributed block storage: with multi-replica block storage, popular distributed block storage systems such as Ceph can spread storage replicas across different racks, cabinets, rooms, and data centers (or you can directly use the block storage service provided by a public cloud vendor).
2. TDengine's own disaster recovery: TDengine Enterprise can recover from a dnode going permanently offline (for example, a physical machine's disk is damaged and data is lost) by launching a blank dnode to take over the work of the original dnode.
Finally, you are welcome to try [TDengine Cloud](https://cloud.tdengine.com/) to experience the one-stop, fully managed TDengine cloud service.
......
......@@ -6,27 +6,27 @@ description: A detailed guide to deploying a TDengine cluster with Kubernetes
## Overview
As a time-series database designed for cloud-native architectures, TDengine natively supports Kubernetes deployment. This section walks through creating a production-ready, highly available TDengine cluster from scratch using YAML files, and focuses on common TDengine operations in a Kubernetes environment.
To meet [high availability](https://docs.taosdata.com/tdinternal/high-availability/) requirements, the cluster needs to satisfy the following:
- 3 or more dnodes: multiple vnodes in the same vgroup of TDengine are not allowed to reside on the same dnode, so if you create a database with 3 replicas, the number of dnodes must be greater than or equal to 3.
- 3 mnodes: the mnode is responsible for managing the entire cluster, and TDengine defaults to a single mnode. If the dnode hosting that mnode goes offline, the entire cluster becomes unavailable.
- 3 database replicas: the replica configuration of TDengine is at the database level, so 3 replicas are enough for a 3-dnode cluster, where any single dnode can go offline without affecting normal use of the cluster. **If 2 dnodes are offline, the cluster becomes unavailable because RAFT cannot complete the leader election.** (Enterprise Edition: in a disaster recovery scenario, if the data files on any single node are damaged, the node can be recovered by relaunching the dnode.)
## Prerequisites
To deploy and manage a TDengine cluster with Kubernetes, make the following preparations.
- This document applies to Kubernetes v1.19 and above.
- This document uses the kubectl tool for installation and deployment; please install the corresponding software in advance.
- Kubernetes has been installed and deployed, can be accessed and used normally, and can access or update the necessary container registries or other services.
The configuration files below can also be downloaded from the [GitHub repository](https://github.com/taosdata/TDengine-Operator/tree/3.0/src/tdengine).
## Configure the Service
Create a Service configuration file `taosd-service.yaml`; the service name `metadata.name` (here "taosd") will be used in the next step. First add the ports used by TDengine, then set a deterministic label app (here "tdengine") in the selector:
```YAML
---
......@@ -50,9 +50,9 @@ spec:
## StatefulSet
Following the Kubernetes guidance on the various workload types, we will use a StatefulSet as the deployment resource type for TDengine. Create the file `tdengine.yaml`, in which replicas sets the number of cluster nodes to 3. The node time zone is China (Asia/Shanghai), and each node is allocated 5 GB of standard storage (refer to [Storage Classes](https://kubernetes.io/docs/concepts/storage/storage-classes/) to configure a storage class). You can adjust these values to fit your actual situation.
Please pay special attention to the startupProbe configuration. After a dnode's Pod has been down for a period of time and is then restarted, the newly launched dnode is temporarily unavailable. If the startupProbe window is too small, Kubernetes will consider the Pod abnormal and try to restart it; the dnode's Pod will then restart repeatedly and never return to a normal state. Refer to [Configure Liveness, Readiness and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
```YAML
---
......@@ -149,7 +149,7 @@ spec:
## Use kubectl to deploy the TDengine cluster
First create the corresponding namespace, and then execute the following commands in sequence:
```Bash
kubectl apply -f taosd-service.yaml -n tdengine-test
......@@ -168,7 +168,7 @@ kubectl exec -it tdengine-2 -n tdengine-test -- taos -s "show dnodes"
```Bash
taos> show dnodes
     id      |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |       reboot_time       |              note              |          active_code           |         c_active_code          |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 0 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-19 17:54:18.469 | | | |
2 | tdengine-1.ta... | 0 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-19 17:54:38.698 | | | |
......@@ -232,13 +232,13 @@ Query OK, 3 row(s) in set (0.003108s)
The kubectl port-forwarding feature allows applications to access a TDengine cluster running in a Kubernetes environment.
```Bash
kubectl port-forward -n tdengine-test tdengine-0 6041:6041 &
```
Use the curl command to verify the TDengine REST API on port 6041.
```Bash
curl -u root:taosdata -d "show databases" 127.0.0.1:6041/rest/sql
{"code":0,"column_meta":[["name","VARCHAR",64]],"data":[["information_schema"],["performance_schema"],["test"],["test1"]],"rows":4}
```
......@@ -278,7 +278,7 @@ taos> show dnodes
Query OK, 3 row(s) in set (0.001357s)
```
Check the vnode distribution with show vgroups:
```Bash
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show test.vgroups"
......@@ -325,7 +325,7 @@ Query OK, 2 row(s) in set (0.001489s)
### Test fault tolerance
The dnode where the mnode leader is located goes offline: dnode1.
```Bash
kubectl get pod -l app=tdengine -n tdengine-test -o wide
......@@ -389,7 +389,7 @@ taos> select *from test1.t1
Query OK, 4 row(s) in set (0.001994s)
```
Similarly, if a dnode hosting a non-leader mnode goes offline, reads and writes continue to work normally, so we do not show that case here.
## Scaling Out Your Cluster
......@@ -415,7 +415,7 @@ tdengine-2 1/1 Running 0 5h16m 10.244.1.224 node85 <
tdengine-3 1/1 Running 0 3m24s 10.244.2.76 node86 <none> <none>
```
At this point the Pod status is still Running; the new dnode does not show up in the TDengine cluster until the Pod status becomes `ready`:
```Bash
kubectl exec -it tdengine-3 -n tdengine-test -- taos -s "show dnodes"
......@@ -436,7 +436,7 @@ Query OK, 4 row(s) in set (0.003628s)
## Scaling In Your Cluster
Because the TDengine cluster migrates data between nodes when it is scaled in or out, scaling in with kubectl requires first running the "drop dnodes" command (**if the cluster contains databases with 3 replicas, the number of dnodes after scaling in must still be greater than or equal to 3; otherwise the drop dnode operation will be aborted**), and only after the node has been removed should you scale in the Kubernetes cluster, as sketched after the note below.
Note: because Pods in a Kubernetes StatefulSet can only be removed in reverse order of creation, TDengine dnodes must also be dropped in reverse order of creation; otherwise the Pod will end up in an error state.
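A minimal sketch of the two steps, assuming the four-dnode cluster from the scale-out section, where dnode 4 (pod `tdengine-3`) was the most recently created:

```Bash
# 1. Drop the most recently created dnode from the TDengine cluster first.
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "drop dnode 4"

# 2. Then shrink the StatefulSet so Kubernetes removes the matching pod (tdengine-3).
kubectl scale statefulsets tdengine --replicas=3 -n tdengine-test
```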
......@@ -491,7 +491,7 @@ tdengine-3 1/1 Running 0 20s 10.244.2.77 node86 <
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"
taos> show dnodes
     id      |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |       reboot_time       |              note              |          active_code           |         c_active_code          |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
......@@ -534,10 +534,10 @@ Query OK, 4 row(s) in set (0.003862s)
## Finally
Regarding the high availability and high reliability of TDengine in a Kubernetes environment, protection against hardware damage and disaster recovery is provided at two levels:
1. The disaster recovery capability of the underlying distributed block storage: with multi-replica block storage, popular distributed block storage systems such as Ceph can spread storage replicas across different racks, cabinets, rooms, and data centers (or you can directly use the block storage service provided by a public cloud vendor).
2. TDengine's own disaster recovery: TDengine Enterprise can recover from a dnode going permanently offline (for example, a physical machine's disk is damaged and data is lost) by launching a blank dnode to take over the work of the original dnode.
Finally, you are welcome to use [TDengine Cloud](https://cloud.taosdata.com/) to experience the one-stop, fully managed TDengine cloud service.
......