This chapter introduces how to write data into TDengine with high throughput.
## How to achieve high performance data writing
To achieve high performance writing, there are a few aspects to consider. In the following sections we will describe these important factors in achieving high performance writing.
### Application Program
From the perspective of application program, you need to consider:
1. The data size of each single write, also known as batch size. Generally speaking, higher batch size generates better writing performance. However, once the batch size is over a specific value, you will not get any additional benefit anymore. When using SQL to write into TDengine, it's better to put as much as possible data in single SQL. The maximum SQL length supported by TDengine is 1,048,576 bytes, i.e. 1 MB. It can be configured by parameter `maxSQLLength` on client side, and the default value is 65,480.
2. The number of concurrent connections. Normally more connections can get better result. However, once the number of connections exceeds the processing ability of the server side, the performance may downgrade.
3. The distribution of data to be written across tables or sub-tables. Writing to single table in one batch is more efficient than writing to multiple tables in one batch.
4. Data Writing Protocol.
- Prameter binding mode is more efficient than SQL because it doesn't have the cost of parsing SQL.
- Writing to known existing tables is more efficient than wirting to uncertain tables in automatic creating mode because the later needs to check whether the table exists or not before actually writing data into it
- Writing in SQL is more efficient than writing in schemaless mode because schemaless writing creats table automatically and may alter table schema
Application programs need to take care of the above factors and try to take advantage of them. The application progam should write to single table in each write batch. The batch size needs to be tuned to a proper value on a specific system. The number of concurrent connections needs to be tuned to a proper value too to achieve the best writing throughput.
### Data Source
Application programs need to read data from data source then write into TDengine. If you meet one or more of below situations, you need to setup message queues between the threads for reading from data source and the threads for writing into TDengine.
1. There are multiple data sources, the data generation speed of each data source is much slower than the speed of single writing thread. In this case, the purpose of message queues is to consolidate the data from multiple data sources together to increase the batch size of single write.
2. The speed of data generation from single data source is much higher than the speed of single writing thread. The purpose of message queue in this case is to provide buffer so that data is not lost and multiple writing threads can get data from the buffer.
3. The data for single table are from multiple data source. In this case the purpose of message queues is to combine the data for single table together to improve the write efficiency.
If the data source is Kafka, then the appication program is a consumer of Kafka, you can benefit from some kafka features to achieve high performance writing:
1. Put the data for a table in single partition of single topic so that it's easier to put the data for each table together and write in batch
2. Subscribe multiple topics to accumulate data together.
3. Add more consumers to gain more concurrency and throughput.
4. Incrase the size of single fetch to increase the size of write batch.
### Tune TDengine
TDengine is a distributed and high performance time series database, there are also some ways to tune TDengine to get better writing performance.
1. Set proper number of `vgroups` according to available CPU cores. Normally, we recommend 2 \* number_of_cores as a starting point. If the verification result shows this is not enough to utilize CPU resources, you can use a higher value.
2. Set proper `minTablesPerVnode`, `tableIncStepPerVnode`, and `maxVgroupsPerDb` according to the number of tables so that tables are distributed even across vgroups. The purpose is to balance the workload among all vnodes so that system resources can be utilized better to get higher performance.
For more performance tuning parameters, please refer to [Configuration Parameters](../../../reference/config).
## Sample Programs
This section will introduce the sample programs to demonstrate how to write into TDengine with high performance.
### Scenario
Below are the scenario for the sample programs of high performance wrting.
- Application program reads data from data source, the sample program simulates a data source by generating data
- The speed of single writing thread is much slower than the speed of generating data, so the program starts multiple writing threads while each thread establish a connection to TDengine and each thread has a message queue of fixed size.
- Application program maps the received data to different writing threads based on table name to make sure all the data for each table is always processed by a specific writing thread.
- Each writing thread writes the received data into TDengine once the message queue becomes empty or the read data meets a threshold.
![Thread Model of High Performance Writing into TDengine](highvolume.webp)
### Sample Programs
The sample programs listed in this section are based on the scenario described previously. If your scenarios is different, please try to adjust the code based on the principles described in this chapter.
The sample programs assume the source data is for all the different sub tables in same super table (meters). The super table has been created before the sample program starts to writing data. Sub tables are created automatically according to received data. If there are multiple super tables in your case, please try to adjust the part of creating table automatically.
| DataBaseMonitor | Calculate the writing speed and output on console every 10 seconds |
Below is the list of complete code of the classes in above table and more detailed description.
<details>
<summary>FastWriteExample</summary>
The main Program is responsible for:
1. Create message queues
2. Start writing threads
3. Start reading threads
4. Otuput writing speed every 10 seconds
The main program provides 4 parameters for tuning:
1. The number of reading threads, default value is 1
2. The number of writing threads, default alue is 2
3. The total number of tables in the generated data, default value is 1000. These tables are distributed evenly across all writing threads. If the number of tables is very big, it will cost much time to firstly create these tables.
4. The batch size of single write, default value is 3,000
The capacity of message queue also impacts performance and can be tuned by modifying program. Normally it's always better to have a larger message queue. A larger message queue means lower possibility of being blocked when enqueueing and higher throughput. But a larger message queue consumes more memory space. The default value used in the sample programs is already big enoug.
ReadTask reads data from data source. Each ReadTask is associated with a simulated data source, each data source generates data for a group of specific tables, and the data of any table is only generated from a single specific data source.
ReadTask puts data in message queue in blocking mode. That means, the putting operation is blocked if the message queue is full.
SQLWriter class encapsulates the logic of composing SQL and writing data. Please be noted that the tables have not been created before writing, but are created automatically when catching the exception of table doesn't exist. For other exceptions caught, the SQL which caused the exception are logged for you to debug.
You need to set environment variable `TDENGINE_JDBC_URL` before launching the program. If TDengine Server is setup on localhost, then the default value for user name, password and port can be used, like below:
If your TDengine server is not deployed on localhost or doesn't use default port, you need to change the above URL to correct value in your environment.
SQLWriter class encapsulates the logic of composing SQL and writing data. Please be noted that the tables have not been created before writing, but are created automatically when catching the exception of table doesn't exist. For other exceptions caught, the SQL which caused the exception are logged for you to debug. This class also checks the SQL length, if the SQL length is closed to `maxSQLLength` the SQL will be executed immediately. To improve writing efficiency, it's better to increase `maxSQLLength` properly.
<summary>SQLWriter</summary>
```python
{{#include docs/examples/python/sql_writer.py}}
```
</details>
**Steps to Launch**
<details>
<summary>Launch Sample Program in Python</summary>
1. Prerequisities
- TDengine client driver has been installed
- Python3 has been installed, the the version >= 3.8
- TDengine Python connector `taospy` has been installed
2. Install faster-fifo to replace python builtin multiprocessing.Queue
```
pip3 install faster-fifo
```
3. Click the "Copy" in the above sample programs to copy `fast_write_example.py` 、 `sql_writer.py` and `mockdatasource.py`.
Don't establish connection to TDengine in the parent process if using Python connector in multi-process way, otherwise all the connections in child processes are blocked always. This is a known issue.
@@ -114,7 +114,9 @@ The above process can be repeated to add more dnodes in the cluster.
Any node that is in the cluster and online can be the firstEp of new nodes.
Nodes use the firstEp parameter only when joining a cluster for the first time. After a node has joined the cluster, it stores the latest mnode in its end point list and no longer makes use of firstEp.
However, firstEp is used by clients that connect to the cluster. For example, if you run `taos shell` without arguments, it connects to the firstEp by default.
However, firstEp is used by clients that connect to the cluster. For example, if you run TDengine CLI `taos` without arguments, it connects to the firstEp by default.
Two dnodes that are launched without a firstEp value operate independently of each other. It is not possible to add one dnode to the other dnode and form a cluster. It is also not possible to form two independent clusters into a new cluster.
* A helper class encapsulate the logic of writing using SQL.
* <p>
* The main interfaces are two methods:
* <ol>
* <li>{@link SQLWriter#processLine}, which receive raw lines from WriteTask and group them by table names.</li>
* <li>{@link SQLWriter#flush}, which assemble INSERT statement and execute it.</li>
* </ol>
* <p>
* There is a technical skill worth mentioning: we create table as needed when "table does not exist" error occur instead of creating table automatically using syntax "INSET INTO tb USING stb".
* This ensure that checking table existence is a one-time-only operation.
Since TDengine use container hostname to establish connections, it's a bit more complex to use taos shell and native connectors(such as JDBC-JNI) with TDengine container instance. This is the recommended way to expose ports and use TDengine with docker in simple cases. If you want to use taos shell or taosc/connectors smoothly outside the `tdengine` container, see next use cases that match you need.
Since TDengine use container hostname to establish connections, it's a bit more complex to use TDengine CLI and native connectors(such as JDBC-JNI) with TDengine container instance. This is the recommended way to expose ports and use TDengine with docker in simple cases. If you want to use TDengine CLI or taosc/connectors smoothly outside the `tdengine` container, see next use cases that match you need.
### Start with host network
...
...
@@ -87,7 +87,7 @@ docker run -d \
This command starts a docker container with TDengine server running and maps the container's TCP ports from 6030 to 6049 to the host's ports from 6030 to 6049 with TCP protocol and UDP ports range 6030-6039 to the host's UDP ports 6030-6039. If the host is already running TDengine server and occupying the same port(s), you need to map the container's port to a different unused port segment. (Please see TDengine 2.0 Port Description for details). In order to support TDengine clients accessing TDengine server services, both TCP and UDP ports need to be exposed by default(unless `rpcForceTcp` is set to `1`).
If you want to use taos shell or native connectors([JDBC-JNI](https://www.taosdata.com/cn/documentation/connector/java), or [driver-go](https://github.com/taosdata/driver-go)), you need to make sure the `TAOS_FQDN` is resolvable at `/etc/hosts` or with custom DNS service.
If you want to use TDengine CLI or native connectors([JDBC-JNI](https://www.taosdata.com/cn/documentation/connector/java), or [driver-go](https://github.com/taosdata/driver-go)), you need to make sure the `TAOS_FQDN` is resolvable at `/etc/hosts` or with custom DNS service.
If you set the `TAOS_FQDN` to host's hostname, it will works as using `hosts` network like previous use case. Otherwise, like in `-e TAOS_FQDN=tdengine`, you can add the hostname record `tdengine` into `/etc/hosts` (use `127.0.0.1` here in host path, if use TDengine client/application in other hosts, you should set the right ip to the host eg. `192.168.10.1`(check the real ip in host with `hostname -i` or `ip route list default`) to make the TDengine endpoint resolvable):
...
...
@@ -391,7 +391,7 @@ test_td-1_1 /usr/bin/entrypoint.sh taosd Up 6030/tcp, 6031/tcp,