---
title: Concepts
---

## A Typical Time-Series Data Scenario

In typical IoT, Connected Vehicles, and IT Monitoring scenarios, there are often one or more types of data collection points, each collecting one or more metrics. For a given type of data collection point, there are usually many individual points distributed across different places. All collected metric data is time-stamped, and its volume grows over time, while each data collection point also has static attributes. For a specific type of data collection point, the collected data and static attributes are always structured. Take smart power meters as an example. Each smart meter collects three metrics: current, voltage, and phase. The collected data points are similar to the following table:

<figure><table>
<thead><tr>
    <th style="text-align:center;">Device ID</th>
    <th style="text-align:center;">Time Stamp</th>
    <th style="text-align:center;" colspan="3">Collected Metrics</th>
    <th style="text-align:center;" colspan="2">Tags</th>
    </tr>
<tr>
<th style="text-align:center;">Device ID</th>
<th style="text-align:center;">Time Stamp</th>
<th style="text-align:center;">current</th>
<th style="text-align:center;">voltage</th>
<th style="text-align:center;">phase</th>
<th style="text-align:center;">location</th>
<th style="text-align:center;">groupId</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;">d1001</td>
<td style="text-align:center;">1538548685000</td>
<td style="text-align:center;">10.3</td>
<td style="text-align:center;">219</td>
<td style="text-align:center;">0.31</td>
<td style="text-align:center;">Beijing.Chaoyang</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td style="text-align:center;">d1002</td>
<td style="text-align:center;">1538548684000</td>
<td style="text-align:center;">10.2</td>
<td style="text-align:center;">220</td>
<td style="text-align:center;">0.23</td>
<td style="text-align:center;">Beijing.Chaoyang</td>
<td style="text-align:center;">3</td>
</tr>
<tr>
<td style="text-align:center;">d1003</td>
<td style="text-align:center;">1538548686500</td>
<td style="text-align:center;">11.5</td>
<td style="text-align:center;">221</td>
<td style="text-align:center;">0.35</td>
<td style="text-align:center;">Beijing.Haidian</td>
<td style="text-align:center;">3</td>
</tr>
<tr>
<td style="text-align:center;">d1004</td>
<td style="text-align:center;">1538548685500</td>
<td style="text-align:center;">13.4</td>
<td style="text-align:center;">223</td>
<td style="text-align:center;">0.29</td>
<td style="text-align:center;">Beijing.Haidian</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td style="text-align:center;">d1001</td>
<td style="text-align:center;">1538548695000</td>
<td style="text-align:center;">12.6</td>
<td style="text-align:center;">218</td>
<td style="text-align:center;">0.33</td>
<td style="text-align:center;">Beijing.Chaoyang</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td style="text-align:center;">d1004</td>
<td style="text-align:center;">1538548696600</td>
<td style="text-align:center;">11.8</td>
<td style="text-align:center;">221</td>
<td style="text-align:center;">0.28</td>
<td style="text-align:center;">Beijing.Haidian</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td style="text-align:center;">d1002</td>
<td style="text-align:center;">1538548696650</td>
<td style="text-align:center;">10.3</td>
<td style="text-align:center;">218</td>
<td style="text-align:center;">0.25</td>
<td style="text-align:center;">Beijing.Chaoyang</td>
<td style="text-align:center;">3</td>
</tr>
<tr>
<td style="text-align:center;">d1001</td>
<td style="text-align:center;">1538548696800</td>
<td style="text-align:center;">12.3</td>
<td style="text-align:center;">221</td>
<td style="text-align:center;">0.31</td>
<td style="text-align:center;">Beijing.Chaoyang</td>
<td style="text-align:center;">2</td>
</tr>
</tbody>
</table></figure>

<center> Table 1: Smart meter example data </center>

Each row contains the device ID, timestamp, collected metrics (current, voltage, and phase, as above), and static tags (location and groupId in Table 1) associated with the device. Each device generates a row (data point) on a predefined timer or when triggered by an external event. The result is a sequence of data points, like a stream.
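
As a rough illustration of this row structure, the sketch below models a few rows of Table 1 in Python and groups them into one stream per device. The `DataPoint` class and its field names are hypothetical, and the timestamps are assumed to be milliseconds since the Unix epoch, which the table does not state.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One row of Table 1: a time-stamped reading plus the device's static tags."""
    device_id: str
    ts_ms: int        # timestamp, assumed to be milliseconds since the Unix epoch
    current: float
    voltage: int
    phase: float
    location: str     # tag
    group_id: int     # tag


# The first five rows of Table 1.
rows = [
    DataPoint("d1001", 1538548685000, 10.3, 219, 0.31, "Beijing.Chaoyang", 2),
    DataPoint("d1002", 1538548684000, 10.2, 220, 0.23, "Beijing.Chaoyang", 3),
    DataPoint("d1003", 1538548686500, 11.5, 221, 0.35, "Beijing.Haidian", 3),
    DataPoint("d1004", 1538548685500, 13.4, 223, 0.29, "Beijing.Haidian", 2),
    DataPoint("d1001", 1538548695000, 12.6, 218, 0.33, "Beijing.Chaoyang", 2),
]

# Group the rows into one time-ordered stream per device.
streams = defaultdict(list)
for row in sorted(rows, key=lambda r: r.ts_ms):
    streams[row.device_id].append(row)

for device_id, points in sorted(streams.items()):
    print(device_id, [p.ts_ms for p in points])
```

Note that the tag values repeat unchanged in every row of a given device, while the metrics change from row to row.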

## Data Characteristics

The data points generated by IoT, Connected Vehicles, and IT Monitoring have some strong common characteristics:

1. Data points are always time-stamped;
2. Metrics are always structured data;
3. There are rarely delete/update operations on collected data;
4. Unlike traditional databases, transaction processing is not required;
5. The ratio of reading over writing is much lower than typical Internet applications;
6. Data volume is stable and can be predicted according to the number of devices and sampling rate;
7. The user pays attention to the trend of data, not a specific data point at a specific time;
8. Data points are removed once they reach the end of their lifetime (retention policy);
9. The query is always executed in a given time range and space;
10. Real-time computing or query is desired;
11. Data volume is huge; a system may generate over 10 billion data points in a day.

By exploiting these characteristics, TDengine designs its storage and computing engine specifically for time-series data, which greatly improves system efficiency.
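
Characteristics 6 and 11 can be made concrete with a little arithmetic. The sketch below estimates daily data volume from the number of devices and the sampling interval. The fleet size and the one-minute interval are hypothetical figures chosen for illustration, not numbers from a real deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(num_devices: int, sampling_interval_s: int) -> int:
    """Rows generated per day when every device reports once per interval."""
    return num_devices * (SECONDS_PER_DAY // sampling_interval_s)


# Hypothetical fleet: 10 million meters, each reporting once a minute.
daily = points_per_day(10_000_000, 60)
print(f"{daily:,} data points per day")
```

Under these assumptions the fleet produces 14.4 billion data points per day, which is consistent with the "over 10 billion" order of magnitude mentioned above.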

## Metric

A metric is a physical quantity collected by sensors, equipment, or other types of data collection devices, such as current, voltage, temperature, pressure, or GPS position. It changes over time, and its data type can be integer, float, Boolean, or string. As time goes by, the amount of stored metric data increases.

## Label/Tag

A label (or tag) is a static property of a sensor, device, or other type of data collection device that does not change over time, such as the device model, color, or fixed location. The data type can be any type. Although tags are static, TDengine allows users to add, delete, or update tag values. Unlike collected metric data, the amount of stored tag data does not grow over time.

## Data Collection Point

A Data Collection Point (DCP) is a piece of hardware or software that collects metrics at preset time intervals or when triggered by events. A data collection point can collect one or more metrics, but these metrics are collected at the same time and share the same timestamp. Complex equipment often has multiple data collection points, and the sampling rate of each may be different and fully independent. For example, a car may have one data collection point for GPS position metrics, one for engine status metrics, and one for the environment metrics inside the car, so the car has three data collection points.

## Relational Database Model

Since time-series data is most likely to be structured, TDengine adopts the traditional relational database model to process it, which keeps the learning curve short. You create a database, create tables with schema definitions, then insert data points and execute queries to explore the data. Structured storage is used instead of NoSQL's key-value storage.

Compared with a general-purpose database, TDengine adopts a "one table for one data collection point" strategy to significantly improve the data ingestion rate and query speed. At the same time, it introduces the "Super Table" so that each table can carry a set of labels, which makes aggregation across tables efficient.

## One Table for One Data Collection Point

To take advantage of these time-series data characteristics, TDengine requires the user to create a table for each data collection point (DCP) to store the collected time-series data. For example, if there are over 10 million smart meters, over 10 million tables must be created. For Table 1 above, four tables must be created, one each for devices d1001, d1002, d1003, and d1004, to store the collected data. This design has several benefits:

1. Since the metric data from different DCPs is fully independent, the data source of each DCP is unique, and each table has only one writer. Data points can therefore be written in a lock-free manner, which greatly improves the writing speed.
2. The metric data generated by a DCP is ordered by timestamp, so the write operation can be implemented as a simple append, which further improves the writing speed.
3. The metric data from a DCP is stored contiguously, block by block. Reading data for a period of time therefore needs far fewer random read operations, which can improve read and query performance by orders of magnitude.
4. Inside a data block for a DCP, columnar storage is used, and different compression algorithms are used for different data types. Because the metrics from a DCP change little within a time range, the compression ratio is higher.

If the metric data of multiple DCPs is written into a single table, as is traditional, the arrival order of data from different DCPs at the server cannot be guaranteed because of uncontrollable network delay. The write operation must then be protected by locks, and the metric data from one DCP cannot be guaranteed to be stored contiguously. **One table for one data collection point ensures the best possible insert and query performance for a single data collection point.**
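
The contrast between a shared table and per-DCP tables can be sketched in a few lines of Python. This is a toy model, not TDengine's storage engine: the `DcpTable` class is hypothetical, and the arrival order is invented to show one delayed row from d1002.

```python
class DcpTable:
    """Append-only table for a single data collection point (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.rows = []   # (timestamp, value) pairs in arrival order

    def append(self, ts, value):
        # A single writer delivers rows in time order, so a plain append
        # keeps the table sorted without locking or re-sorting.
        if self.rows and ts <= self.rows[-1][0]:
            raise ValueError(f"{self.name}: out-of-order timestamp {ts}")
        self.rows.append((ts, value))


# Arrival order at the server: the first d1002 row is delayed by the network,
# so the combined sequence is not sorted by timestamp.
arrivals = [
    ("d1001", 1538548685000, 10.3),
    ("d1001", 1538548695000, 12.6),
    ("d1002", 1538548684000, 10.2),
    ("d1002", 1538548696650, 10.3),
    ("d1001", 1538548696800, 12.3),
]

tables = {}
for device_id, ts, current in arrivals:
    if device_id not in tables:
        tables[device_id] = DcpTable(device_id)
    tables[device_id].append(ts, current)

shared = [ts for _, ts, _ in arrivals]
print("shared table sorted:", shared == sorted(shared))
for name, table in tables.items():
    ts_list = [ts for ts, _ in table.rows]
    print(name, "sorted:", ts_list == sorted(ts_list))
```

The shared sequence is out of order, while each per-device table stays sorted and contiguous using nothing more than an append.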

TDengine suggests using the DCP ID as the table name (like d1001 in Table 1). Each DCP may collect one or more metrics (like the current, voltage, and phase above). Each metric has a corresponding column in the table. The data type of a column can be int, float, string, or others. In addition, the first column in the table must be a timestamp. TDengine uses the timestamp as the index and does not build an index on any stored metric. Column-wise storage is used.
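
To show why a sorted timestamp column can serve as the only index, here is a minimal Python sketch of a column-wise table. The `ColumnarTable` class is hypothetical and says nothing about how TDengine lays out data on disk. It only illustrates that a time-range query reduces to two binary searches followed by a slice of one column.

```python
from bisect import bisect_left, bisect_right


class ColumnarTable:
    """Column-wise table whose first column is the timestamp (illustrative only)."""

    def __init__(self, metric_names):
        self.ts = []                                   # first column: timestamps
        self.columns = {name: [] for name in metric_names}

    def insert(self, ts, **metrics):
        # Rows are assumed to arrive in time order, so appending keeps `ts` sorted.
        self.ts.append(ts)
        for name, column in self.columns.items():
            column.append(metrics[name])

    def select(self, metric, start_ts, end_ts):
        """Return the values of one metric with start_ts <= ts <= end_ts."""
        lo = bisect_left(self.ts, start_ts)
        hi = bisect_right(self.ts, end_ts)
        return self.columns[metric][lo:hi]


# The three d1001 rows from Table 1.
d1001 = ColumnarTable(["current", "voltage", "phase"])
d1001.insert(1538548685000, current=10.3, voltage=219, phase=0.31)
d1001.insert(1538548695000, current=12.6, voltage=218, phase=0.33)
d1001.insert(1538548696800, current=12.3, voltage=221, phase=0.31)

print(d1001.select("voltage", 1538548690000, 1538548700000))
```

Because each metric lives in its own list, the query touches only the timestamp column and the one metric column it needs.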

## STable: A Collection of Data Collection Points of the Same Type

The design of one table for one data collection point requires a huge number of tables, which is difficult to manage. Furthermore, applications often need to aggregate data across DCPs, and with so many tables these aggregation operations become complicated. To support aggregation over multiple tables efficiently, TDengine introduces the STable (Super Table) concept.

A STable is a set of data collection points of one type. It contains data collection points (tables) that have the same schema, or data structure, but different static attributes (tags). To describe a STable, in addition to defining the table structure of the metrics, it is also necessary to define the schema of its tags. The data type of a tag can be int, float, or string, and there can be multiple tags, which can be added, deleted, or modified afterward. If the whole system has N different types of data collection points, N STables need to be created.

In the design of TDengine, **a table is used to represent a specific data collection point, and a STable is used to represent a set of data collection points of the same type**. When creating a table for a specific data collection point, the user uses a STable as a template and specifies the tag values of that DCP (table). Unlike a table in a traditional relational database, a table (a DCP) here has static tags, and these tags can be added, deleted, and updated afterward. The relationship between a STable and the tables created from it is as follows:

1. A STable contains multiple tables with the same metric schema but different tag values.
2. The schema of metrics or labels cannot be adjusted through tables; it can only be changed via the STable. Changes to the schema of a STable take effect immediately for all tables that belong to it.
3. A STable defines only a template and does not store any data or label information itself. Therefore, data cannot be written to a STable, only to tables.
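
The three relationships above can be modeled with a pair of small Python classes. This is a conceptual sketch only: the `STable` and `Table` classes, the `meters` name, and the extra `temperature` column are all hypothetical and are not TDengine APIs.

```python
class STable:
    """Template holding the metric schema and tag schema; stores no rows itself."""

    def __init__(self, name, metric_schema, tag_schema):
        self.name = name
        self.metric_schema = dict(metric_schema)   # column name -> type
        self.tag_schema = dict(tag_schema)         # tag name -> type
        self.tables = {}                           # child tables by name

    def create_table(self, table_name, **tags):
        # Every child table must supply a value for each tag in the tag schema.
        if set(tags) != set(self.tag_schema):
            raise ValueError("tag values must match the STable's tag schema")
        table = Table(table_name, self, tags)
        self.tables[table_name] = table
        return table

    def add_metric(self, column, col_type):
        # Schema changes go through the STable and are visible to every child table.
        self.metric_schema[column] = col_type


class Table:
    """One data collection point: shares the STable schema, owns tags and rows."""

    def __init__(self, name, stable, tags):
        self.name = name
        self.stable = stable
        self.tags = tags
        self.rows = []

    @property
    def schema(self):
        return self.stable.metric_schema


meters = STable(
    "meters",
    metric_schema={"ts": "timestamp", "current": "float", "voltage": "int", "phase": "float"},
    tag_schema={"location": "string", "groupId": "int"},
)
d1001 = meters.create_table("d1001", location="Beijing.Chaoyang", groupId=2)
d1003 = meters.create_table("d1003", location="Beijing.Haidian", groupId=3)

meters.add_metric("temperature", "float")   # hypothetical extra column
print(list(d1001.schema))
```

Both child tables see the new column immediately because they read the schema from the template, while each keeps its own tag values and rows.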

Queries can be executed on both tables and STables. For a query on a STable, TDengine treats the data in all tables that belong to it as a single data set. TDengine first finds the tables that meet the tag filter conditions, then scans the time-series data of those tables to perform the aggregation. This greatly reduces the data to be scanned and thus greatly improves the performance of data aggregation across multiple DCPs.
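
The two-step query described here (filter tables by tag, then scan only the matching tables) can be sketched as follows. The dictionary layout and the `avg_voltage` helper are hypothetical, and the rows are the voltage readings from Table 1. A real engine would do far more, but the order of operations is the point.

```python
# Child tables of a hypothetical "meters" STable: tags plus (timestamp, voltage) rows.
tables = {
    "d1001": {"tags": {"location": "Beijing.Chaoyang", "groupId": 2},
              "rows": [(1538548685000, 219), (1538548695000, 218), (1538548696800, 221)]},
    "d1002": {"tags": {"location": "Beijing.Chaoyang", "groupId": 3},
              "rows": [(1538548684000, 220), (1538548696650, 218)]},
    "d1003": {"tags": {"location": "Beijing.Haidian", "groupId": 3},
              "rows": [(1538548686500, 221)]},
    "d1004": {"tags": {"location": "Beijing.Haidian", "groupId": 2},
              "rows": [(1538548685500, 223), (1538548696600, 221)]},
}


def avg_voltage(tables, tag_filter, start_ts, end_ts):
    """Average voltage over the tables whose tags pass tag_filter."""
    # Step 1: use the small tag data to pick the matching tables.
    matched = [name for name, t in tables.items() if tag_filter(t["tags"])]
    # Step 2: scan the time-series rows of the matching tables only.
    values = [v for name in matched
              for ts, v in tables[name]["rows"] if start_ts <= ts <= end_ts]
    return matched, (sum(values) / len(values) if values else None)


matched, avg = avg_voltage(
    tables,
    tag_filter=lambda tags: tags["location"] == "Beijing.Haidian",
    start_ts=1538548680000,
    end_ts=1538548700000,
)
print(matched, avg)
```

Only d1003 and d1004 are scanned; the rows of the two Chaoyang meters are never read.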

## FQDN & End Point

An FQDN (Fully Qualified Domain Name) is the full domain name of a specific computer or host on the Internet. An FQDN consists of two parts: the hostname and the domain name. For example, the FQDN of a mail server might be mail.tdengine.com, where the hostname is mail and the host is located in the domain tdengine.com. DNS (Domain Name System) is responsible for translating an FQDN into an IP address. On systems without DNS, the same mapping can be set up by configuring the hosts file.

Each node of a TDengine cluster is uniquely identified by an End Point, which consists of an FQDN and a port, such as h1.tdengine.com:6030. In this way, when the IP address changes, the FQDN can still be used to find the node dynamically without changing any configuration of the cluster. In addition, using the FQDN makes it easier to access the same cluster in a unified way from both the Intranet and the Internet.
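
As a small illustration of the End Point format, the Python sketch below splits an `fqdn:port` string into its parts. The `EndPoint` type and `parse_end_point` function are hypothetical helpers written for this example and are not part of any TDengine client library.

```python
from typing import NamedTuple


class EndPoint(NamedTuple):
    fqdn: str
    port: int


def parse_end_point(text: str) -> EndPoint:
    """Split an 'fqdn:port' string into its two parts."""
    fqdn, sep, port = text.rpartition(":")
    if not sep or not fqdn or not port.isdigit():
        raise ValueError(f"not a valid end point: {text!r}")
    return EndPoint(fqdn, int(port))


ep = parse_end_point("h1.tdengine.com:6030")
# The FQDN itself splits into a hostname and a domain name at the first dot.
hostname, _, domain = ep.fqdn.partition(".")
print(ep.fqdn, ep.port, hostname, domain)
```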

TDengine does not recommend using IP addresses to access the cluster, because this makes cluster management harder.