In typical IoT, Connected Vehicles and IT Monitoring scenarios, there are often one or multiple types of data collection points that collect one or multiple metrics. However, for a data collection point type, there are often a number of specific points distributed in places. All the collected metric data are always time-stamped, and the volume of metric data grows with time, but each data collection point has its static attributes. For a speicific type of data collection point, its collected data and static attributes are always structured. Taking power smart meter as an example, each smater meter collects three metrics: current, voltage and phase. The collected data points are similiar to the following table:
Each row contains the device ID, timestamp, collected metrics (current, voltage, phase as above), and static tags (Location and groupId in Table 1) associated with the devices. Each device generates a row (data point) in a pre-defined timer or triggered by an external event. It is a sequence of data points like a stream.
## 数据特征
## Data Characteristics
除时序特征外,仔细研究发现,物联网、车联网、运维监测类数据及其应用还具有很多其他明显的特征。
The data points generated by IoT, Connected Vehicles, and IT Monitoring have some strong common characteristics:
1. 数据是结构化的;
2. 数据极少有更新或删除操作;
3. 无需传统数据库的事务处理;
4. 相对互联网应用,写多读少;
5. 流量平稳,根据设备数量和采集频次,可以预测出来;
6. 用户关注的是一段时间的趋势,而不是某一特点时间点的值;
7. 数据是有保留期限的;
8. 数据的查询分析一定是基于时间段和地理区域的;
9. 系统需要各种实时计算和统计操作,包括降采样、插值等特种操作;
10. 数据量巨大,一天采集的数据就可以超过 100 亿条。
1. Data points are always time stamped;
2. Metrics are always structured data;
3. There are rarely delete/update operations on collected data;
4. Unlike traditional databases, transaction processing is not required;
5. The ratio of reading over writing is much lower than typical Internet applications;
6. Data volume is stable and can be predicted according to the number of devices and sampling rate;
7. The user pays attention to the trend of data, not a specific data point at a specific time;
8. Data points are removed once they reach their life time (retention policy);
9. The query is always executed in a given time range and space;
10. Real-time computing or query is desired;
11. Data volume is huge, a system may generate over 10 billion data points in a day.
By utilizing the above characteristics, TDengine designs the storage and computing engine in a special and optimized way for time-series data, resulting in massive improvements in system efficiency.
Metric refers to the physical quantity collected by sensors, equipment or other types of data collection devices, such as current, voltage, temperature, pressure, GPS position, etc., which changes with time, and the data type can be integer, float, Boolean, or strings. As time goes by, the amount of collected metric data stored increases.
Label/Tage refers to the static properties of sensors, devices or other types of data collection devices, which do not change with time, such as device model, color, fixed location of the device, etc. The data type can be any type. Although static, TDengine allows users to add, delete or update tag values. Unlike the collected metric data, the amount of tag data stored does not change over time.
Data Collection Point(DCP) refers to hardware or software that collects metrics based on preset time periods or triggered by events. A data collection point can collect one or multiple metrics, but these metrics are collected at the same time and have the same time stamp. For some complex equipments, there are often multiple data collection points, and the sampling rate of each collection point may be different, and fully independent. For example, for a car, there is a data collection point to collect GPS position metrics, a data collection point to collect engine status metrics, and a data collection point to collect the environment metrics inside the car, so a car has three data collection points.
Since time-series data is most likely to be structured data, TDengine adopts the traditional relational database model to process them with a short learning curve. You need to create a database, create tables with schema definitions, then insert data points and execute queries to explore the data. Structured storage is used, instead of NoSQL’s key-value storage.
Compared with general database, TDegine adopts "one table for data collection point" strategy to enhance the data ingestion rate and query speed significantly. At the mean time, it introduces "Super Table" to allow each table have a set of labels to make aggregation across tables efficiently.
To utilize this time-series and other data features, TDengine requires the user to create a table for each data collection point (DCP) to store collected time-series data. For example, if there are over 10 million smart meters, it means 10 million tables shall be created. For the table above, 4 tables shall be created for devices D1001, D1002, D1003, and D1004 to store the data collected. This design has several benefits:
1.Since the metric data from different DCP is fully independent, the data source of each DCP is unique, and a table has only one writer. In this way, data points can be written in a lock-free manner, and the writing speed can be greatly improved.
2.For a DCP, the metric data generated by DCP is ordered by timestamp, so the write operation can be implemented by simple appending, which further greatly improves the data writing speed.
3.The metric data from a DCP is continuously stored in block by block. If you read data for a period of time, it can greatly reduce random read operations and improve read and query performance by orders of magnitude.
4.Inside a data block for a DCP, columnar storage is used, and different compression algorithms are used for different data types. Because the change of the metrics from a DCP is not big in a time range, the compression rate is higher.
If the metric data of multiple DPCs are traditionally written into a single table, due to the uncontrollable network delay, the timing of the data from different DCPs arriving at the server cannot be guaranteed, the writing operation must be protected by locks, and the metric data from one DCP cannot be guaranteed to be continuously stored together. **One table for one data collection point can ensure the best performance of insert and query of a single data collection point to the greatest extent.**
TDengine suggests using DCP ID as the table name (like D1001 in the above table). Each DCP may collect one or multiple metrics (like the current, voltage, phase as above). Each metric has a corresponding column in the table. The data type for a column can be int, float, string and others. In addition, the first column in the table must be a timestamp. TDengine uses the time stamp as the index, and won’t build the index on any metrics stored. Column wise storage is used.
对于复杂的设备,比如汽车,它有多个数据采集点,那么就需要为一台汽车建立多张表。
## STable: A Collection of Data Collection Points in the Same Type
## 超级表:同一类型数据采集点的集合
The design of one table for one data collection point will require a huge number of tables, which is difficult to manage. Furthermore, applications often need to take aggregation operations among DCPs, thus aggregation operations will become complicated. To support aggregation over multiple tables efficiently, the STable(Super Table) concept is introduced by TDengine.
STable is an set for a type of data collection point. A STable contains a set of data collection points (tables) that have the same schema or data structure, but with different static attributes(tags). To describe a STable, in addition to defining the table structure of the metrics, it is also necessary to define the schema of its tags. The data type of tags can be int, float, string, and there can be multiple tags, which can be added, deleted, or modified afterward. If the whole system has N different types of data collection points, N STables need to be established.
超级表是指某一特定类型的数据采集点的集合。同一类型的数据采集点,其表的结构是完全一样的,但每个表(数据采集点)的静态属性(标签)是不一样的。描述一个超级表(某一特定类型的数据采集点的集合),除需要定义采集量的表结构之外,还需要定义其标签的 schema,标签的数据类型可以是整数、浮点数、字符串,标签可以有多个,可以事后增加、删除或修改。如果整个系统有 N 个不同类型的数据采集点,就需要建立 N 个超级表。
In the design of TDengine, **a table is used to represent a specific data collection point, and STable is used to represent a set of data collection points of the same type**. When creating a table for a specific data collection point, the user uses a STable as a template and specifies the tag value of the specific DCP (table). Compared with the traditional relational database, the table (a DCP) has static tags, and these tags can be added, deleted, and updated afterward. The relationship between the STable and the tables created based on the STable is as follows:
1. A STable contains multiple tables with the same metric schema but with different tag values.
2. The schema of metrics or labels cannot be adjusted through tables, and it can only be changed via STable. Changes to the schema of a STable takes effect immediately for all belonged tables.
3. STable defines only one template and does not store any data or label information by itself. Therefore, data cannot be written to a STable, only to tables.
Query can be executed on both table and STable. For a query on a STable, TDengine will treat the data in all belonged tables as a whole data set for processing. TDengine will first find out the tables that meet the tag filter conditions, then scan the time-series data of these tables to perform aggregation operation, which can greatly reduce the data sets to be scanned, thus greatly improving the performance of data aggregation across multiple DCPs.
## FQDN & End Point
FQDN (fully qualified domain name, 完全限定域名)是 Internet 上特定计算机或主机的完整域名。FQDN由两部分组成:主机名和域名。例如,假设邮件服务器的FQDN可能是mail.tdengine.com。主机名是mail,主机位于域名tdengine.com中。DNS(Domain Name System),负责将FQDN翻译成IP,是互联网应用的寻址方式。对于没有DNS的系统,可以通过配置hosts文件来解决。
FQDN (Fully Qualified Domain Name) is the full domain name of a specific computer or host on the Internet. FQDN consists of two parts: hostname and domain name. For example, the FQDN of a mail server might be mail.tdengine.com. The hostname is mail, and the host is located in the domain name tdengine.com. DNS (Domain Name System) is responsible for translating FQDN into IP. For systems without DNS, it can be solved by configuring the hosts file.
Each node of a TDengine cluster is uniquely identified by an End Point, which consists of an FQDN and a Port, such as h1.tdengine.com:6030. In this way, when the IP changes, we can still use the FQDN to dynamically find the node without changing any configuration of the cluster. In addition, FQDN is used to facilitate unified access to the same cluster from the Intranet and the Inetnet.