Commit dd81e34a authored by: D danielclow

doc: english version of developer guide

Parent: e95ff92a
---
sidebar_label: Stream Processing
description: "The TDengine stream processing engine combines data inserts, preprocessing, analytics, real-time computation, and alerting into a single component."
title: Stream Processing
---
Raw time-series data is often cleaned and preprocessed before being permanently stored in a database. In a traditional time-series solution, this generally requires the deployment of stream processing systems such as Kafka or Flink. However, the complexity of such systems increases the cost of development and maintenance.
With the stream processing engine built into TDengine, you can process incoming data streams in real time and define stream transformations in SQL. Incoming data is automatically processed, and the results are pushed to specified tables based on triggering rules that you define. This is a lightweight alternative to complex processing engines that returns computation results in milliseconds even in high throughput scenarios.
The stream processing engine includes data filtering, scalar function computation (including user-defined functions), and window aggregation, with support for sliding windows, session windows, and event windows. Stream processing can write data to supertables from other supertables, standard tables, or subtables. When you create a stream, the target supertable is automatically created. New data is then processed and written to that supertable according to the rules defined for the stream. You can use PARTITION BY statements to partition the data by table name or tag. Separate partitions are then written to different subtables within the target supertable.
TDengine stream processing supports the aggregation of supertables that are deployed across multiple vnodes. It can also handle out-of-order writes and includes a watermark mechanism that determines the extent to which out-of-order data is accepted by the system. You can configure whether to drop or reprocess out-of-order data through the **ignore expired** parameter.
For more information, see [Stream Processing](../../taos-sql/stream).
## Create a Stream
```sql
CREATE STREAM [IF NOT EXISTS] stream_name [stream_options] INTO stb_name AS subquery
stream_options: {
TRIGGER [AT_ONCE | WINDOW_CLOSE | MAX_DELAY time]
WATERMARK time
IGNORE EXPIRED [0 | 1]
}
```
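For illustration, the following sketch applies these options to the `meters` supertable used in the scenarios below. It triggers computation when each window closes, tolerates 15 seconds of out-of-order data through the watermark, and drops data that arrives after its window has been finalized. The stream name, target table name, and option values are illustrative only.
```sql
CREATE STREAM IF NOT EXISTS avg_vol_stream
TRIGGER WINDOW_CLOSE
WATERMARK 15s
IGNORE EXPIRED 1
INTO avg_vol_output_stb AS
SELECT _wstart AS wstart, _wend AS wend, AVG(voltage) AS avg_voltage
FROM meters
INTERVAL(5s);
```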
For more information, see [Stream Processing](../../taos-sql/stream).
## Usage Scenario 1
It is common for the smart electrical meter systems used by businesses to generate millions of data points that are widely dispersed and not ordered. The time required to clean and convert this data makes efficient, real-time processing impossible for traditional solutions. This scenario shows how you can configure TDengine stream processing to drop data points whose voltage exceeds 220 V, find the maximum current in each 5-second window, and output the results to a table.
### Create a Database for Raw Data
A database including one supertable and four subtables is created as follows:
```sql
DROP DATABASE IF EXISTS power;
CREATE DATABASE power;
USE power;
CREATE STABLE meters (ts timestamp, current float, voltage int, phase float) TAGS (location binary(64), groupId int);
CREATE TABLE d1001 USING meters TAGS ("California.SanFrancisco", 2);
CREATE TABLE d1002 USING meters TAGS ("California.SanFrancisco", 3);
CREATE TABLE d1003 USING meters TAGS ("California.LosAngeles", 2);
CREATE TABLE d1004 USING meters TAGS ("California.LosAngeles", 3);
```
### Create a Stream
```sql
create stream current_stream into current_stream_output_stb as select _wstart as start, _wend as end, max(current) as max_current from meters where voltage <= 220 interval (5s);
```
### Write Data
```sql
insert into d1001 values("2018-10-03 14:38:05.000", 10.30000, 219, 0.31000);
insert into d1001 values("2018-10-03 14:38:15.000", 12.60000, 218, 0.33000);
insert into d1001 values("2018-10-03 14:38:16.800", 12.30000, 221, 0.31000);
insert into d1002 values("2018-10-03 14:38:16.650", 10.30000, 218, 0.25000);
insert into d1003 values("2018-10-03 14:38:05.500", 11.80000, 221, 0.28000);
insert into d1003 values("2018-10-03 14:38:16.600", 13.40000, 223, 0.29000);
insert into d1004 values("2018-10-03 14:38:05.000", 10.80000, 223, 0.29000);
insert into d1004 values("2018-10-03 14:38:06.500", 11.50000, 221, 0.35000);
```
### Query the Results
```sql
taos> select start, end, max_current from current_stream_output_stb;
start | end | max_current |
===========================================================================
2018-10-03 14:38:05.000 | 2018-10-03 14:38:10.000 | 10.30000 |
2018-10-03 14:38:15.000 | 2018-10-03 14:38:20.000 | 12.60000 |
Query OK, 2 rows in database (0.018762s)
```
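Note that the WHERE clause filters out all readings above 220 V before aggregation, so the writes to d1003 and d1004, as well as the 221 V reading from d1001, do not contribute to the results. The maximum currents of 10.30000 and 12.60000 are therefore taken over the remaining d1001 and d1002 rows in the two 5-second windows.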
## Usage Scenario 2
In this scenario, the active power and reactive power are determined from the data gathered in the previous scenario. The location and name of each meter are concatenated with a period (.) between them, and the data set is partitioned by meter name, with each partition written to a separate subtable of a new supertable.
### Create a Database for Raw Data
The procedure from the previous scenario is used to create the database.
### Create a Stream
```sql
create stream power_stream into power_stream_output_stb as select ts, concat_ws(".", location, tbname) as meter_location, current*voltage*cos(phase) as active_power, current*voltage*sin(phase) as reactive_power from meters partition by tbname;
```
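As a quick sanity check on these formulas, consider the first row written to d1004 in the previous scenario (current 10.8, voltage 223, phase 0.29): active power = 10.8 × 223 × cos(0.29) ≈ 2307.83 and reactive power = 10.8 × 223 × sin(0.29) ≈ 688.69, matching the corresponding row in the query results below.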
### Write Data
The procedure from the previous scenario is used to write the data.
### Query the Results
```sql
taos> select ts, meter_location, active_power, reactive_power from power_stream_output_stb;
ts | meter_location | active_power | reactive_power |
===================================================================================================================
2018-10-03 14:38:05.000 | California.LosAngeles.d1004 | 2307.834596289 | 688.687331847 |
2018-10-03 14:38:06.500 | California.LosAngeles.d1004 | 2387.415754896 | 871.474763418 |
2018-10-03 14:38:05.500 | California.LosAngeles.d1003 | 2506.240411679 | 720.680274962 |
2018-10-03 14:38:16.600 | California.LosAngeles.d1003 | 2863.424274422 | 854.482390839 |
2018-10-03 14:38:05.000 | California.SanFrancisco.d1001 | 2148.178871730 | 688.120784090 |
2018-10-03 14:38:15.000 | California.SanFrancisco.d1001 | 2598.589176205 | 890.081451418 |
2018-10-03 14:38:16.800 | California.SanFrancisco.d1001 | 2588.728381186 | 829.240910475 |
2018-10-03 14:38:16.650 | California.SanFrancisco.d1002 | 2175.595991997 | 555.520860397 |
Query OK, 8 rows in database (0.014753s)
```
---
sidebar_label: Caching
title: Caching
description: "This document describes the caching component of TDengine."
---
TDengine uses various kinds of caching techniques to efficiently write and query data. This document describes the caching component of TDengine.
## Write Cache
TDengine uses an insert-driven cache management policy, known as first in, first out (FIFO). This policy differs from read-driven "least recently used" (LRU) cache management. A FIFO policy stores the latest data in cache and flushes the oldest data from cache to disk when cache usage reaches a threshold. In IoT use cases, the most recent data or the current state is most important. The cache policy in TDengine, like much of the design and architecture of TDengine, is based on the nature of IoT data.
When you create a database, you can configure the size of the write cache on each vnode. The **vgroups** parameter determines the number of vgroups that process data in the database, and the **buffer** parameter determines the size of the write cache for each vnode. The total write cache for the database is therefore `vgroups * buffer`.
```sql
create database db0 vgroups 100 buffer 16MB
```
The preceding SQL statement creates a database with 100 vgroups in which each vnode has a 16 MB write cache, giving the database a total write cache of 100 * 16 MB = 1,600 MB.
In theory, larger cache sizes are always better. However, at a certain point, increasing the cache size no longer improves performance. In most scenarios, you can retain the default value of the `buffer` parameter.
## Read Cache
When you create a database, you can configure whether the latest data from every subtable is cached. To do so, set the *cachelast* parameter as follows:
- 0: Caching is disabled.
- 1: The last row of each subtable is cached. This option significantly improves the performance of the `LAST_ROW` function.
- 2: The latest non-null value of each column in each subtable is cached. This option significantly improves the performance of the `LAST` function in queries that do not contain a WHERE, GROUP BY, ORDER BY, or INTERVAL clause.
- 3: Both rows and columns are cached. This option is equivalent to enabling options 1 and 2 simultaneously.
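For example, the following sketch, which follows the pattern of the other database creation statements in this document (the database name is illustrative and the exact parameter syntax may differ between TDengine versions), enables caching of the last row of each subtable:
```sql
create database db0 cachelast 1
```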
## Metadata Cache
To improve query and write performance, each vnode caches the metadata that it has already retrieved. When you create a database, you can configure the size of the metadata cache through the *pages* and *pagesize* parameters.
```sql
create database db0 pages 128 pagesize 16kb
```
The preceding SQL statement creates 128 metadata cache pages of 16 KB each on every vnode in the `db0` database, giving each vnode a metadata cache of `128 * 16 KB = 2 MB`.
## File System Cache
TDengine implements data reliability by means of a write-ahead log (WAL). Writing data to the WAL is essentially writing data to the disk in an ordered, append-only manner. For this reason, the file system cache plays an important role in write performance. When you create a database, you can use the *wal* parameter to choose between higher performance and higher reliability.
- 1: Write to the WAL but do not invoke fsync. New data written to the WAL is stored in the file system cache but not immediately flushed to disk. This provides better performance.
- 2: Write to the WAL and invoke fsync. New data written to the WAL is immediately flushed to disk. This provides better data reliability.
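As an illustration, the following sketch, based on the *wal* parameter described above (treat the exact syntax as an assumption; parameter names may differ between TDengine versions), creates a database that favors reliability by enabling fsync:
```sql
create database db0 wal 2
```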
## Client Cache
In addition to the server-side caching discussed previously, the core client library `libtaos.so` (also referred to as `taosc`), on which all client programs depend, maintains its own cache. It caches the metadata of the databases, supertables, and tables that the client has accessed, as well as other critical metadata such as the cluster topology.
If one client modifies certain metadata while multiple clients are accessing a TDengine cluster, the metadata caches on the other clients may become stale or invalid. If this occurs, run the `reset query cache` command on the affected clients to force them to discard their caches and fetch fresh metadata.
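For example, the command can be run from the TDengine CLI on each affected client:
```sql
taos> reset query cache;
```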
```c
{{#include docs/examples/c/tmq_example.c}}
```
```py
{{#include docs/examples/python/tmq_example.py}}
```
---
title: Developer Guide
---
Before creating an application to process time-series data with TDengine, consider the following:
1. Choose the method to connect to TDengine. TDengine offers a REST API that can be used with any programming language. It also has connectors for a variety of languages.
2. Design the data model based on your own use cases. Consider the main [concepts](/concept/) of TDengine, including "one table per data collection point" and the supertable. Learn about static labels, collected metrics, and subtables. Depending on the characteristics of your data and your requirements, you may decide to create one or more databases and design a supertable schema that fits your data.
3. Decide how you will insert data. TDengine supports writing using standard SQL, but also supports schemaless writing, so that data can be written directly without creating tables manually. (A brief SQL insert sketch is shown after this list.)
4. Based on business requirements, find out what SQL query statements need to be written. You may be able to repurpose any existing SQL.
5. If you want to run real-time analysis based on time-series data, including various dashboards, use the TDengine stream processing component instead of deploying complex systems such as Spark or Flink.
6. If your application has modules that need to consume inserted data, and they need to be notified when new data is inserted, it is recommended that you use the data subscription function provided by TDengine without the need to deploy Kafka.
7. In many use cases (such as fleet management), the application needs to obtain the latest status of each data collection point. It is recommended that you use the cache function of TDengine instead of deploying Redis separately.
8. If you find that the SQL functions of TDengine cannot meet your requirements, then you can use user-defined functions to solve the problem.
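As mentioned in step 3, the following minimal sketch shows data being written with standard SQL. It reuses the `meters` supertable and `d1001` subtable from the stream processing chapter of this guide; the schema and values are illustrative only.
```sql
-- Create the supertable and one subtable, then insert a reading.
CREATE STABLE IF NOT EXISTS meters (ts TIMESTAMP, current FLOAT, voltage INT, phase FLOAT) TAGS (location BINARY(64), groupId INT);
CREATE TABLE IF NOT EXISTS d1001 USING meters TAGS ("California.SanFrancisco", 2);
INSERT INTO d1001 VALUES (NOW, 10.3, 219, 0.31);
```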