Merge pull request #21151 from taosdata/docs/python-udf

docs: add python udf english version

615be563 · dapan1121 · GitHub · 22f06ce2 · accdcda3 · 615be563
5 changed files
--- a/docs/en/07-develop/09-udf.md
+++ b/docs/en/07-develop/09-udf.md
@@ -6,10 +6,12 @@ description: This document describes how to create user-defined functions (UDF),

 The built-in functions of TDengine may not be sufficient for the use cases of every application. In this case, you can define custom functions for use in TDengine queries. These are known as user-defined functions (UDF). A user-defined function takes one column of data or the result of a subquery as its input.

-TDengine supports user-defined functions written in C or C++. This document describes the usage of user-defined functions.
-
 User-defined functions can be scalar functions or aggregate functions. Scalar functions, such as `abs`, `sin`, and `concat`, output a value for every row of data. Aggregate functions, such as `avg` and `max` output one value for multiple rows of data.

+TDengine supports user-defined functions written in C or Python. This document describes the usage of user-defined functions.
+
+# Implement a UDF in C 
+
 When you create a user-defined function, you must implement standard interface functions:
 - For scalar functions, implement the `scalarfn` interface function.
 - For aggregate functions, implement the `aggfn_start`, `aggfn`, and `aggfn_finish` interface functions.
@@ -17,7 +19,7 @@ When you create a user-defined function, you must implement standard interface f

 There are strict naming conventions for these interface functions. The names of the start, finish, init, and destroy interfaces must be <udf-name\>_start, <udf-name\>_finish, <udf-name\>_init, and <udf-name\>_destroy, respectively. Replace `scalarfn`, `aggfn`, and `udf` with the name of your user-defined function.

-## Implementing a Scalar Function
+## Implementing a Scalar Function in C
 The implementation of a scalar function is described as follows:
 ```c
 #include "taos.h"
@@ -49,7 +51,7 @@ int32_t scalarfn_destroy() {
 ```
 Replace `scalarfn` with the name of your function.

-## Implementing an Aggregate Function
+## Implementing an Aggregate Function in C

 The implementation of an aggregate function is described as follows:
 ```c
@@ -100,7 +102,7 @@ int32_t aggfn_destroy() {
 ```
 Replace `aggfn` with the name of your function.

-## Interface Functions
+## C UDF Interface Functions

 There are strict naming conventions for interface functions. The names of the start, finish, init, and destroy interfaces must be <udf-name\>_start, <udf-name\>_finish, <udf-name\>_init, and <udf-name\>_destroy, respectively. Replace `scalarfn`, `aggfn`, and `udf` with the name of your user-defined function.

@@ -108,7 +110,7 @@ Interface functions return a value that indicates whether the operation was succ

 For information about the parameters for interface functions, see Data Model

-### Interfaces for Scalar Functions
+### Interfaces for C UDF Scalar Functions

 `int32_t scalarfn(SUdfDataBlock* inputDataBlock, SUdfColumn *resultColumn)` 
 
@@ -118,7 +120,7 @@ The parameters in the function are defined as follows:
  - inputDataBlock: The data block to input.
  - resultColumn: The column to output. The column to output. 

-### Interfaces for Aggregate Functions
+### Interfaces for C UDF Aggregate Functions

 `int32_t aggfn_start(SUdfInterBuf *interBuf)`

@@ -126,7 +128,7 @@ The parameters in the function are defined as follows:

 `int32_t aggfn_finish(SUdfInterBuf* interBuf, SUdfInterBuf *result)`

-Replace `aggfn` with the name of your function. In the function, aggfn_start is called to generate a result buffer. Data is then divided between multiple blocks, and aggfn is called on each block to update the result. Finally, aggfn_finish is called to generate final results from the intermediate results. The final result contains only one or zero data points.
+Replace `aggfn` with the name of your function. In the function, aggfn_start is called to generate a result buffer. Data is then divided between multiple blocks, and the `aggfn` function is called on each block to update the result. Finally, aggfn_finish is called to generate the final results from the intermediate results. The final result contains only one or zero data points.

 The parameters in the function are defined as follows:
  - interBuf: The intermediate result buffer.
@@ -135,15 +137,15 @@ The parameters in the function are defined as follows:
  - result: The final result.


-### Initializing and Terminating User-Defined Functions
+### Initializing and Terminating C User-Defined Functions
 `int32_t udf_init()`

 `int32_t udf_destroy()`

-Replace `udf`with the name of your function. udf_init initializes the function. udf_destroy terminates the function. If it is not necessary to initialize your function, udf_init is not required. If it is not necessary to terminate your function, udf_destroy is not required.
+Replace `udf` with the name of your function. udf_init initializes the function. udf_destroy terminates the function. If it is not necessary to initialize your function, udf_init is not required. If it is not necessary to terminate your function, udf_destroy is not required.


-## Data Structure of User-Defined Functions
+## Data Structure of C User-Defined Functions
 ```c
 typedef struct SUdfColumnMeta {
  int16_t type;
@@ -193,7 +195,7 @@ typedef struct SUdfInterBuf {
 ```
 The data structure is described as follows:

- The SUdfDataBlock block includes the number of rows (numOfRows) and number of columns (numCols). udfCols[i] (0 <= i <= numCols-1) indicates that each column is of type SUdfColumn.
+- The SUdfDataBlock block includes the number of rows (numOfRows) and the number of columns (numCols). udfCols[i] (0 <= i <= numCols-1) indicates that each column is of type SUdfColumn.
 - SUdfColumn includes the definition of the data type of the column (colMeta) and the data in the column (colData).
 - The member definitions of SUdfColumnMeta are the same as the data type definitions in `taos.h`.
 - The data in SUdfColumnData can become longer. varLenCol indicates variable-length data, and fixLenCol indicates fixed-length data. 
@@ -201,9 +203,9 @@ The data structure is described as follows:

 Additional functions are defined in `taosudf.h` to make it easier to work with these structures.

-## Compile UDF
+## Compile C UDF

-To use your user-defined function in TDengine, first compile it to a dynamically linked library (DLL).
+To use your user-defined function in TDengine, first compile it into a shared library.

 For example, the sample UDF `bit_and.c` can be compiled into a DLL as follows:

@@ -213,12 +215,9 @@ gcc -g -O0 -fPIC -shared bit_and.c -o libbitand.so

 The generated DLL file `libbitand.so` can now be used to implement your function. Note: GCC 7.5 or later is required.

-## Manage and Use User-Defined Functions
-After compiling your function into a DLL, you add it to TDengine. For more information, see [User-Defined Functions](../12-taos-sql/26-udf.md).
-
-## Sample Code
+## C UDF Sample Code

-### Sample scalar function: [bit_and](https://github.com/taosdata/TDengine/blob/3.0/tests/script/sh/bit_and.c)
+### C UDF Sample scalar function: [bit_and](https://github.com/taosdata/TDengine/blob/3.0/tests/script/sh/bit_and.c)

 The bit_and function implements bitwise addition for multiple columns. If there is only one column, the column is returned. The bit_and function ignores null values.

@@ -231,7 +230,7 @@ The bit_and function implements bitwise addition for multiple columns. If there

 </details>

-### Sample aggregate function: [l2norm](https://github.com/taosdata/TDengine/blob/3.0/tests/script/sh/l2norm.c)
+### C UDF Sample aggregate function 1: [l2norm](https://github.com/taosdata/TDengine/blob/3.0/tests/script/sh/l2norm.c)

 The l2norm function finds the second-order norm for all data in the input column. This squares the values, takes a cumulative sum, and finds the square root.

@@ -243,3 +242,151 @@ The l2norm function finds the second-order norm for all data in the input column
 ```

 </details>
+
+### C UDF Sample aggregate function 2: [max_vol](https://github.com/taosdata/TDengine/blob/develop/tests/script/sh/max_vol.c)
+
+Given several voltage columns as input, the max_vol function returns a string that concatenates the deviceId column, the row number and column number of the maximum voltage, and the maximum voltage itself.
+
+Create the table:
+```sql
+create table battery(ts timestamp, vol1 float, vol2 float, vol3 float, deviceId varchar(16));
+```
+Create the UDF:
+```sql
+create aggregate function max_vol as '/root/udf/libmaxvol.so' outputtype binary(64) bufsize 10240 language 'C';
+```
+Use the UDF in a query:
+```sql
+select max_vol(vol1,vol2,vol3,deviceid) from battery;
+```
+
+<details>
+<summary>max_vol.c</summary>
+
+```c
+{{#include tests/script/sh/max_vol.c}}
+```
+
+</details>
+
+# Implement a UDF in Python
+
+When you implement a UDF in Python, you must implement the specified interface functions:
+- For a scalar UDF, implement the `process` function.
+- For an aggregate UDF, implement the `start`, `reduce`, and `finish` functions.
+- Implement `init` for initialization and `destroy` for termination.
+
+## Implement a Scalar UDF in Python
+
+The implementation of a scalar UDF is described as follows:
+
+```Python
+def init():
+    pass  # perform initialization here
+def destroy():
+    pass  # perform cleanup here
+def process(input: datablock) -> tuple[output_type]:
+    # input.data(row, col) returns the Python object at location (row, col).
+    # Return a tuple in which each item is an object of type output_type.
+    ...
+```
+
+## Implement an Aggregate UDF in Python
+
+The implementation of an aggregate function is described as follows:
+
+```Python
+def init():
+    pass  # perform initialization here
+def destroy():
+    pass  # perform cleanup here
+def start() -> bytes:
+    ...  # return the serialized initial state
+def reduce(inputs: datablock, buf: bytes) -> bytes:
+    # Deserialize buf into state.
+    # Reduce inputs and state into new_state.
+    # Use inputs.data(i, j) to access the Python object at location (i, j).
+    # Serialize new_state into new_state_bytes.
+    return new_state_bytes
+def finish(buf: bytes) -> output_type:
+    ...  # return an object of type output_type
+```
+
+## Python UDF interface functions
+
+### Python UDF scalar interface functions
+```Python
+def process(input: datablock) -> tuple[output_type]:
+```
+- `input` is a data block, a two-dimensional matrix-like object whose `data(row, col)` method returns the Python object at location (`row`, `col`).
+- The function returns a Python tuple in which each item is a Python object of type `output_type`.
+
+### Python UDF aggregate interface functions
+```Python
+def start() -> bytes:
+def reduce(input: datablock, buf: bytes) -> bytes:
+def finish(buf: bytes) -> output_type:
+```
+
+- First, `start` is called to return the initial result as `bytes`.
+- Then the input data is divided into multiple data blocks. For each data block `input`, `reduce` is called with the block and the current intermediate result `buf`, and returns a new intermediate result.
+- Finally, `finish` is called on the intermediate result `buf` and outputs zero or one value of type `output_type`.
+
+
+### Python UDF Initialization and Termination
+```Python
+def init():
+def destroy():
+```
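+Implement `init` for initialization and `destroy` for termination. If no initialization is needed, `init` can be omitted; if no cleanup is needed, `destroy` can be omitted.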
+
+## Mapping Between TDengine SQL Data Types and Python Data Types
+
+The following table describes the mapping between TDengine SQL data types and Python data types. A `NULL` value of any TDengine SQL type is mapped to `None` in Python.
+
+|  **TDengine SQL Data Type**   | **Python Data Type** |
+| :-----------------------: | ------------ |
+|TINYINT / SMALLINT / INT  / BIGINT     | int   |
+|TINYINT UNSIGNED / SMALLINT UNSIGNED / INT UNSIGNED / BIGINT UNSIGNED | int |
+|FLOAT / DOUBLE | float |
+|BOOL | bool |
+|BINARY / VARCHAR / NCHAR | bytes|
+|TIMESTAMP | int |
+|JSON and other types | Not Supported |
+
+## Python UDF Installation
+1. Install the Python package `taospyudf`, which executes Python UDFs:
+```bash
+sudo pip install taospyudf
+ldconfig
+```
+2. If the Python UDF needs PYTHONPATH to locate Python packages at run time, add the contents of PYTHONPATH to the `udfdLdLibPath` variable in the `taos.cfg` configuration file.
+ 
+## Python UDF Sample Code
+### Python UDF Scalar Function Sample Code [pybitand](https://github.com/taosdata/TDengine/blob/develop/tests/script/sh/pybitand.py)
+
+The `pybitand` function implements the bitwise AND operation across multiple columns. If there is only one column, the column is returned. The `pybitand` function ignores null values.
+
+<details>
+<summary>pybitand.py</summary>
+
+```Python
+{{#include tests/script/sh/pybitand.py}}
+```
+
+</details>
+
+### Python UDF Aggregate Function Sample Code [pyl2norm](https://github.com/taosdata/TDengine/blob/develop/tests/script/sh/pyl2norm.py)
+
+The `pyl2norm` function finds the second-order norm for all data in the input column. This squares the values, takes a cumulative sum, and finds the square root.
+<details>
+<summary>pyl2norm.py</summary>
+
+```Python
+{{#include tests/script/sh/pyl2norm.py}}
+```
+
+</details>
+
+## Manage and Use User-Defined Functions
+You must add a UDF to TDengine before you can use it in SQL queries. For more information, see [User-Defined Functions](../12-taos-sql/26-udf.md).
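
As a rough illustration of the Python interface described in `docs/en/07-develop/09-udf.md`, the following sketch shows a scalar `process` function and an aggregate `start`/`reduce`/`finish` trio in one listing; a real UDF script would normally contain only one kind. It is not the `pybitand.py` or `pyl2norm.py` sample. `FakeDataBlock` is a hypothetical stand-in used only for local testing, the `shape()` method is an assumption that the document does not describe, and `pickle` is just one possible way to serialize the intermediate state.

```python
import math
import pickle


class FakeDataBlock:
    """Hypothetical stand-in for the data block that TDengine passes to a UDF."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def data(self, row, col):
        # Mirrors the documented accessor: the Python object at (row, col).
        return self._rows[row][col]

    def shape(self):
        # Assumption: returns (number of rows, number of columns).
        cols = len(self._rows[0]) if self._rows else 0
        return len(self._rows), cols


def init():
    pass  # nothing to initialize in this sketch


def destroy():
    pass  # nothing to clean up in this sketch


def process(block):
    """Scalar UDF: bitwise AND across the columns of each row, skipping None."""
    rows, cols = block.shape()
    results = []
    for i in range(rows):
        acc = None
        for j in range(cols):
            value = block.data(i, j)
            if value is None:
                continue  # SQL NULL arrives as None and is ignored
            acc = value if acc is None else acc & value
        results.append(acc)  # stays None when every column was NULL
    return tuple(results)


def start() -> bytes:
    """Aggregate UDF: the initial state is a running sum of squares of 0.0."""
    return pickle.dumps(0.0)


def reduce(block, buf: bytes) -> bytes:
    """Fold one data block into the serialized running sum of squares."""
    sum_sq = pickle.loads(buf)
    rows, cols = block.shape()
    for i in range(rows):
        for j in range(cols):
            value = block.data(i, j)
            if value is not None:
                sum_sq += value * value
    return pickle.dumps(sum_sq)


def finish(buf: bytes) -> float:
    """Turn the final state into the second-order norm."""
    return math.sqrt(pickle.loads(buf))
```

Because the state crosses each call boundary as `bytes`, `reduce` has to deserialize, update, and serialize again on every block. The sketch can be exercised locally by feeding `FakeDataBlock` instances to `process`, or by chaining `start`, several `reduce` calls, and `finish`.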
--- a/docs/en/12-taos-sql/22-meta.md
+++ b/docs/en/12-taos-sql/22-meta.md
@@ -120,6 +120,9 @@ Provides information about user-defined functions.
 | 5   | create_time | TIMESTAMP    | Creation time       |
 | 6   |  code_len   | INT          | Length of the source code       |
 | 7   |   bufsize   | INT          | Buffer size    |
+| 8   | func_language | BINARY(31) | UDF programming language |
+| 9   | func_body     | BINARY(16384) | UDF function body |
+| 10  | func_version  | INT           | UDF function version, starting from 0 and incremented by 1 each time the function is updated |

 ## INS_INDEXES


--- a/docs/en/12-taos-sql/26-udf.md
+++ b/docs/en/12-taos-sql/26-udf.md
@@ -7,17 +7,18 @@ description: This document describes the SQL statements related to user-defined
 You can create user-defined functions and import them into TDengine.
 ## Create UDF

-SQL command can be executed on the host where the generated UDF DLL resides to load the UDF DLL into TDengine. This operation cannot be done through REST interface or web console. Once created, any client of the current TDengine can use these UDF functions in their SQL commands. UDF are stored in the management node of TDengine. The UDFs loaded in TDengine would be still available after TDengine is restarted.
+To load a UDF into TDengine, execute the SQL command on the host where the generated UDF DLL resides. This operation cannot be performed through the REST interface or the web console. Once a UDF is created, any client of the current TDengine can use it in SQL commands. UDFs are stored in the management node of TDengine and remain available after TDengine is restarted.

 When creating UDF, the type of UDF, i.e. a scalar function or aggregate function must be specified. If the specified type is wrong, the SQL statements using the function would fail with errors. The input data type and output data type must be consistent with the UDF definition.

 - Create Scalar Function
 ```sql
-CREATE FUNCTION function_name AS library_path OUTPUTTYPE output_type;
+CREATE [OR REPLACE] FUNCTION function_name AS library_path OUTPUTTYPE output_type [LANGUAGE 'C|Python'];
 ```
-
-  - function_name: The scalar function name to be used in SQL statement which must be consistent with the UDF name and is also the name of the compiled DLL (.so file).
-  - library_path: The absolute path of the DLL file including the name of the shared object file (.so). The path must be quoted with single or double quotes.
+  - OR REPLACE: If the UDF already exists, its properties are modified.
+  - function_name: The name of the scalar function as used in SQL statements.
+  - LANGUAGE 'C|Python': The programming language of the UDF. C and Python are currently supported. If this clause is omitted, C is assumed.
+  - library_path: For C, the absolute path of the DLL file, including the name of the shared object file (.so). For Python, the absolute path of the Python UDF script. The path must be enclosed in single or double quotes.
  - output_type: The data type of the results of the UDF.

  For example, the following SQL statement can be used to create a UDF from `libbitand.so`.
@@ -25,14 +26,20 @@ CREATE FUNCTION function_name AS library_path OUTPUTTYPE output_type;
  ```sql
  CREATE FUNCTION bit_and AS "/home/taos/udf_example/libbitand.so" OUTPUTTYPE INT;
  ```
+  For example, the following SQL statement modifies the existing function `bit_and`. The output type is changed to BIGINT and the programming language is changed to Python.
+
+  ```sql
+  CREATE OR REPLACE FUNCTION bit_and AS "/home/taos/udf_example/bit_and.py" OUTPUTTYPE BIGINT LANGUAGE 'Python';
+  ```

 - Create Aggregate Function
 ```sql
 CREATE AGGREGATE FUNCTION function_name AS library_path OUTPUTTYPE output_type [ BUFSIZE buffer_size ];
 ```
-
-  - function_name: The aggregate function name to be used in SQL statement which must be consistent with the udfNormalFunc name and is also the name of the compiled DLL (.so file).
-  - library_path: The absolute path of the DLL file including the name of the shared object file (.so). The path must be quoted with single or double quotes.
+  - OR REPLACE: If the UDF already exists, its properties are modified.
+  - function_name: The name of the aggregate function as used in SQL statements.
+  - LANGUAGE 'C|Python': The programming language of the UDF. C and Python are currently supported. If this clause is omitted, C is assumed.
+  - library_path: For C, the absolute path of the DLL file, including the name of the shared object file (.so). For Python, the absolute path of the Python UDF script. The path must be enclosed in single or double quotes.
  - output_type: The output data type, the value is the literal string of the supported TDengine data type.
  - buffer_size: The size of the intermediate buffer in bytes. This parameter is optional.

@@ -41,6 +48,11 @@ CREATE AGGREGATE FUNCTION function_name AS library_path OUTPUTTYPE output_type [
  ```sql
  CREATE AGGREGATE FUNCTION l2norm AS "/home/taos/udf_example/libl2norm.so" OUTPUTTYPE DOUBLE bufsize 8;
  ```
+  For example, the following SQL statement changes the buffer size of the existing UDF `l2norm` to 64:
+  ```sql
+  CREATE OR REPLACE AGGREGATE FUNCTION l2norm AS "/home/taos/udf_example/libl2norm.so" OUTPUTTYPE DOUBLE bufsize 64;
+  ```
+
 For more information about user-defined functions, see [User-Defined Functions](/develop/udf).

 ## Manage UDF
@@ -61,9 +73,9 @@ SHOW FUNCTIONS;

 ## Call UDF

-The function name specified when creating UDF can be used directly in SQL statements, just like builtin functions. For example:
+The function name specified when creating UDF can be used directly in SQL statements, just like built-in functions. For example:
 ```sql
 SELECT bit_and(c1,c2) FROM table;
 ```

-The above SQL statement invokes function X for column c1 and c2 on table. You can use query keywords like WHERE with user-defined functions.
+The above SQL statement invokes the `bit_and` function on columns c1 and c2 of the table. You can use query keywords like WHERE with user-defined functions.
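
The clauses of the `CREATE FUNCTION` statement documented in `docs/en/12-taos-sql/26-udf.md` can be assembled programmatically. The following is a minimal sketch, assuming the clause order shown in the `max_vol` example (`BUFSIZE` before `LANGUAGE`). The helper name `build_create_udf_sql` and its parameters are hypothetical and are not part of TDengine or any of its connectors. Restricting `language` to the two spellings used in this document is a choice made for this sketch.

```python
from typing import Optional


def build_create_udf_sql(
    name: str,
    path: str,
    output_type: str,
    *,
    aggregate: bool = False,
    bufsize: Optional[int] = None,
    language: Optional[str] = None,
    replace: bool = False,
) -> str:
    """Compose a CREATE [OR REPLACE] [AGGREGATE] FUNCTION statement (hypothetical helper)."""
    if language is not None and language not in ("C", "Python"):
        raise ValueError("language must be 'C' or 'Python'")
    if bufsize is not None and not aggregate:
        raise ValueError("BUFSIZE applies only to aggregate functions")
    if "'" in path:
        raise ValueError("path must not contain a single quote")

    parts = ["CREATE"]
    if replace:
        parts.append("OR REPLACE")
    if aggregate:
        parts.append("AGGREGATE")
    # The path is always wrapped in single quotes, as the syntax requires quoting.
    parts += ["FUNCTION", name, "AS", f"'{path}'", "OUTPUTTYPE", output_type]
    if bufsize is not None:
        parts += ["BUFSIZE", str(bufsize)]
    if language is not None:
        # When the LANGUAGE clause is omitted, C is assumed by TDengine.
        parts += ["LANGUAGE", f"'{language}'"]
    return " ".join(parts) + ";"


scalar_sql = build_create_udf_sql(
    "bit_and", "/home/taos/udf_example/bit_and.py", "BIGINT",
    language="Python", replace=True,
)
aggregate_sql = build_create_udf_sql(
    "l2norm", "/home/taos/udf_example/libl2norm.so", "DOUBLE",
    aggregate=True, bufsize=64, replace=True,
)
```

The helper only builds the statement text. Executing it still has to happen through a client on the host where the library or script resides, as the document requires.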
--- a/docs/zh/07-develop/09-udf.md
+++ b/docs/zh/07-develop/09-udf.md
@@ -10,7 +10,7 @@ description: "支持用户编码的聚合函数和标量函数，在查询中嵌

 TDengine 支持通过 C/Python 语言进行 UDF 定义。接下来结合示例讲解 UDF 的使用方法。

-# C 语言实现UDF
+# C 语言实现 UDF

 使用 C 语言实现 UDF 时，需要实现规定的接口函数
 - 标量函数需要实现标量接口函数 scalarfn 。
@@ -269,7 +269,7 @@ select max_vol(vol1,vol2,vol3,deviceid) from battery;

 </details>

-# Python 语言实现UDF
+# Python 语言实现 UDF
 使用 Python 语言实现 UDF 时，需要实现规定的接口函数
 - 标量函数需要实现标量接口函数 process 。
 - 聚合函数需要实现聚合接口函数 start ，reduce ，finish。
@@ -336,7 +336,10 @@ def destroy()

 其中 init 完成初始化工作。 destroy 完成清理工作。如果没有初始化工作，无需定义 init 函数。如果没有清理工作，无需定义 destroy 函数。

-## Python数据类型和TDengine数据类型映射
+## Python 数据类型和 TDengine 数据类型映射
+
+下表描述了TDengine SQL数据类型和Python数据类型的映射。任何类型的NULL值都映射成Python的None值。
+
 |  **TDengine SQL数据类型**   | **Python数据类型** |
 | :-----------------------: | ------------ |
 |TINYINT / SMALLINT / INT  / BIGINT     | int   |
@@ -350,8 +353,8 @@ def destroy()
 ## Python UDF 环境的安装
 1. 安装 taospyudf 包。此包执行Python UDF程序。
 ```bash
-pip install taospyudf
-lddconfig
+sudo pip install taospyudf
+ldconfig
 ```
 2. 如果 Python UDF 程序执行时，通过 PYTHONPATH 引用其它的包，可以设置 taos.cfg 的 UdfdLdLibPath 变量为PYTHONPATH的内容
 
@@ -382,5 +385,5 @@ pyl2norm 实现了输入列的所有数据的二阶范数，即对每个数据

 </details>

-# 管理和使用UDF
-编译好的UDF，还需要将其加入到系统才能被正常的SQL调用。关于如何管理和使用UDF，参见[UDF使用说明](../12-taos-sql/26-udf.md)
\ No newline at end of file
+# 管理和使用 UDF
+需要 UDF 将其加入到系统才能被正常的 SQL 调用。关于如何管理和使用 UDF，参见[UDF使用说明](../12-taos-sql/26-udf.md)
\ No newline at end of file
--- a/docs/zh/12-taos-sql/26-udf.md
+++ b/docs/zh/12-taos-sql/26-udf.md
@@ -11,15 +11,13 @@ description: 使用 UDF 的详细指南

 在创建 UDF 时，需要区分标量函数和聚合函数。如果创建时声明了错误的函数类别，则可能导致通过 SQL 指令调用函数时出错。此外，用户需要保证输入数据类型与 UDF 程序匹配，UDF 输出数据类型与 OUTPUTTYPE 匹配。

-使用 CREATE OR REPLACE FUNCTION，如果函数已经存在，会修改已有的函数属性。
-
 - 创建标量函数
 ```sql
 CREATE [OR REPLACE] FUNCTION function_name AS library_path OUTPUTTYPE output_type [LANGUAGE 'C|Python'];
 ```
-
-  - function_name：标量函数未来在 SQL 中被调用时的函数名，必须与函数实现中 udf 的实际名称一致；
-  - LANGUAGE 'C|Python'：函数编程语言，目前支持C语言和Python语言。  
+  - OR REPLACE: 如果函数已经存在，会修改已有的函数属性。
+  - function_name：标量函数未来在 SQL 中被调用时的函数名；
+  - LANGUAGE 'C|Python'：函数编程语言，目前支持C语言和Python语言。 如果这个从句忽略，编程语言是C语言 
  - library_path：如果编程语言是C，路径是包含 UDF 函数实现的动态链接库的库文件绝对路径（指的是库文件在当前客户端所在主机上的保存路径，通常是指向一个 .so 文件）。如果编程语言是Python，路径是包含 UDF 函数实现的Python文件路径。这个路径需要用英文单引号或英文双引号括起来；
  - output_type：此函数计算结果的数据类型名称；

@@ -38,7 +36,7 @@ CREATE [OR REPLACE] FUNCTION function_name AS library_path OUTPUTTYPE output_typ
 ```sql
 CREATE [OR REPLACE] AGGREGATE FUNCTION function_name AS library_path OUTPUTTYPE output_type [ BUFSIZE buffer_size ] [LANGUAGE 'C|Python'];
 ```
-
+  - OR REPLACE: 如果函数已经存在，会修改已有的函数属性。
  - function_name：聚合函数未来在 SQL 中被调用时的函数名，必须与函数实现中 udfNormalFunc 的实际名称一致；
  - LANGUAGE 'C|Python'：函数编程语言，目前支持C语言和Python语言。  
  - library_path：如果编程语言是C，路径是包含 UDF 函数实现的动态链接库的库文件绝对路径（指的是库文件在当前客户端所在主机上的保存路径，通常是指向一个 .so 文件）。如果编程语言是Python，路径是包含 UDF 函数实现的Python文件路径。这个路径需要用英文单引号或英文双引号括起来；；