Add docs for virtual columns and transform specs (#6119)

* Add docs for virtual columns and transform specs * PR Comments * PR comment

Add docs for virtual columns and transform specs (#6119)
* Add docs for virtual columns and transform specs * PR Comments * PR comment
aa660b87 · Jonathan Wei · David Lim · 2b64025e · aa660b87 · aa660b87
5 changed file
--- a/docs/content/ingestion/index.md
+++ b/docs/content/ingestion/index.md
@@ -87,7 +87,8 @@ An example dataSchema is shown below:
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
-  }
+  },
+  "transformSpec" : null
 }
 ```

@@ -97,6 +98,7 @@ An example dataSchema is shown below:
 | parser | JSON Object | Specifies how ingested data can be parsed. | yes |
 | metricsSpec | JSON Object array | A list of [aggregators](../querying/aggregations.html). | yes |
 | granularitySpec | JSON Object | Specifies how to create segments and roll up data. | yes |
+| transformSpec | JSON Object | Specifes how to filter and transform input data. See [transform specs](../ingestion/transform-spec.html).| no |

 ## Parser

@@ -244,6 +246,9 @@ for the `comment` column.
 }
 ```

+## metricsSpec
+ The `metricsSpec` is a list of [aggregators](../querying/aggregations.html). If `rollup` is false in the granularity spec, the metrics spec should be an empty list and all columns should be defined in the `dimensionsSpec` instead (without rollup, there isn't a real distinction between dimensions and metrics at ingestion time). This is optional, however.
+ 
 ## GranularitySpec

 The default granularity spec is `uniform`, and can be changed by setting the `type` field.
@@ -270,6 +275,10 @@ This spec is used to generate segments with arbitrary intervals (it tries to cre
 | rollup | boolean | rollup or not | no (default == true) |
 | intervals | string | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | no. If specified, batch ingestion tasks may skip determining partitions phase which results in faster ingestion. |

+# Transform Spec
+
+Transform specs allow Druid to transform and filter input data during ingestion. See [Transform specs](../ingestion/transform-spec.html)
+
 # IO Config

 Stream Push Ingestion: Stream push ingestion with Tranquility does not require an IO Config.

--- a/docs/content/ingestion/transform-spec.md
+++ b/docs/content/ingestion/transform-spec.md
+---
+layout: doc_page
+---
+
+# Transform Specs
+
+Transform specs allow Druid to filter and transform input data during ingestion. 
+
+## Syntax
+
+The syntax for the transformSpec is shown below:
+
+```
+"transformSpec": {
+  "transforms: <List of transforms>,
+  "filter": <filter>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|transforms|A list of [transforms](#transforms) to be applied to input rows. |no|
+|filter|A [filter](../querying/filters.html) that will be applied to input rows; only rows that pass the filter will be ingested.|no|
+
+## Transforms
+
+The `transforms` list allows the user to specify a set of column transformations to be performed on input data.
+
+Transforms allow adding new fields to input rows. Each transform has a "name" (the name of the new field) which can be referred to by DimensionSpecs, AggregatorFactories, etc.
+
+A transform behaves as a "row function", taking an entire row as input and outputting a column value.
+
+If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
+
+Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular, they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field with another field containing all nulls, which will act similarly to removing the field.
+
+Note that the transforms are applied before the filter.
+
+### Expression Transform
+
+Druid currently supports one kind of transform, the expression transform.
+
+An expression transform has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <output field name>,
+  "expression": <expr>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The output field name of the expression transform.|yes|
+|expression|An [expression](../misc/math-expr.html) that will be applied to input rows to produce a value for the transform's output field.|no|
+
+For example, the following expression transform prepends "foo" to the values of a `page` column in the input data, and creates a `fooPage` column.
+
+```
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo' + page)"
+    }
+```
+
+## Filtering
+
+The transformSpec allows Druid to filter out input rows during ingestion. A row that fails to pass the filter will not be ingested.
+
+Any of Druid's standard [filters](../querying/filters.html) can be used.
+
+Note that the filtering takes place after the transforms, so filters will operate on transformed rows and not the raw input data if transforms are present.
+
+For example, the following filter would ingest only input rows where a `country` column has the value "United States":
+
+```
+"filter": {
+  "type": "selector",
+  "dimension": "country",
+  "value": "United States"
+}
+```
\ No newline at end of file
--- a/docs/content/misc/math-expr.md
+++ b/docs/content/misc/math-expr.md
@@ -2,6 +2,12 @@
 layout: doc_page
 ---

+# Druid Expressions
+
+<div class="note info">
+This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to have significant inefficiencies.
+</div>
+ 
 This expression language supports the following operators (listed in decreasing order of precedence).

 |Operators|Description|

--- a/docs/content/querying/virtual-columns.md
+++ b/docs/content/querying/virtual-columns.md
+---
+layout: doc_page
+---
+
+# Virtual Columns
+
+Virtual columns are queryable column "views" created from a set of columns during a query. 
+
+A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.
+
+Virtual columns can be used as dimensions or as inputs to aggregators.
+
+Each Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:
+
+```
+{
+ "queryType": "scan",
+ "dataSource": "page_data",
+ "columns":[],
+ "virtualColumns": [
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo' + page)",
+      "outputType": "STRING"
+    },
+    {
+      "type": "expression",
+      "name": "tripleWordCount",
+      "expression": "wordCount * 3",
+      "outputType": "LONG"
+    }
+  ],
+ "intervals": [
+   "2013-01-01/2019-01-02"
+ ] 
+}
+```
+
+
+## Virtual Column Types
+
+### Expression virtual column
+
+The expression virtual column has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <name of the virtual column>,
+  "expression": <row expression>,
+  "outputType": <output value type of expression>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The name of the virtual column.|yes|
+|expression|An [expression](../misc/math-expr.html) that takes a row as input and outputs a value for the virtual column.|yes|
+|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, or STRING.|no, default is FLOAT|
\ No newline at end of file
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -32,6 +32,7 @@ layout: toc
    * [Stream Pull](/docs/VERSION/ingestion/stream-pull.html)
  * [Updating Existing Data](/docs/VERSION/ingestion/update-existing-data.html)
  * [Ingestion Tasks](/docs/VERSION/ingestion/tasks.html)
+  * [Transform Specs](/docs/VERSION/ingestion/transform-spec.html)
  * [FAQ](/docs/VERSION/ingestion/faq.html)

 ## Querying
@@ -60,6 +61,7 @@ layout: toc
  * [Multitenancy](/docs/VERSION/querying/multitenancy.html)
  * [Caching](/docs/VERSION/querying/caching.html)
  * [Sorting Orders](/docs/VERSION/querying/sorting-orders.html)
+  * [Virtual Columns](/docs/VERSION/querying/virtual-columns.html)

 ## Design
  * [Overview](/docs/VERSION/design/design.html)
@@ -127,5 +129,6 @@ layout: toc


 ## Misc
+  * [Druid Expressions Language](/docs/VERSION/misc/math-expr.html)
  * [Papers & Talks](/docs/VERSION/misc/papers-and-talks.html)
  * [Thanks](/thanks.html)