Commit 063a068a, author: Russell Jurney

Converted links, sans space to slash

Parent 248fba68
---
layout: default
---
---
layout: default
---
Aggregations are specifications of processing over metrics available in Druid.
Available aggregations are:
### Sum aggregators
#### `longSum` aggregator
Computes the sum of values as a 64-bit signed integer.
<code>{
"type" : "longSum",
"name" : <output_name>,
"fieldName" : <metric_name>
}</code>
`name` – output name for the summed value
`fieldName` – name of the metric column to sum over
#### `doubleSum` aggregator
Computes the sum of values as a 64-bit floating-point value. Similar to `longSum`.
<code>{
"type" : "doubleSum",
"name" : <output_name>,
"fieldName" : <metric_name>
}</code>
### Count aggregator
`count` computes the number of rows that match the filters.
<code>{
"type" : "count",
"name" : <output_name>
}</code>
### Min / Max aggregators
#### `min` aggregator
`min` computes the minimum metric value
<code>{
"type" : "min",
"name" : <output_name>,
"fieldName" : <metric_name>
}</code>
#### `max` aggregator
`max` computes the maximum metric value
<code>{
"type" : "max",
"name" : <output_name>,
"fieldName" : <metric_name>
}</code>
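The simple aggregators above share one JSON shape, so a small helper can generate them. The JavaScript sketch below is illustrative only: the helper `aggregator` and the names `events`, `latency_ms`, and the output names are hypothetical and not part of Druid.

```javascript
// Hypothetical helper: builds an aggregator spec in the shape shown above.
// `count` takes no fieldName; the other types read from a metric column.
function aggregator(type, name, fieldName) {
  const spec = { type: type, name: name };
  if (type !== "count") {
    if (!fieldName) {
      throw new Error(type + " aggregator needs a fieldName");
    }
    spec.fieldName = fieldName;
  }
  return spec;
}

// Metric and output names here are made up for illustration.
const aggregations = [
  aggregator("count", "rows"),
  aggregator("longSum", "total_events", "events"),
  aggregator("doubleSum", "total_latency", "latency_ms"),
  aggregator("min", "min_latency", "latency_ms"),
  aggregator("max", "max_latency", "latency_ms"),
];

console.log(JSON.stringify(aggregations, null, 2));
```

Building the specs in code rather than by hand makes it harder to leave a stray trailing comma or forget a `fieldName`.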
### JavaScript aggregator
Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions).
All JavaScript functions must return numerical values.
<code>{
"type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}</code>
**Example**
<code>{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}</code>
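To make the roles of the three functions concrete, the sketch below runs the example's functions over a few rows in plain JavaScript. The flow is an assumption inferred from the function names and signatures, not a description of Druid's actual execution path: `fnReset` supplies the starting value, `fnAggregate` folds in one row at a time, and `fnCombine` merges two partial results. The rows and the driver `aggregateRows` are hypothetical.

```javascript
// The three functions from the example above, as ordinary JavaScript.
const fnReset = function () { return 10; };
const fnAggregate = function (current, a, b) { return current + (Math.log(a) * b); };
const fnCombine = function (partialA, partialB) { return partialA + partialB; };

// Hypothetical driver: fold a list of rows into one partial aggregate.
function aggregateRows(rows) {
  let current = fnReset();
  for (const row of rows) {
    current = fnAggregate(current, row.x, row.y);
  }
  return current;
}

// Made-up rows, split as if two groups of rows were scanned separately.
const segmentA = [{ x: 1, y: 5 }, { x: Math.E, y: 2 }];
const segmentB = [{ x: Math.E, y: 3 }];

const partialA = aggregateRows(segmentA); // 10 + log(1)*5 + log(e)*2
const partialB = aggregateRows(segmentB); // 10 + log(e)*3
console.log(fnCombine(partialA, partialB));
```

In this sketch every partial starts from `fnReset()`, so the initial value of 10 appears once per partial in the combined result. Whether that matches Druid's behavior is not stated in this document.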
\ No newline at end of file
......@@ -4,14 +4,14 @@ layout: default
Batch Data Ingestion
====================
There are two choices for batch data ingestion to your Druid cluster: you can use the [[Indexing service]] or you can use the `HadoopDruidIndexerMain`. This page describes how to use the `HadoopDruidIndexerMain`.
There are two choices for batch data ingestion to your Druid cluster: you can use the [Indexing service](Indexing-service.html) or you can use the `HadoopDruidIndexerMain`. This page describes how to use the `HadoopDruidIndexerMain`.
Which should I use?
-------------------
The [[Indexing service]] is a node that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it provides for the encapsulation of things like the Database that is used for segment metadata and other things, so that your indexing tasks do not need to include such information. Long-term, the indexing service is going to be the preferred method of ingesting data.
The [Indexing service](Indexing service.html) is a node that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it provides for the encapsulation of things like the Database that is used for segment metadata and other things, so that your indexing tasks do not need to include such information. Long-term, the indexing service is going to be the preferred method of ingesting data.
The `HadoopDruidIndexerMain` runs hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [[Indexing service]] just yet.
The `HadoopDruidIndexerMain` runs hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing service.html) just yet.
HadoopDruidIndexer
------------------
......@@ -138,4 +138,4 @@ This is a specification of the properties that tell the job how to update metada
|password|password for db|yes|
|segmentTable|table to use in DB|yes|
These properties should parrot what you have configured for your [[Master]].
These properties should parrot what you have configured for your [Master](Master.html).
......@@ -3,7 +3,7 @@ layout: default
---
# Booting a Single Node Cluster #
[[Loading Your Data]] and [[Querying Your Data]] contain recipes to boot a small Druid cluster on localhost. Here we will boot a small cluster on EC2. You can check out the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).
[Loading Your Data](Loading Your Data.html) and [Querying Your Data](Querying Your Data.html) contain recipes to boot a small Druid cluster on localhost. Here we will boot a small cluster on EC2. You can check out the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).
The [ec2 run script](https://github.com/metamx/druid/blob/master/examples/bin/run_ec2.sh), run_ec2.sh, is located at 'examples/bin' if you have checked out the code, or at the root of the project if you've downloaded a tarball. The scripts rely on the [Amazon EC2 API Tools](http://aws.amazon.com/developertools/351), and you will need to set three environment variables:
......
......@@ -9,9 +9,9 @@ The Broker is the node to route queries to if you want to run a distributed clus
Forwarding Queries
------------------
Most druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [[Segments]] are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple nodes, and hence, the query will likely hit multiple nodes.
Most druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [Segments](Segments.html) are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple nodes, and hence, the query will likely hit multiple nodes.
To determine which nodes to forward queries to, the Broker node first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [[Compute]] and [[Realtime]] nodes and the segments they are serving. For every datasource in Zookeeper, the Broker node builds a timeline of segments and the nodes that serve them. When queries are received for a specific datasource and interval, the Broker node performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the nodes that contain data for the query. The Broker node then forwards down the query to the selected nodes.
To determine which nodes to forward queries to, the Broker node first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [Compute](Compute.html) and [Realtime](Realtime.html) nodes and the segments they are serving. For every datasource in Zookeeper, the Broker node builds a timeline of segments and the nodes that serve them. When queries are received for a specific datasource and interval, the Broker node performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the nodes that contain data for the query. The Broker node then forwards down the query to the selected nodes.
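The lookup described above can be pictured as an interval-overlap search. The following JavaScript sketch is a simplification under stated assumptions: the timeline is a plain array, and the segment ids, node names, and the `nodesForQuery` helper are hypothetical rather than Druid's actual data structures.

```javascript
// Hypothetical timeline for one datasource: each entry records a segment's
// time interval (half-open, [start, end)) and the node serving it.
const timeline = [
  { segment: "mon", start: Date.parse("2013-01-07T00:00:00Z"), end: Date.parse("2013-01-08T00:00:00Z"), node: "compute-1" },
  { segment: "tue", start: Date.parse("2013-01-08T00:00:00Z"), end: Date.parse("2013-01-09T00:00:00Z"), node: "compute-2" },
  { segment: "wed", start: Date.parse("2013-01-09T00:00:00Z"), end: Date.parse("2013-01-10T00:00:00Z"), node: "compute-1" },
];

// Return the distinct nodes whose segments overlap the query interval.
function nodesForQuery(timeline, queryStart, queryEnd) {
  const nodes = new Set();
  for (const entry of timeline) {
    // Two half-open intervals overlap when each starts before the other ends.
    if (entry.start < queryEnd && queryStart < entry.end) {
      nodes.add(entry.node);
    }
  }
  return Array.from(nodes).sort();
}

// A query spanning Monday noon to Tuesday noon touches two segments.
const hit = nodesForQuery(
  timeline,
  Date.parse("2013-01-07T12:00:00Z"),
  Date.parse("2013-01-08T12:00:00Z")
);
console.log(hit);
```

A query for more than one day reaches more than one segment, and here those segments sit on different nodes, so the query fans out to both.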
Caching
-------
......@@ -27,4 +27,4 @@ Broker nodes can be run using the `com.metamx.druid.http.BrokerMain` class.
Configuration
-------------
See [[Configuration]].
See [Configuration](Configuration.html).
---
layout: default
---
A Druid cluster consists of various node types that need to be set up depending on your use case. See our [[Design]] docs for a description of the different node types.
A Druid cluster consists of various node types that need to be set up depending on your use case. See our [Design](Design.html) docs for a description of the different node types.
Setup Scripts
-------------
......@@ -11,14 +11,14 @@ One of our community members, [housejester](https://github.com/housejester/), co
Minimum Physical Layout: Absolute Minimum
-----------------------------------------
As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying; see [[Examples]] that can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.
As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying; see [Examples](Examples.html) that can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.
Minimum Physical Layout: Experimental Testing with 4GB of RAM
-------------------------------------------------------------
This layout can be used to load some data from deep storage onto a Druid compute node for the first time. A minimal physical layout for a 1 or 2 core machine with 4GB of RAM is:
1. node1: [[Master]] + metadata service + zookeeper + [[Compute]]
1. node1: [Master](Master.html) + metadata service + zookeeper + [Compute](Compute.html)
2. transient nodes: indexer
This setup is only reasonable to prove that a configuration works. It would not be worthwhile to use this layout for performance measurement.
......@@ -30,13 +30,13 @@ Comfortable Physical Layout: Pilot Project with Multiple Machines
A minimal physical layout not constrained by cores that demonstrates parallel querying and realtime, using AWS-EC2 “small”/m1.small (one core, with 1.7GB of RAM) or larger, no realtime, is:
1. node1: [[Master]] (m1.small)
1. node1: [Master](Master.html) (m1.small)
2. node2: metadata service (m1.small)
3. node3: zookeeper (m1.small)
4. node4: [[Broker]] (m1.small or m1.medium or m1.large)
5. node5: [[Compute]] (m1.small or m1.medium or m1.large)
6. node6: [[Compute]] (m1.small or m1.medium or m1.large)
7. node7: [[Realtime]] (m1.small or m1.medium or m1.large)
4. node4: [Broker](Broker.html) (m1.small or m1.medium or m1.large)
5. node5: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
6. node6: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
7. node7: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large)
8. transient nodes: indexer
This layout naturally lends itself to adding more RAM and cores to Compute nodes, and to adding many more Compute nodes. Depending on the actual load, the Master, metadata server, and Zookeeper might need to use larger machines.
......@@ -48,18 +48,18 @@ High Availability Physical Layout
An HA layout allows full rolling restarts and heavy volume:
1. node1: [[Master]] (m1.small or m1.medium or m1.large)
2. node2: [[Master]] (m1.small or m1.medium or m1.large) (backup)
1. node1: [Master](Master.html) (m1.small or m1.medium or m1.large)
2. node2: [Master](Master.html) (m1.small or m1.medium or m1.large) (backup)
3. node3: metadata service (c1.medium or m1.large)
4. node4: metadata service (c1.medium or m1.large) (backup)
5. node5: zookeeper (c1.medium)
6. node6: zookeeper (c1.medium)
7. node7: zookeeper (c1.medium)
8. node8: [[Broker]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
9. node9: [[Broker]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
10. node10: [[Compute]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
11. node11: [[Compute]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
12. node12: [[Realtime]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
8. node8: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
9. node9: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
10. node10: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
11. node11: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
12. node12: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
13. transient nodes: indexer
Sizing for Cores and RAM
......@@ -79,7 +79,7 @@ Local disk (“ephemeral” on AWS EC2) for caching is recommended over network
Setup
-----
Setting up a cluster is essentially just firing up all of the nodes you want with the proper [[configuration]]. One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:
Setting up a cluster is essentially just firing up all of the nodes you want with the proper [configuration](configuration.html). One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:
<code>
druid.server.type=historical|realtime
......@@ -107,8 +107,8 @@ The following table shows the possible services and fully qualified class for ma
|service|main class|
|-------|----------|
|[[ Realtime ]]|com.metamx.druid.realtime.RealtimeMain|
|[[ Master ]]|com.metamx.druid.http.MasterMain|
|[[ Broker ]]|com.metamx.druid.http.BrokerMain|
|[[ Compute ]]|com.metamx.druid.http.ComputeMain|
|[ Realtime ]( Realtime .html)|com.metamx.druid.realtime.RealtimeMain|
|[ Master ]( Master .html)|com.metamx.druid.http.MasterMain|
|[ Broker ]( Broker .html)|com.metamx.druid.http.BrokerMain|
|[ Compute ]( Compute .html)|com.metamx.druid.http.ComputeMain|
......@@ -11,9 +11,9 @@ Loading and Serving Segments
Each compute node maintains a constant connection to Zookeeper and watches a configurable set of Zookeeper paths for new segment information. Compute nodes do not communicate directly with each other or with the master nodes but instead rely on Zookeeper for coordination.
The [[Master]] node is responsible for assigning new segments to compute nodes. Assignment is done by creating an ephemeral Zookeeper entry under a load queue path associated with a compute node. For more information on how the master assigns segments to compute nodes, please see [[Master]].
The [Master](Master.html) node is responsible for assigning new segments to compute nodes. Assignment is done by creating an ephemeral Zookeeper entry under a load queue path associated with a compute node. For more information on how the master assigns segments to compute nodes, please see [Master](Master.html).
When a compute node notices a new load queue entry in its load queue path, it will first check a local disk directory (cache) for information about the segment. If no information about the segment exists in the cache, the compute node will download metadata about the new segment to serve from Zookeeper. This metadata includes specifications about where the segment is located in deep storage and about how to decompress and process the segment. For more information about segment metadata and Druid segments in general, please see [[Segments]]. Once a compute node completes processing a segment, the segment is announced in Zookeeper under a served segments path associated with the node. At this point, the segment is available for querying.
When a compute node notices a new load queue entry in its load queue path, it will first check a local disk directory (cache) for information about the segment. If no information about the segment exists in the cache, the compute node will download metadata about the new segment to serve from Zookeeper. This metadata includes specifications about where the segment is located in deep storage and about how to decompress and process the segment. For more information about segment metadata and Druid segments in general, please see [Segments](Segments.html). Once a compute node completes processing a segment, the segment is announced in Zookeeper under a served segments path associated with the node. At this point, the segment is available for querying.
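The load sequence in the paragraph above can be sketched as a small state machine. Everything in this JavaScript example is a stand-in: the Maps and Set replace the local disk cache, Zookeeper, and the served-segments path, and the segment id, metadata fields, and `handleLoadQueueEntry` are hypothetical names chosen for illustration.

```javascript
// Hypothetical in-memory stand-ins for the local cache, the metadata held in
// Zookeeper, and the set of segments announced as served.
const localCache = new Map();
const zookeeperMetadata = new Map([
  ["segment-1", { location: "deep-storage/segment-1", codec: "zip" }],
]);
const servedSegments = new Set();

// Handle one load queue entry following the steps described above.
function handleLoadQueueEntry(segmentId) {
  // 1. Check the local cache first.
  let metadata = localCache.get(segmentId);
  let source = "cache";
  if (metadata === undefined) {
    // 2. Cache miss: fetch the segment metadata from Zookeeper.
    metadata = zookeeperMetadata.get(segmentId);
    if (metadata === undefined) {
      throw new Error("no metadata for " + segmentId);
    }
    localCache.set(segmentId, metadata);
    source = "zookeeper";
  }
  // 3. Downloading and processing the segment data is omitted in this sketch.
  // 4. Announce the segment so that it becomes available for querying.
  servedSegments.add(segmentId);
  return source;
}

console.log(handleLoadQueueEntry("segment-1")); // "zookeeper"
console.log(handleLoadQueueEntry("segment-1")); // "cache"
```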
Loading and Serving Segments From Cache
---------------------------------------
......@@ -25,7 +25,7 @@ The segment cache is also leveraged when a compute node is first started. On sta
Querying Segments
-----------------
Please see [[Querying]] for more information on querying compute nodes.
Please see [Querying](Querying.html) for more information on querying compute nodes.
For every query that a compute node services, it will log the query and report metrics on the time taken to run the query.
......@@ -37,4 +37,4 @@ Compute nodes can be run using the `com.metamx.druid.http.ComputeMain` class.
Configuration
-------------
See [[Configuration]].
See [Configuration](Configuration.html).
......@@ -12,4 +12,4 @@ Concepts and Terminology
- **Segment:** A collection of (internal) records that are stored and processed together.
- **Shard:** A unit of partitioning data across machines. TODO: clarify; by time or other dimensions?
- **specFile** is a specification for services in JSON format; see [[Realtime]] and [[Batch-ingestion]]
- **specFile** is a specification for services in JSON format; see [Realtime](Realtime.html) and [Batch-ingestion](Batch-ingestion.html)
---
layout: default
---
This describes the basic server configuration that is loaded by all the server processes; the same file is loaded by all. See also the json “specFile” descriptions in [[Realtime]] and [[Batch-ingestion]].
This describes the basic server configuration that is loaded by all the server processes; the same file is loaded by all. See also the json “specFile” descriptions in [Realtime](Realtime.html) and [Batch-ingestion](Batch-ingestion.html).
JVM Configuration Best Practices
================================
......@@ -80,7 +80,7 @@ Configuration groupings
### S3 Access
These properties are for connecting with S3 and using it to pull down segments. In the future, we plan on being able to use other deep storage file systems as well, like HDFS. The file system is actually only accessed by the [[Compute]], [[Realtime]] and [[Indexing service]] nodes.
These properties are for connecting with S3 and using it to pull down segments. In the future, we plan on being able to use other deep storage file systems as well, like HDFS. The file system is actually only accessed by the [Compute](Compute.html), [Realtime](Realtime.html) and [Indexing service](Indexing service.html) nodes.
|Property|Description|Default|
|--------|-----------|-------|
......@@ -91,7 +91,7 @@ These properties are for connecting with S3 and using it to pull down segments.
### JDBC connection
These properties specify the jdbc connection and other configuration around the “segments table” database. The only processes that connect to the DB with these properties are the [[Master]] and [[Indexing service]]. This is tested on MySQL.
These properties specify the jdbc connection and other configuration around the “segments table” database. The only processes that connect to the DB with these properties are the [Master](Master.html) and [Indexing service](Indexing service.html). This is tested on MySQL.
|Property|Description|Default|
|--------|-----------|-------|
......@@ -113,7 +113,7 @@ These properties specify the jdbc connection and other configuration around the
### Zk properties
See [[ZooKeeper]] for a description of these properties.
See [ZooKeeper](ZooKeeper.html) for a description of these properties.
### Service properties
......@@ -146,7 +146,7 @@ These are properties that the compute nodes use
### Emitter Properties
The Druid servers emit various metrics and alerts via something we call an [[Emitter]]. There are two emitter implementations included with the code, one that just logs to log4j and one that does POSTs of JSON events to a server. More information can be found on the [[Emitter]] page. The properties for using the logging emitter are described below.
The Druid servers emit various metrics and alerts via something we call an [Emitter](Emitter.html). There are two emitter implementations included with the code, one that just logs to log4j and one that does POSTs of JSON events to a server. More information can be found on the [Emitter](Emitter.html) page. The properties for using the logging emitter are described below.
|Property|Description|Default|
|--------|-----------|-------|
......@@ -158,5 +158,5 @@ The Druid servers emit various metrics and alerts via something we call an [[Emi
|Property|Description|Default|
|--------|-----------|-------|
|`druid.realtime.specFile`|The file with realtime specifications in it. See [[Realtime]].|none|
|`druid.realtime.specFile`|The file with realtime specifications in it. See [Realtime](Realtime.html).|none|
......@@ -5,4 +5,4 @@ If you are interested in contributing to the code, we accept [pull requests](htt
For issue tracking, we use the GitHub issue tracker. Please file an issue from the Issues tab on the GitHub page.
We also have a [[Libraries]] page that lists external libraries that people have created for working with Druid.
We also have a [Libraries](Libraries.html) page that lists external libraries that people have created for working with Druid.
......@@ -53,7 +53,7 @@ Getting data into the Druid system requires an indexing process. This gives the
- Bitmap compression
- RLE (on the roadmap, but not yet implemented)
The output of the indexing process is stored in a “deep storage” LOB store/file system (see [[Deep Storage]] for information about potential options). Data is then loaded by compute nodes by first downloading the data to their local disk and then memory mapping it before serving queries.
The output of the indexing process is stored in a “deep storage” LOB store/file system (see [Deep Storage](Deep Storage.html) for information about potential options). Data is then loaded by compute nodes by first downloading the data to their local disk and then memory mapping it before serving queries.
If a compute node dies, it will no longer serve its segments, but given that the segments are still available on the “deep storage” any other node can simply download the segment and start serving it. This means that it is possible to actually remove all compute nodes from the cluster and then re-provision them without any data loss. It also means that if the “deep storage” is not available, the nodes can continue to serve the segments they have already pulled down (i.e. the cluster goes stale, not down).
......
---
layout: default
---
A version may be declared as a release candidate if it has been deployed to a sizable production cluster. Release candidates are declared as stable after we feel fairly confident there are no major bugs in the version. Check out the [[Versioning]] section for how we describe software versions.
A version may be declared as a release candidate if it has been deployed to a sizable production cluster. Release candidates are declared as stable after we feel fairly confident there are no major bugs in the version. Check out the [Versioning](Versioning.html) section for how we describe software versions.
Release Candidate
-----------------
......
......@@ -3,7 +3,7 @@ layout: default
---
# Druid Personal Demo Cluster (DPDC)
Note, there are currently some issues with the CloudFormation. We are working through them and will update the documentation here when things work properly. In the meantime, the simplest way to get your feet wet with a cluster setup is to run through the instructions at [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness), though it is based on an older version. If you just want to get a feel for the types of data and queries that you can issue, check out [[Realtime Examples]]
Note, there are currently some issues with the CloudFormation. We are working through them and will update the documentation here when things work properly. In the meantime, the simplest way to get your feet wet with a cluster setup is to run through the instructions at [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness), though it is based on an older version. If you just want to get a feel for the types of data and queries that you can issue, check out [Realtime Examples](Realtime Examples.html)
## Introduction
To make it easy for you to get started with Druid, we created an AWS (Amazon Web Services) [CloudFormation](http://aws.amazon.com/cloudformation/) Template that allows you to create a small pre-configured Druid cluster using your own AWS account. The cluster contains a pre-loaded sample workload, the Wikipedia edit stream, and a basic query interface that gets you familiar with Druid capabilities like drill-downs and filters.
......@@ -14,7 +14,7 @@ This guide walks you through the steps to create the cluster and then how to cre
## What’s in this Druid Demo Cluster?
1. A single "Master" node. This node co-locates the [[Master]] process, the [[Broker]] process, Zookeeper, and the MySQL instance. You can read more about Druid architecture in [[Design]].
1. A single "Master" node. This node co-locates the [Master](Master.html) process, the [Broker](Broker.html) process, Zookeeper, and the MySQL instance. You can read more about Druid architecture in [Design](Design.html).
1. Three compute nodes; these compute nodes have been pre-configured to work with the Master node and should automatically load up the Wikipedia edit stream data (no specific setup is required).
......
......@@ -20,11 +20,11 @@ What does this mean? We can talk about it in terms of four general areas
## Fault Tolerance
Druid pulls segments down from [[Deep Storage]] before serving queries on top of it. This means that for the data to exist in the Druid cluster, it must exist as a local copy on a historical node. If deep storage becomes unavailable for any reason, new segments will not be loaded into the system, but the cluster will continue to operate exactly as it was when the backing store disappeared.
Druid pulls segments down from [Deep Storage](Deep Storage.html) before serving queries on top of it. This means that for the data to exist in the Druid cluster, it must exist as a local copy on a historical node. If deep storage becomes unavailable for any reason, new segments will not be loaded into the system, but the cluster will continue to operate exactly as it was when the backing store disappeared.
Impala and Shark, on the other hand, pull their data in from HDFS (or some other Hadoop FileSystem) in response to a query. This has implications for the operation of queries if you need to take HDFS down for a bit (say a software upgrade). It's possible that data that has been cached in the nodes is still available when the backing file system goes down, but I'm not sure.
This is just one example, but Druid was built to continue operating in the face of failures of any one of its various pieces. The [[Design]] describes these design decisions from the Druid side in more detail.
This is just one example, but Druid was built to continue operating in the face of failures of any one of its various pieces. The [Design](Design.html) describes these design decisions from the Druid side in more detail.
## Query Speed
......
......@@ -3,7 +3,7 @@ layout: default
---
How does Druid compare to Vertica?
Vertica is similar to ParAccel/Redshift ([[Druid-vs-Redshift]]) described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.
Vertica is similar to ParAccel/Redshift ([Druid-vs-Redshift](Druid-vs-Redshift.html)) described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.
The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a “projection” system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
......
......@@ -34,7 +34,7 @@ Clone Druid and build it:
Twitter Example
---------------
For a full tutorial based on the twitter example, check out this [[Twitter Tutorial]].
For a full tutorial based on the twitter example, check out this [Twitter Tutorial](Twitter Tutorial.html).
This example uses a feature of Twitter that allows for sampling of its stream. We sample the Twitter stream via our [TwitterSpritzerFirehoseFactory](https://github.com/metamx/druid/blob/master/examples/src/main/java/druid/examples/twitter/TwitterSpritzerFirehoseFactory.java) class and use it to simulate the kinds of data you might ingest into Druid. Then, with the client part, the sample shows what kinds of analytics explorations you can do during and after the data is loaded.
......@@ -48,7 +48,7 @@ This Example uses a feature of Twitter that allows for sampling of it’s stream
### What you’ll do
See [[Tutorial]]
See [Tutorial](Tutorial.html)
Rand Example
------------
......
......@@ -28,11 +28,11 @@ This firehose ingests events from a predefined list of S3 objects.
#### TwitterSpritzerFirehose
See [[Examples]]. This firehose connects directly to the twitter spritzer data stream.
See [Examples](Examples.html). This firehose connects directly to the twitter spritzer data stream.
#### RandomFirehose
See [[Examples]]. This firehose creates a stream of random numbers.
See [Examples](Examples.html). This firehose creates a stream of random numbers.
#### RabbitMqFirehose
......
......@@ -11,7 +11,7 @@ Simple granularities are specified as a string and bucket timestamps by their UT
Supported granularity strings are: `all`, `none`, `minute`, `fifteen_minute`, `thirty_minute`, `hour` and `day`
\* **`all`** buckets everything into a single bucket
\* **`none`** does not bucket data (it actually uses the granularity of the index - minimum here is `none` which means millisecond granularity). Using `none` in a [[timeseries query|TimeSeriesQuery]] is currently not recommended (the system will try to generate 0 values for all milliseconds that didn’t exist, which is often a lot).
\* **`none`** does not bucket data (it actually uses the granularity of the index - minimum here is `none` which means millisecond granularity). Using `none` in a [timeseries query|TimeSeriesQuery](timeseries query|TimeSeriesQuery.html) is currently not recommended (the system will try to generate 0 values for all milliseconds that didn’t exist, which is often a lot).
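As a rough illustration of how the simple granularity strings bucket timestamps by UTC time, the JavaScript sketch below floors an epoch-millisecond timestamp to the start of its bucket. The helper `bucketStart` and the choice to represent the single `all` bucket as `null` are assumptions made for this example, not Druid's implementation.

```javascript
// Bucket widths in milliseconds for the simple granularity strings.
const MINUTE = 60 * 1000;
const WIDTHS = {
  minute: MINUTE,
  fifteen_minute: 15 * MINUTE,
  thirty_minute: 30 * MINUTE,
  hour: 60 * MINUTE,
  day: 24 * 60 * MINUTE,
};

// Hypothetical helper: return the start of the bucket a timestamp falls in.
// Timestamps are epoch milliseconds, so flooring aligns buckets to UTC.
function bucketStart(timestampMs, granularity) {
  if (granularity === "all") {
    return null; // one bucket for everything; null is only a placeholder here
  }
  if (granularity === "none") {
    return timestampMs; // millisecond granularity: no truncation
  }
  const width = WIDTHS[granularity];
  if (width === undefined) {
    throw new Error("unsupported granularity: " + granularity);
  }
  return Math.floor(timestampMs / width) * width;
}

const t = Date.parse("2013-05-09T18:24:37.123Z");
console.log(new Date(bucketStart(t, "fifteen_minute")).toISOString()); // 2013-05-09T18:15:00.000Z
console.log(new Date(bucketStart(t, "day")).toISOString()); // 2013-05-09T00:00:00.000Z
```

The sketch also hints at why `none` is costly in a timeseries query: every distinct millisecond becomes its own bucket.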
### Duration Granularities
......
......@@ -93,12 +93,12 @@ There are 9 main parts to a groupBy query:
|queryType|This String should always be “groupBy”; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|dimensions|A JSON list of dimensions to do the groupBy over|yes|
|orderBy|See [[OrderBy]].|no|
|having|See [[Having]].|no|
|granularity|Defines the granularity of the query. See [[Granularities]]|yes|
|filter|See [[Filters]]|no|
|aggregations|See [[Aggregations]]|yes|
|postAggregations|See [[Post Aggregations]]|no|
|orderBy|See [OrderBy](OrderBy.html).|no|
|having|See [Having](Having.html).|no|
|granularity|Defines the granularity of the query. See [Granularities](Granularities.html)|yes|
|filter|See [Filters](Filters.html)|no|
|aggregations|See [Aggregations](Aggregations.html)|yes|
|postAggregations|See [Post Aggregations](Post Aggregations.html)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
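
The parts in the table above can be assembled into a query body along the following lines. This is only a sketch: the helper `build_group_by_query`, the data source name, the dimension, and the interval string are made up for illustration, `orderBy` and `having` are left out, and representing `intervals` as a list of ISO-8601 interval strings is an assumption.

```python
import json


def build_group_by_query(data_source, dimensions, granularity, aggregations,
                         intervals, filter_spec=None, post_aggregations=None,
                         context=None):
    """Assemble a groupBy query dict from the required and optional parts."""
    query = {
        "queryType": "groupBy",  # always "groupBy" for this query type
        "dataSource": data_source,
        "dimensions": list(dimensions),
        "granularity": granularity,
        "aggregations": list(aggregations),
        "intervals": list(intervals),
    }
    # Optional parts are only included when the caller supplies them.
    optional = {
        "filter": filter_spec,
        "postAggregations": post_aggregations,
        "context": context,
    }
    query.update({key: value for key, value in optional.items() if value is not None})
    return query


query = build_group_by_query(
    data_source="sample_datasource",  # hypothetical data source name
    dimensions=["gender"],            # hypothetical dimension
    granularity="day",
    aggregations=[{"type": "count", "name": "rows"}],
    intervals=["2010-01-01T00:00/2010-01-02T00:00"],  # assumed interval format
)
print(json.dumps(query, indent=2))
```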
......
......@@ -3,7 +3,7 @@ layout: default
---
Druid is an open-source analytics datastore designed for realtime, exploratory queries on large-scale data sets (hundreds of billions of entries, hundreds of terabytes of data). Druid provides cost-effective, always-on, realtime data ingestion and arbitrary data exploration.
- Check out some [[Examples]]
- Check out some [Examples](Examples.html)
- Try out Druid with our Getting Started [Tutorial](https://github.com/metamx/druid/wiki/Tutorial%3A-A-First-Look-at-Druid)
- Learn more by reading the [White Paper](http://static.druid.io/docs/druid.pdf)
......@@ -19,7 +19,7 @@ The first one is the joy that everyone feels the first time they get Hadoop runn
Druid is especially useful if you are summarizing your data sets and then querying the summarizations. If you put your summarizations into Druid, you will get quick queryability out of a system that you can be confident will scale up as your data volumes increase. Deployments have scaled up to 2TB of data per hour at peak ingested and aggregated in real-time.
We have more details about the general design of the system and why you might want to use it in our [White Paper](http://static.druid.io/docs/druid.pdf) or in our [[Design]] doc.
We have more details about the general design of the system and why you might want to use it in our [White Paper](http://static.druid.io/docs/druid.pdf) or in our [Design](Design.html) doc.
The data store world is vast, confusing and constantly in flux. This page is meant to help potential evaluators decide whether Druid is a good fit for the problem one needs to solve. If anything about it is incorrect, please provide that feedback on the mailing list or via some other means and we will fix this page.
......@@ -38,11 +38,11 @@ The data store world is vast, confusing and constantly in flux. This page is mea
\* Downtime is no big deal
#### Druid vs…
\* [[Druid-vs-Impala-or-Shark]]
\* [[Druid-vs-Redshift]]
\* [[Druid-vs-Vertica]]
\* [[Druid-vs-Cassandra]]
\* [[Druid-vs-Hadoop]]
\* [Druid-vs-Impala-or-Shark](Druid-vs-Impala-or-Shark.html)
\* [Druid-vs-Redshift](Druid-vs-Redshift.html)
\* [Druid-vs-Vertica](Druid-vs-Vertica.html)
\* [Druid-vs-Cassandra](Druid-vs-Cassandra.html)
\* [Druid-vs-Hadoop](Druid-vs-Hadoop.html)
Key Features
------------
......
......@@ -3,7 +3,7 @@ layout: default
---
Disclaimer: We are still in the process of finalizing the indexing service and these configs are prone to change at any time. We will announce when we feel the indexing service and the configurations described are stable.
The indexing service is a distributed task/job queue. It accepts requests in the form of [[Tasks]] and executes those tasks across a set of worker nodes. Worker capacity can be automatically adjusted based on the number of tasks pending in the system. The indexing service is highly available, has built in retry logic, and can backup per task logs in deep storage.
The indexing service is a distributed task/job queue. It accepts requests in the form of [Tasks](Tasks.html) and executes those tasks across a set of worker nodes. Worker capacity can be automatically adjusted based on the number of tasks pending in the system. The indexing service is highly available, has built in retry logic, and can backup per task logs in deep storage.
The indexing service is composed of two main components, a coordinator node that manages task distribution and worker capacity, and worker nodes that execute tasks in separate JVMs.
......@@ -45,7 +45,7 @@ The coordinator also exposes a simple UI to show what tasks are currently runnin
#### Task Execution
The coordinator retrieves worker setup metadata from the Druid [[MySQL]] config table. This metadata contains information about the version of workers to create, the maximum and minimum number of workers in the cluster at one time, and additional information required to automatically create workers.
The coordinator retrieves worker setup metadata from the Druid [MySQL](MySQL.html) config table. This metadata contains information about the version of workers to create, the maximum and minimum number of workers in the cluster at one time, and additional information required to automatically create workers.
Tasks are assigned to workers by creating entries under specific /tasks paths associated with a worker, similar to how the Druid master node assigns segments to compute nodes. See [Worker Configuration](Indexing-Service#configuration-1). Once a worker picks up a task, it deletes the task entry and announces a task status under a /status path associated with the worker. Tasks are submitted to a worker until the worker hits capacity. If all workers in a cluster are at capacity, the indexer coordinator node automatically creates new worker resources.
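
The capacity behaviour described above can be modelled very roughly as follows. This is a toy in-memory simulation, not the indexing service's actual code: the `Worker` class, the `assign` function, and the naming of new workers are all hypothetical, and the ZooKeeper /tasks and /status paths are not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy stand-in for a worker node with a fixed task capacity."""
    name: str
    capacity: int
    tasks: list = field(default_factory=list)

    def has_room(self):
        return len(self.tasks) < self.capacity


def assign(task, workers, max_workers, capacity=2):
    """Give the task to the first worker with room, creating a worker if allowed.

    Returns the worker that received the task, or None if the task must stay
    pending because every worker is full and the maximum worker count is reached.
    """
    for worker in workers:
        if worker.has_room():
            worker.tasks.append(task)
            return worker
    if len(workers) < max_workers:
        worker = Worker(f"worker-{len(workers)}", capacity)
        workers.append(worker)
        worker.tasks.append(task)
        return worker
    return None


workers = [Worker("worker-0", capacity=2)]
for task in ["t0", "t1", "t2", "t3", "t4"]:
    target = assign(task, workers, max_workers=2)
    print(task, "->", target.name if target else "pending")
```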
......
......@@ -3,7 +3,7 @@ layout: default
---
Once you have a realtime node working, it is time to load your own data to see how Druid performs.
Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a [[Firehose]].
Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a [Firehose](Firehose.html).
## Create Config Directories ##
Each type of node needs its own config file and directory, so create them as subdirectories under the druid directory.
......@@ -17,7 +17,7 @@ mkdir config/broker
## Loading Data with Kafka ##
[KafkaFirehoseFactory](https://github.com/metamx/druid/blob/master/realtime/src/main/java/com/metamx/druid/realtime/firehose/KafkaFirehoseFactory.java) is how druid communicates with Kafka. Using this [[Firehose]] with the right configuration, we can import data into Druid in realtime without writing any code. To load data to a realtime node via Kafka, we'll first need to initialize Zookeeper and Kafka, and then configure and initialize a [[Realtime]] node.
[KafkaFirehoseFactory](https://github.com/metamx/druid/blob/master/realtime/src/main/java/com/metamx/druid/realtime/firehose/KafkaFirehoseFactory.java) is how druid communicates with Kafka. Using this [Firehose](Firehose.html) with the right configuration, we can import data into Druid in realtime without writing any code. To load data to a realtime node via Kafka, we'll first need to initialize Zookeeper and Kafka, and then configure and initialize a [Realtime](Realtime.html) node.
### Booting Kafka ###
......@@ -165,7 +165,7 @@ curl -X POST "http://localhost:8080/druid/v2/?pretty" \
}
} ]
```
Now you're ready for [[Querying Your Data]]!
Now you're ready for [Querying Your Data](Querying Your Data.html)!
## Loading Data with the HadoopDruidIndexer ##
......@@ -184,7 +184,7 @@ mysql -u root
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE database druid;
```
The [[Master]] node will create the tables it needs based on its configuration.
The [Master](Master.html) node will create the tables it needs based on its configuration.
### Make sure you have ZooKeeper Running ###
......@@ -206,7 +206,7 @@ cd ..
```
### Launch a Master Node ###
If you've already setup a realtime node, be aware that although you can run multiple node types on one physical computer, you must assign them unique ports. Having used 8080 for the [[Realtime]] node, we use 8081 for the [[Master]].
If you've already setup a realtime node, be aware that although you can run multiple node types on one physical computer, you must assign them unique ports. Having used 8080 for the [Realtime](Realtime.html) node, we use 8081 for the [Master](Master.html).
1. Setup a configuration file called config/master/runtime.properties similar to:
```bash
......@@ -251,7 +251,7 @@ druid.paths.indexCache=/tmp/druid/indexCache
# Path on local FS for storage of segment metadata; dir will be created if needed
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
```
2. Launch the [[Master]] node
2. Launch the [Master](Master.html) node
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath lib/*:config/master \
......@@ -324,7 +324,7 @@ We can use the same records we have been, in a file called records.json:
### Run the Hadoop Job ###
Now its time to run the Hadoop [[Batch-ingestion]] job, HadoopDruidIndexer, which will fill a historical [[Compute]] node with data. First we'll need to configure the job.
Now its time to run the Hadoop [Batch-ingestion](Batch-ingestion.html) job, HadoopDruidIndexer, which will fill a historical [Compute](Compute.html) node with data. First we'll need to configure the job.
1. Create a config called batchConfig.json similar to:
```json
......@@ -367,4 +367,4 @@ Now its time to run the Hadoop [[Batch-ingestion]] job, HadoopDruidIndexer, whic
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=realtime.spec -classpath lib/* com.metamx.druid.indexer.HadoopDruidIndexerMain batchConfig.json
```
You can now move on to [[Querying Your Data]]!
\ No newline at end of file
You can now move on to [Querying Your Data](Querying Your Data.html)!
\ No newline at end of file
......@@ -15,7 +15,7 @@ Rules
Segments are loaded and dropped from the cluster based on a set of rules. Rules indicate how segments should be assigned to different compute node tiers and how many replicants of a segment should exist in each tier. Rules may also indicate when segments should be dropped entirely from the cluster. The master loads a set of rules from the database. Rules may be specific to a certain datasource and/or a default set of rules can be configured. Rules are read in order and hence the ordering of rules is important. The master will cycle through all available segments and match each segment with the first rule that applies. Each segment may only match a single rule.
For more information on rules, see [[Rule Configuration.md]].
For more information on rules, see [Rule Configuration](Rule Configuration.html).
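
The first-match behaviour described above can be sketched as below. The rule names, the `matches` predicates, the tier names, and the `age_days` field are invented for illustration and do not reflect Druid's real rule format; the point is only that a segment stops at the first rule that applies, so the order of the list matters.

```python
def first_matching_rule(segment, rules):
    """Return the first rule whose predicate accepts the segment, else None."""
    for rule in rules:
        if rule["matches"](segment):
            return rule
    return None


# Hypothetical rules, most specific first. A 10-day-old segment satisfies both
# of the first two predicates, but only the first one is ever applied.
rules = [
    {"name": "load-recent", "action": "load", "tier": "hot", "replicants": 2,
     "matches": lambda s: s["age_days"] <= 30},
    {"name": "load-older", "action": "load", "tier": "cold", "replicants": 1,
     "matches": lambda s: s["age_days"] <= 365},
    {"name": "drop-rest", "action": "drop",
     "matches": lambda s: True},
]

for age in (10, 100, 1000):
    rule = first_matching_rule({"dataSource": "sample_datasource", "age_days": age}, rules)
    print(age, "->", rule["name"])
```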
Cleaning Up Segments
--------------------
......@@ -103,4 +103,4 @@ Master nodes can be run using the `com.metamx.druid.http.MasterMain` class.
Configuration
-------------
See [[Configuration]].
See [Configuration](Configuration.html).
......@@ -8,7 +8,7 @@ Segments Table
This is dictated by the `druid.database.segmentTable` property (Note that these properties are going to change in the next stable version after 0.4.12).
This table stores metadata about the segments that are available in the system. The table is polled by the [[Master]] to determine the set of segments that should be available for querying in the system. The table has two main functional columns, the other columns are for indexing purposes.
This table stores metadata about the segments that are available in the system. The table is polled by the [Master](Master.html) to determine the set of segments that should be available for querying in the system. The table has two main functional columns, the other columns are for indexing purposes.
The `used` column is a boolean “tombstone”. A 1 means that the segment should be “used” by the cluster (i.e. it should be loaded and available for requests). A 0 means that the segment should not be actively loaded into the cluster. We do this as a means of removing segments from the cluster without actually removing their metadata (which allows for simpler rolling back if that is ever an issue).
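
A minimal in-memory sketch of the tombstone idea follows. The rows, the segment ids, and both helper functions are hypothetical; only the `used` column comes from the description above. Marking a segment unused keeps its metadata row, so the change is easy to roll back.

```python
# Hypothetical rows standing in for the segments table.
segments = [
    {"id": "seg-a", "used": 1},
    {"id": "seg-b", "used": 0},
    {"id": "seg-c", "used": 1},
]


def segments_to_serve(rows):
    """Ids of segments that should be loaded and available for requests."""
    return [row["id"] for row in rows if row["used"] == 1]


def mark_unused(rows, segment_id):
    """Flip the tombstone for one segment without deleting its metadata."""
    for row in rows:
        if row["id"] == segment_id:
            row["used"] = 0


print(segments_to_serve(segments))
mark_unused(segments, "seg-a")
print(segments_to_serve(segments))
```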
......@@ -34,7 +34,7 @@ Note that the format of this blob can and will change from time-to-time.
Rule Table
----------
The rule table is used to store the various rules about where segments should land. These rules are used by the [[Master]] when making segment (re-)allocation decisions about the cluster.
The rule table is used to store the various rules about where segments should land. These rules are used by the [Master](Master.html) when making segment (re-)allocation decisions about the cluster.
Config Table
------------
......@@ -44,4 +44,4 @@ The config table is used to store runtime configuration objects. We do not have
Task-related Tables
-------------------
There are also a number of tables created and used by the [[Indexing Service]] in the course of its work.
There are also a number of tables created and used by the [Indexing Service](Indexing Service.html) in the course of its work.
......@@ -22,9 +22,9 @@ The grammar for an arithmetic post aggregation is:
### Field accessor post-aggregator
This returns the value produced by the specified [[aggregator|Aggregations]].
This returns the value produced by the specified [aggregator](Aggregations.html).
`fieldName` refers to the output name of the aggregator given in the [[aggregations|Aggregations]] portion of the query.
`fieldName` refers to the output name of the aggregator given in the [aggregations](Aggregations.html) portion of the query.
<code>field_accessor : {
"type" : "fieldAccess",
......
......@@ -3,7 +3,7 @@ layout: default
---
# Setup #
Before we start querying druid, we're going to finish setting up a complete cluster on localhost. In [[Loading Your Data]] we setup a [[Realtime]], [[Compute]] and [[Master]] node. If you've already completed that tutorial, you need only follow the directions for 'Booting a Broker Node'.
Before we start querying druid, we're going to finish setting up a complete cluster on localhost. In [Loading Your Data](Loading Your Data.html) we setup a [Realtime](Realtime.html), [Compute](Compute.html) and [Master](Master.html) node. If you've already completed that tutorial, you need only follow the directions for 'Booting a Broker Node'.
## Booting a Broker Node ##
......@@ -98,11 +98,11 @@ com.metamx.druid.http.ComputeMain
# Querying Your Data #
Now that we have a complete cluster set up on localhost, we need to load data. To do so, refer to [[Loading Your Data]]. Having done that, it's time to query our data! For a complete specification of queries, see [[Querying]].
Now that we have a complete cluster set up on localhost, we need to load data. To do so, refer to [Loading Your Data](Loading Your Data.html). Having done that, it's time to query our data! For a complete specification of queries, see [Querying](Querying.html).
## Querying Different Nodes ##
As a shared-nothing system, there are three ways to query druid, against the [[Realtime]], [[Compute]] or [[Broker]] node. Querying a Realtime node returns only realtime data, querying a compute node returns only historical segments. Querying the broker will query both realtime and compute segments and compose an overall result for the query. This is the normal mode of operation for queries in druid.
As a shared-nothing system, there are three ways to query druid, against the [Realtime](Realtime.html), [Compute](Compute.html) or [Broker](Broker.html) node. Querying a Realtime node returns only realtime data, querying a compute node returns only historical segments. Querying the broker will query both realtime and compute segments and compose an overall result for the query. This is the normal mode of operation for queries in druid.
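
Since each node type exposes the same query interface, switching targets is mostly a matter of changing the host and port. The sketch below builds, but does not send, a POST request using Python's standard library. The helper `build_query_request`, the data source name, and the example query are assumptions; the `/druid/v2/?pretty` path and port 8080 follow the curl examples used for the realtime node in this tutorial, and other nodes will listen on whatever ports you configured.

```python
import json
import urllib.request


def build_query_request(query, host="localhost", port=8080):
    """Build a POST request carrying a JSON query for the given node."""
    url = f"http://{host}:{port}/druid/v2/?pretty"
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical minimal query; "sample_datasource" is a placeholder name.
query = {"queryType": "timeBoundary", "dataSource": "sample_datasource"}
request = build_query_request(query, port=8080)
print(request.get_method(), request.full_url)

# To actually send it to a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```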
### Construct a Query ###
......@@ -183,7 +183,7 @@ Now that we know what nodes can be queried (although you should usually use the
## Querying Against the realtime.spec ##
How are we to know what queries we can run? Although [[Querying]] is a helpful index, to get a handle on querying our data we need to look at our [[Realtime]] node's realtime.spec file:
How are we to know what queries we can run? Although [Querying](Querying.html) is a helpful index, to get a handle on querying our data we need to look at our [Realtime](Realtime.html) node's realtime.spec file:
```json
[{
......@@ -225,7 +225,7 @@ Our dataSource tells us the name of the relation/table, or 'source of data', to
### aggregations ###
Note the [[Aggregations]] in our query:
Note the [Aggregations](Aggregations.html) in our query:
```json
"aggregations": [
......@@ -244,7 +244,7 @@ this matches up to the aggregators in the schema of our realtime.spec!
### dimensions ###
Let's look back at our actual records (from [[Loading Your Data]]):
Let's look back at our actual records (from [Loading Your Data](Loading Your Data.html)):
```json
{"utcdt": "2010-01-01T01:01:01", "wp": 1000, "gender": "male", "age": 100}
......@@ -359,8 +359,8 @@ Which gets us just people aged 40:
} ]
```
Check out [[Filters]] for more.
Check out [Filters](Filters.html) for more.
## Learn More ##
You can learn more about querying at [[Querying]]! Now check out [[Booting a production cluster]]!
\ No newline at end of file
You can learn more about querying at [Querying](Querying.html)! Now check out [Booting a production cluster](Booting a production cluster.html)!
\ No newline at end of file
......@@ -4,7 +4,7 @@ layout: default
Querying
========
Queries are made using an HTTP REST style request to a [[Broker]], [[Compute]], or [[Realtime]] node. The query is expressed in JSON and each of these node types expose the same REST query interface.
Queries are made using an HTTP REST style request to a [Broker](Broker.html), [Compute](Compute.html), or [Realtime](Realtime.html) node. The query is expressed in JSON and each of these node types expose the same REST query interface.
We start by describing an example query with additional comments that mention possible variations. Query operators are also summarized in a table below.
......@@ -55,7 +55,7 @@ The dataSource JSON field shown next identifies where to apply the query. In thi
\`\`\`javascript
"dataSource": "randSeq",
\`\`\`
The granularity JSON field specifies the bucket size for values. It could be a built-in time interval like “second”, “minute”, “fifteen\_minute”, “thirty\_minute”, “hour” or “day”. It can also be an expression like `{"type": "period", "period":"PT6m"}` meaning “6 minute buckets”. See [[Granularities]] for more information on the different options for this field. In this example, it is set to the special value “all” which means [bucket all data points together into the same time bucket]()
The granularity JSON field specifies the bucket size for values. It could be a built-in time interval like “second”, “minute”, “fifteen\_minute”, “thirty\_minute”, “hour” or “day”. It can also be an expression like `{"type": "period", "period":"PT6m"}` meaning “6 minute buckets”. See [Granularities](Granularities.html) for more information on the different options for this field. In this example, it is set to the special value “all” which means [bucket all data points together into the same time bucket]()
\`\`\`javascript
"granularity": "all",
\`\`\`
......@@ -63,7 +63,7 @@ The dimensions JSON field value is an array of zero or more fields as defined in
\`\`\`javascript
"dimensions": [],
\`\`\`
A groupBy also requires the JSON field “aggregations” (See [[Aggregations]]), which are applied to the column specified by fieldName and the output of the aggregation will be named according to the value in the “name” field:
A groupBy also requires the JSON field “aggregations” (See [Aggregations](Aggregations.html)), which are applied to the column specified by fieldName and the output of the aggregation will be named according to the value in the “name” field:
\`\`\`javascript
"aggregations": [
{ "type": "count", "name": "rows" },
......@@ -71,7 +71,7 @@ A groupBy also requires the JSON field “aggregations” (See [[Aggregations]])
{ "type": "doubleSum", "fieldName": "outColumn", "name": "randomNumberSum" }
],
\`\`\`
You can also specify postAggregations, which are applied after data has been aggregated for the current granularity and dimensions bucket. See [[Post Aggregations]] for a detailed description. In the rand example, an arithmetic type operation (division, as specified by “fn”) is performed with the result “name” of “avg\_random”. The “fields” field specifies the inputs from the aggregation stage to this expression. Note that identifiers corresponding to “name” JSON field inside the type “fieldAccess” are required but not used outside this expression, so they are prefixed with “dummy” for clarity:
You can also specify postAggregations, which are applied after data has been aggregated for the current granularity and dimensions bucket. See [Post Aggregations](Post Aggregations.html) for a detailed description. In the rand example, an arithmetic type operation (division, as specified by “fn”) is performed with the result “name” of “avg\_random”. The “fields” field specifies the inputs from the aggregation stage to this expression. Note that identifiers corresponding to “name” JSON field inside the type “fieldAccess” are required but not used outside this expression, so they are prefixed with “dummy” for clarity:
\`\`\`javascript
"postAggregations": [{
"type": "arithmetic",
......@@ -99,11 +99,11 @@ The following table summarizes query properties.
|timeseries, groupBy, search, timeBoundary|dataSource|query is applied to this data source|yes|
|timeseries, groupBy, search|intervals|range of time series to include in query|yes|
|timeseries, groupBy, search, timeBoundary|context|A key-value map that alters some of the query's behavior. It is primarily used for debugging; for example, if you include `"bySegment":true` in the map, you will get results associated with the data segment they came from.|no|
|timeseries, groupBy, search|filter|Specifies the filter (the “WHERE” clause in SQL) for the query. See [[Filters]]|no|
|timeseries, groupBy, search|granularity|the timestamp granularity to bucket results into (i.e. “hour”). See [[Granularities]] for more information.|no|
|timeseries, groupBy, search|filter|Specifies the filter (the “WHERE” clause in SQL) for the query. See [Filters](Filters.html)|no|
|timeseries, groupBy, search|granularity|the timestamp granularity to bucket results into (i.e. “hour”). See [Granularities](Granularities.html) for more information.|no|
|groupBy|dimensions|constrains the groupings; if empty, then one value per time granularity bucket|yes|
|timeseries, groupBy|aggregations|aggregations that combine values in a bucket. See [[Aggregations]].|yes|
|timeseries, groupBy|postAggregations|aggregations of aggregations. See [[Post Aggregations]].|yes|
|timeseries, groupBy|aggregations|aggregations that combine values in a bucket. See [Aggregations](Aggregations.html).|yes|
|timeseries, groupBy|postAggregations|aggregations of aggregations. See [Post Aggregations](Post Aggregations.html).|yes|
|search|limit|maximum number of results (default is 1000), a system-level maximum can also be set via `com.metamx.query.search.maxSearchLimit`|no|
|search|searchDimensions|Dimensions to apply the search query to. If not specified, it will search through all dimensions.|no|
|search|query|The query portion of the search query. This is essentially a predicate that specifies if something matches.|yes|
......@@ -111,4 +111,4 @@ The following table summarizes query properties.
Additional Information about Query Types
----------------------------------------
[[TimeseriesQuery]]
[TimeseriesQuery](TimeseriesQuery.html)
......@@ -4,7 +4,7 @@ layout: default
Realtime
========
Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they’ve collected over some span of time and hand these segments off to [[Compute]] nodes.
Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they’ve collected over some span of time and hand these segments off to [Compute](Compute.html) nodes.
Running
-------
......@@ -21,7 +21,7 @@ The segment propagation diagram for real-time data ingestion can be seen below:
Configuration
-------------
Realtime nodes take a mix of base server configuration and spec files that describe how to connect, process and expose the realtime feed. See [[Configuration]] for information about general server configuration.
Realtime nodes take a mix of base server configuration and spec files that describe how to connect, process and expose the realtime feed. See [Configuration](Configuration.html) for information about general server configuration.
### Realtime “specFile”
......@@ -62,7 +62,7 @@ There are four parts to a realtime stream specification, `schema`, `config`, `fi
#### Schema
This describes the data schema for the output Druid segment. More information about concepts in Druid and querying can be found at [[Concepts-and-Terminology]] and [[Querying]].
This describes the data schema for the output Druid segment. More information about concepts in Druid and querying can be found at [Concepts-and-Terminology](Concepts-and-Terminology.html) and [Querying](Querying.html).
|Field|Type|Description|Required|
|-----|----|-----------|--------|
......@@ -83,11 +83,11 @@ This provides configuration for the data processing portion of the realtime stre
### Firehose
See [[Firehose]].
See [Firehose](Firehose.html).
### Plumber
See [[Plumber]]
See [Plumber](Plumber.html)
Constraints
-----------
......
......@@ -30,11 +30,11 @@ There are several main parts to a search query:
|--------|-----------|---------|
|queryType|This String should always be “search”; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|granularity|Defines the granularity of the query. See [[Granularities]]|yes|
|filter|See [[Filters]]|no|
|granularity|Defines the granularity of the query. See [Granularities](Granularities.html)|yes|
|filter|See [Filters](Filters.html)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|searchDimensions|The dimensions to run the search over. Excluding this means the search is run over all dimensions.|no|
|query|See [[SearchQuerySpec]].|yes|
|query|See [SearchQuerySpec](SearchQuerySpec.html).|yes|
|sort|How the results of the search should be sorted. Two possible types here are “lexicographic” and “strlen”.|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
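
Putting the parts from the table together, a search query body might look roughly like the one below. Treat it as a sketch: the data source and dimension names are placeholders, and the shapes used for `query` (an `insensitive_contains` spec), `sort`, and `intervals` are assumptions that should be checked against the SearchQuerySpec page.

```python
import json

search_query = {
    "queryType": "search",              # always "search" for this query type
    "dataSource": "sample_datasource",  # placeholder data source name
    "granularity": "day",
    "searchDimensions": ["dim1", "dim2"],  # placeholder dimensions; omit to search all
    "query": {"type": "insensitive_contains", "value": "Ke"},  # assumed spec shape
    "sort": {"type": "lexicographic"},                          # or "strlen"
    "intervals": ["2013-01-01T00:00/2013-01-03T00:00"],        # assumed interval format
}

print(json.dumps(search_query, indent=2))
```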
......
......@@ -4,7 +4,7 @@ layout: default
Segments
========
Segments are the fundamental structure to store data in Druid. [[Compute]] and [[Realtime]] nodes load and serve segments for querying. To construct segments, Druid will always shard data by a time partition. Data may be further sharded based on dimension cardinality and row count.
Segments are the fundamental structure to store data in Druid. [Compute](Compute.html) and [Realtime](Realtime.html) nodes load and serve segments for querying. To construct segments, Druid will always shard data by a time partition. Data may be further sharded based on dimension cardinality and row count.
The latest Druid segment version is `v9`.
......
......@@ -22,12 +22,12 @@ We started with a minimal CentOS installation but you can use any other compatib
1. A Kafka Broker
1. A single-node Zookeeper ensemble
1. A single-node Riak-CS cluster
1. A Druid [[Master]]
1. A Druid [[Broker]]
1. A Druid [[Compute]]
1. A Druid [[Realtime]]
1. A Druid [Master](Master.html)
1. A Druid [Broker](Broker.html)
1. A Druid [Compute](Compute.html)
1. A Druid [Realtime](Realtime.html)
This just walks through getting the relevant software installed and running. You will then need to configure the [[Realtime]] node to take in your data.
This just walks through getting the relevant software installed and running. You will then need to configure the [Realtime](Realtime.html) node to take in your data.
### Configure System
......
......@@ -84,10 +84,10 @@ There are 7 main parts to a timeseries query:
|--------|-----------|---------|
|queryType|This String should always be “timeseries”; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|granularity|Defines the granularity of the query. See [[Granularities]]|yes|
|filter|See [[Filters]]|no|
|aggregations|See [[Aggregations]]|yes|
|postAggregations|See [[Post Aggregations]]|no|
|granularity|Defines the granularity of the query. See [Granularities](Granularities.html)|yes|
|filter|See [Filters](Filters.html)|no|
|aggregations|See [Aggregations](Aggregations.html)|yes|
|postAggregations|See [Post Aggregations](Post Aggregations.html)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
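
A quick way to catch malformed queries before sending them is to check the required parts from the table above. The checker below is a hypothetical client-side helper, not something Druid provides, and it only verifies that required keys are present.

```python
# Parts marked "yes" in the required column of the table above.
REQUIRED_FIELDS = ("queryType", "dataSource", "granularity", "aggregations", "intervals")


def check_timeseries_query(query):
    """Return a list of problems found in a timeseries query dict (empty if none)."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in query]
    if "queryType" in query and query["queryType"] != "timeseries":
        problems.append('queryType must be "timeseries"')
    return problems


good = {
    "queryType": "timeseries",
    "dataSource": "sample_datasource",  # placeholder data source name
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "rows"}],
    "intervals": ["2013-01-01T00:00/2013-01-02T00:00"],  # assumed interval format
}
incomplete = {key: value for key, value in good.items() if key != "intervals"}

print(check_timeseries_query(good))
print(check_timeseries_query(incomplete))
```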
......
......@@ -41,7 +41,7 @@ These metrics track the number of characters added, deleted, and changed.
Setting Up
----------
There are two ways to setup Druid: download a tarball, or [[Build From Source]]. You only need to do one of these.
There are two ways to setup Druid: download a tarball, or [Build From Source](Build From Source.html). You only need to do one of these.
### Download a Tarball
......@@ -64,7 +64,7 @@ You should see a bunch of files:
Running Example Scripts
-----------------------
Let’s start doing stuff. You can start a Druid [[Realtime]] node by issuing:
Let’s start doing stuff. You can start a Druid [Realtime](Realtime.html) node by issuing:
./run_example_server.sh
......@@ -176,7 +176,7 @@ As you can probably tell, the result is indicating the maximum and minimum times
Return to your favorite editor and create the file:
<pre>timeseries_query.body</pre>
We are going to make a slightly more complicated query, the [[TimeseriesQuery]]. Copy and paste the following into the file:
We are going to make a slightly more complicated query, the [TimeseriesQuery](TimeseriesQuery.html). Copy and paste the following into the file:
<pre><code>
{
"queryType": "timeseries",
......@@ -200,7 +200,7 @@ We are going to make a slightly more complicated query, the [[TimeseriesQuery]].
}
</code></pre>
You are probably wondering, what are these [[Granularities]] and [[Aggregations]] things? What the query is doing is aggregating some metrics over some span of time.
You are probably wondering, what are these [Granularities](Granularities.html) and [Aggregations](Aggregations.html) things? What the query is doing is aggregating some metrics over some span of time.
To issue the query and get some results, run the following in your command line:
<pre><code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @timeseries_query.body</code>
......@@ -275,7 +275,7 @@ This gives us something like the following:
Solving a Problem
-----------------
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes you’ve been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [[GroupByQuery]]. It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes you’ve been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
Let’s create the file:
......@@ -317,7 +317,7 @@ Let’s create the file:
}
</code>
Woah! Our query just got a way more complicated. Now we have these [[Filters]] things and this [[OrderBy]] thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
Woah! Our query just got a way more complicated. Now we have these [Filters](Filters.html) things and this [OrderBy](OrderBy.html) thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
If you issue the query:
......@@ -357,9 +357,9 @@ Feel free to tweak other query parameters to answer other questions you may have
Next Steps
----------
What to know even more information about the Druid Cluster? Check out [[Tutorial: The Druid Cluster]]
What to know even more information about the Druid Cluster? Check out [Tutorial: The Druid Cluster](Tutorial: The Druid Cluster.html)
Druid is even more fun if you load your own data into it! To learn how to load your data, see [[Loading Your Data]].
Druid is even more fun if you load your own data into it! To learn how to load your data, see [Loading Your Data](Loading Your Data.html).
Additional Information
----------------------
......
......@@ -19,7 +19,7 @@ tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
```
You can also [[Build From Source]].
You can also [Build From Source](Build From Source.html).
## External Dependencies ##
......
......@@ -145,7 +145,7 @@ As you can probably tell, the result is indicating the maximum and minimum times
Return to your favorite editor and create the file:
<pre>timeseries_query.body</pre>
We are going to make a slightly more complicated query, the [[TimeseriesQuery]]. Copy and paste the following into the file:
We are going to make a slightly more complicated query, the [TimeseriesQuery](TimeseriesQuery.html). Copy and paste the following into the file:
<pre><code>
{
"queryType": "timeseries",
......@@ -168,7 +168,7 @@ We are going to make a slightly more complicated query, the [[TimeseriesQuery]].
}
</code></pre>
You are probably wondering, what are these [[Granularities]] and [[Aggregations]] things? What the query is doing is aggregating some metrics over some span of time.
You are probably wondering, what are these [Granularities](Granularities.html) and [Aggregations](Aggregations.html) things? What the query is doing is aggregating some metrics over some span of time.
To issue the query and get some results, run the following in your command line:
<pre><code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @timeseries_query.body</code>
......@@ -246,7 +246,7 @@ This gives us something like the following:
Solving a Problem
-----------------
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top states in the US are, ordered by the number of visits by known users over the last few minutes? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [[GroupByQuery]]. It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top states in the US are, ordered by the number of visits by known users over the last few minutes? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
Let’s create the file:
......@@ -292,7 +292,7 @@ Let’s create the file:
}
</code>
Woah! Our query just got a way more complicated. Now we have these [[Filters]] things and this [[OrderBy]] thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
Woah! Our query just got a way more complicated. Now we have these [Filters](Filters.html) things and this [OrderBy](OrderBy.html) thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
If you issue the query:
......@@ -346,8 +346,8 @@ Feel free to tweak other query parameters to answer other questions you may have
Next Steps
----------
What to know even more information about the Druid Cluster? Check out [[Tutorial: The Druid Cluster]]
Druid is even more fun if you load your own data into it! To learn how to load your data, see [[Loading Your Data]].
What to know even more information about the Druid Cluster? Check out [Tutorial: The Druid Cluster](Tutorial: The Druid Cluster.html)
Druid is even more fun if you load your own data into it! To learn how to load your data, see [Loading Your Data](Loading Your Data.html).
Additional Information
----------------------
......
---
layout: default
---
Greetings! We see you’ve taken an interest in Druid. That’s awesome! Hopefully this tutorial will help clarify some core Druid concepts. We will go through one of the Real-time [[Examples]], and issue some basic Druid queries. The data source we’ll be working with is the [Twitter spritzer stream](https://dev.twitter.com/docs/streaming-apis/streams/public). If you are ready to explore Druid, brave its challenges, and maybe learn a thing or two, read on!
Greetings! We see you’ve taken an interest in Druid. That’s awesome! Hopefully this tutorial will help clarify some core Druid concepts. We will go through one of the Real-time [Examples](Examples.html), and issue some basic Druid queries. The data source we’ll be working with is the [Twitter spritzer stream](https://dev.twitter.com/docs/streaming-apis/streams/public). If you are ready to explore Druid, brave its challenges, and maybe learn a thing or two, read on!
Setting Up
----------
......@@ -52,7 +52,7 @@ You can find the example executables in the examples/bin directory:
Running Example Scripts
-----------------------
Let’s start doing stuff. You can start a Druid [[Realtime]] node by issuing:
Let’s start doing stuff. You can start a Druid [Realtime](Realtime.html) node by issuing:
./run_example_server.sh
......@@ -175,7 +175,7 @@ If you said the result is indicating the maximum and minimum timestamps we've se
Return to your favorite editor and create the file:
<pre>timeseries_query.body</pre>
We are going to make a slightly more complicated query, the [[TimeseriesQuery]]. Copy and paste the following into the file:
We are going to make a slightly more complicated query, the [TimeseriesQuery](TimeseriesQuery.html). Copy and paste the following into the file:
<pre><code>{
"queryType":"timeseries",
"dataSource":"twitterstream",
......@@ -188,7 +188,7 @@ We are going to make a slightly more complicated query, the [[TimeseriesQuery]].
}
</code></pre>
You are probably wondering, what are these [[Granularities]] and [[Aggregations]] things? What the query is doing is aggregating some metrics over some span of time.
You are probably wondering, what are these [Granularities](Granularities.html) and [Aggregations](Aggregations.html) things? What the query is doing is aggregating some metrics over some span of time.
To issue the query and get some results, run the following in your command line:
<pre><code>curl -X POST 'http://localhost:8080/druid/v2/?pretty' -H 'content-type: application/json' -d @timeseries_query.body</code>
......@@ -252,7 +252,7 @@ This gives us something like the following:
Solving a Problem
-----------------
One of Druid’s main powers (see what we did there?) is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top hash tags are, ordered by the number tweets, where the language is english, over the last few minutes you’ve been reading this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [[GroupByQuery]]. It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
One of Druid’s main powers (see what we did there?) is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top hash tags are, ordered by the number tweets, where the language is english, over the last few minutes you’ve been reading this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
Let’s create the file:
......@@ -272,7 +272,7 @@ Let’s create the file:
}
</code>
Woah! Our query just got a way more complicated. Now we have these [[Filters]] things and this [[OrderBy]] thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
Woah! Our query just got a way more complicated. Now we have these [Filters](Filters.html) things and this [OrderBy](OrderBy.html) thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
If you issue the query:
......@@ -324,6 +324,6 @@ Feel free to tweak other query parameters to answer other questions you may have
Additional Information
----------------------
This tutorial is merely showcasing a small fraction of what Druid can do. Next, continue on to [[Loading Your Data]].
This tutorial is merely showcasing a small fraction of what Druid can do. Next, continue on to [Loading Your Data](Loading Your Data.html).
And thus concludes our journey! Hopefully you learned a thing or two about Druid real-time ingestion, querying Druid, and how Druid can be used to solve problems. If you have additional questions, feel free to post in our [google groups page](http://www.groups.google.com/forum/#!forum/druid-development).
......@@ -21,4 +21,4 @@ For external deployments, we recommend running the stable release tag. Releases
Tagging strategy
----------------
Tags of the codebase are equivalent to release candidates. We tag the code every time we want to take it through our release process, which includes some QA cycles and deployments. So, it is not safe to assume that a tag is a stable release, it is a solidification of the code as it goes through our production QA cycle and deployment. Tags will never change, but we often go through a number of iterations of tags before actually getting a stable release onto production. So, it is recommended that if you are not aware of what is on a tag, to stick to the stable releases listed on the [[Download]] page.
Tags of the codebase are equivalent to release candidates. We tag the code every time we want to take it through our release process, which includes some QA cycles and deployments. So, it is not safe to assume that a tag is a stable release, it is a solidification of the code as it goes through our production QA cycle and deployment. Tags will never change, but we often go through a number of iterations of tags before actually getting a stable release onto production. So, it is recommended that if you are not aware of what is on a tag, to stick to the stable releases listed on the [Download](Download.html) page.
......@@ -3,9 +3,9 @@ layout: default
---
Druid uses ZooKeeper (ZK) for management of current cluster state. The operations that happen over ZK are
1. [[Master]] leader election
2. Segment “publishing” protocol from [[Compute]] and [[Realtime]]
3. Segment load/drop protocol between [[Master]] and [[Compute]]
1. [Master](Master.html) leader election
2. Segment “publishing” protocol from [Compute](Compute.html) and [Realtime](Realtime.html)
3. Segment load/drop protocol between [Master](Master.html) and [Compute](Compute.html)
### Property Configuration
......@@ -41,7 +41,7 @@ We use the Curator LeadershipLatch recipe to do leader election at path
The `announcementsPath` and `servedSegmentsPath` are used for this.
All [[Compute]] and [[Realtime]] nodes publish themselves on the `announcementsPath`, specifically, they will create an ephemeral znode at
All [Compute](Compute.html) and [Realtime](Realtime.html) nodes publish themselves on the `announcementsPath`, specifically, they will create an ephemeral znode at
${druid.zk.paths.announcementsPath}/${druid.host}
......@@ -53,13 +53,13 @@ And as they load up segments, they will attach ephemeral znodes that look like
${druid.zk.paths.servedSegmentsPath}/${druid.host}/_segment_identifier_
Nodes like the [[Master]] and [[Broker]] can then watch these paths to see which nodes are currently serving which segments.
Nodes like the [Master](Master.html) and [Broker](Broker.html) can then watch these paths to see which nodes are currently serving which segments.
### Segment load/drop protocol between Master and Compute
The `loadQueuePath` is used for this.
When the [[Master]] decides that a [[Compute]] node should load or drop a segment, it writes an ephemeral znode to
When the [Master](Master.html) decides that a [Compute](Compute.html) node should load or drop a segment, it writes an ephemeral znode to
${druid.zk.paths.loadQueuePath}/_host_of_compute_node/_segment_identifier
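
As a rough illustration of the three znode layouts described above, the following sketch builds the announcement, served-segment, and load-queue paths from their configured base paths. The function names, the example base paths, the host, and the segment identifier are all hypothetical and chosen only for illustration; only the path templates themselves come from the text above.

```python
# Minimal sketch of the znode path templates described above.
# All concrete values (base paths, host, segment id) are hypothetical.

def announcement_path(announcements_path, host):
    """${druid.zk.paths.announcementsPath}/${druid.host}"""
    return f"{announcements_path}/{host}"


def served_segment_path(served_segments_path, host, segment_id):
    """${druid.zk.paths.servedSegmentsPath}/${druid.host}/_segment_identifier_"""
    return f"{served_segments_path}/{host}/{segment_id}"


def load_queue_path(load_queue_base, compute_host, segment_id):
    """${druid.zk.paths.loadQueuePath}/_host_of_compute_node/_segment_identifier"""
    return f"{load_queue_base}/{compute_host}/{segment_id}"


if __name__ == "__main__":
    host = "compute1.example.com:8080"      # hypothetical ${druid.host}
    segment = "example_segment_identifier"  # hypothetical segment identifier

    print(announcement_path("/druid/announcements", host))
    print(served_segment_path("/druid/servedSegments", host, segment))
    print(load_queue_path("/druid/loadQueue", host, segment))
```

The sketch only formats strings; creating the ephemeral znodes and watching the paths would be done through a ZooKeeper client, which is not shown here.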
......
......@@ -2,70 +2,70 @@
layout: default
---
Contents
\* [[Introduction|Home]]
\* [[Download]]
\* [[Support]]
\* [[Contribute]]
\* [Introduction|Home](Introduction|Home.html)
\* [Download](Download.html)
\* [Support](Support.html)
\* [Contribute](Contribute.html)
========================
Getting Started
\* [[Tutorial: A First Look at Druid]]
\* [[Tutorial: The Druid Cluster]]
\* [[Loading Your Data]]
\* [[Querying Your Data]]
\* [[Booting a Production Cluster]]
\* [[Examples]]
\* [[Cluster Setup]]
\* [[Configuration]]
\* [Tutorial: A First Look at Druid](Tutorial: A First Look at Druid.html)
\* [Tutorial: The Druid Cluster](Tutorial: The Druid Cluster.html)
\* [Loading Your Data](Loading Your Data.html)
\* [Querying Your Data](Querying Your Data.html)
\* [Booting a Production Cluster](Booting a Production Cluster.html)
\* [Examples](Examples.html)
\* [Cluster Setup](Cluster Setup.html)
\* [Configuration](Configuration.html)
--------------------------------------
Data Ingestion
\* [[Realtime]]
\* [[Batch|Batch Ingestion]]
\* [[Indexing Service]]
\* [Realtime](Realtime.html)
\* [Batch|Batch Ingestion](Batch|Batch Ingestion.html)
\* [Indexing Service](Indexing Service.html)
----------------------------
Querying
\* [[Querying]]
\* [Querying](Querying.html)
**\* ]
**\* [[Aggregations]]
**\* [Aggregations](Aggregations.html)
**\* ]
**\* [[Granularities]]
**\* [Granularities](Granularities.html)
\* Query Types
**\* ]
****\* ]
****\* ]
**\* [[SearchQuery]]
**\* [SearchQuery](SearchQuery.html)
**\* ]
** [[SegmentMetadataQuery]]
** [SegmentMetadataQuery](SegmentMetadataQuery.html)
**\* ]
**\* [[TimeseriesQuery]]
**\* [TimeseriesQuery](TimeseriesQuery.html)
---------------------------
Architecture
\* [[Design]]
\* [[Segments]]
\* [Design](Design.html)
\* [Segments](Segments.html)
\* Node Types
**\* ]
**\* [[Broker]]
**\* [Broker](Broker.html)
**\* ]
****\* ]
**\* [[Realtime]]
**\* [Realtime](Realtime.html)
**\* ]
**\* [[Plumber]]
**\* [Plumber](Plumber.html)
\* External Dependencies
**\* ]
**\* [[MySQL]]
**\* [MySQL](MySQL.html)
**\* ]
** [[Concepts and Terminology]]
** [Concepts and Terminology](Concepts and Terminology.html)
-------------------------------
Development
\* [[Versioning]]
\* [[Build From Source]]
\* [[Libraries]]
\* [Versioning](Versioning.html)
\* [Build From Source](Build From Source.html)
\* [Libraries](Libraries.html)
------------------------
Misc
\* [[Thanks]]
\* [Thanks](Thanks.html)
-------------