---
layout: doc_page
---

# Tutorial: The Druid Cluster
Welcome back! In our first [tutorial](Tutorial%3A-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. A few questions about the architecture naturally come to mind. What does the rest of the Druid cluster look like? How does Druid load static data that is already available?

This tutorial will hopefully answer these questions!

In this tutorial, we will set up other types of Druid nodes and external dependencies for a fully functional Druid cluster. The architecture of Druid is very much like the [Megazord](http://www.youtube.com/watch?v=7mQuHh1X4H4) from the popular 90s show Mighty Morphin' Power Rangers. Each Druid node has a specific purpose and the nodes come together to form a fully functional system.

## Downloading Druid

If you followed the first tutorial, you should already have Druid downloaded. If not, let's go back and do that first.

You can download the latest version of Druid [here](http://static.druid.io/artifacts/releases/druid-services-0.6.121-bin.tar.gz).

Untar the downloaded archive by issuing:

```bash
tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
```

You can also [Build From Source](Build-from-source.html).

## External Dependencies

Druid requires three external dependencies: "deep" storage that acts as a backup data repository, a relational database such as MySQL to hold configuration and metadata information, and [Apache Zookeeper](http://zookeeper.apache.org/) for coordination among different pieces of the cluster.

For deep storage, we have made a public S3 bucket (static.druid.io) available where data for this particular tutorial can be downloaded. More on the data later.
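
The prebuilt segment used in this tutorial lives in that bucket. The historical node will fetch it automatically later on, but if you are curious you can pull the zipped segment down by hand (an optional step; the key below is the same one referenced in the Loading the Data section):

```bash
# Optional: download the tutorial's prebuilt segment directly from the public bucket
curl -O "http://static.druid.io/data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip"
```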

#### Setting up MySQL

1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/).
2. Install MySQL.
3. Create a druid user and database.

```bash
mysql -u root
```

```sql
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE DATABASE druid;
```
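
To double-check that the grant and database took effect, you can log back in as the new user (a quick, optional sanity check using the credentials created above):

```bash
# Connect as the druid user (password: diurd) to the druid database
mysql -u druid -pdiurd druid -e "SELECT 1;"
```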

#### Setting up Zookeeper

```bash
curl http://www.motorlogy.com/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
cd ..
```
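
To make sure ZooKeeper came up cleanly before moving on, you can ask it for its status (an optional sanity check; the path below assumes you are back in the directory where you untarred ZooKeeper):

```bash
# Should report "Mode: standalone" for this single-node setup
zookeeper-3.4.5/bin/zkServer.sh status
```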

## The Data

Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and the edited page. Druid collects these individual events and packages them together in a container known as a [segment](Segments.html). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](Tutorial%3A-Loading-Your-Data-Part-1.html). The segment we are going to work with has the following format:

Dimensions (things to filter on):

```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```

Metrics (things to aggregate over):

```json
"count"
"added"
"delta"
"deleted"
```

## The Cluster

Let's start up a few nodes and download our data. First, let's make sure we have configs in the config directory for our various nodes. Issue the following from the Druid home directory:

```
ls config
```

If you are interested in learning more about Druid configuration files, check out this [link](Configuration.html). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.

#### Start a Coordinator Node

Coordinator nodes are in charge of load assignment and distribution. They monitor the status of the cluster and command historical nodes to load and drop segments.
For more information about coordinator nodes, see [here](Coordinator.html).

The coordinator config file should already exist at:

```
config/coordinator
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=coordinator
druid.port=8082

druid.zk.service.host=localhost

druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd

druid.coordinator.startDelay=PT60s
```

To start the coordinator node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```

#### Start a Historical Node

Historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
For more information about Historical nodes, see [here](Historical.html).

The historical config file should exist at:

```
config/historical
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=historical
druid.port=8081

druid.zk.service.host=localhost

druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.6.121"]

# Dummy read only AWS account (used to download example data)
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ

druid.server.maxSize=10000000000

# Change these to make Druid faster
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1

druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]
```

To start the historical node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```

#### Start a Broker Node

Broker nodes are responsible for figuring out which historical and/or realtime nodes correspond to which queries. They also merge partial results from these nodes in a scatter/gather fashion.
For more information about Broker nodes, see [here](Broker.html).

The broker config file should exist at:

```
config/broker
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=broker
druid.port=8080

druid.zk.service.host=localhost
```

To start the broker node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
```

## Loading the Data

The MySQL database we set up earlier contains a segment table (`druid_segments`) with entries for segments that should be loaded into our cluster. The Druid coordinator compares this table with the segments that already exist in the cluster to determine what should be loaded and dropped. To load our wikipedia segment, we need to create an entry in this table.

Usually, when new segments are created, these MySQL entries are created directly so you never have to do this by hand. For this tutorial, we can do this manually by going back into MySQL and issuing:

```sql
use druid;
INSERT INTO druid_segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
```
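
Before watching the logs, you can optionally confirm the row landed by querying the table we just wrote to (using the same credentials as before):

```bash
# Should return exactly one row for the wikipedia segment inserted above
mysql -u druid -pdiurd druid -e "SELECT id, dataSource, used FROM druid_segments;"
```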

If you look at your coordinator node logs, you should, within a minute or so, see log lines of the following form:

```
2013-08-08 22:48:41,967 INFO [main-EventThread] com.metamx.druid.coordinator.LoadQueuePeon - Server[/druid/loadQueue/127.0.0.1:8081] done processing [/druid/loadQueue/127.0.0.1:8081/wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
2013-08-08 22:48:41,969 INFO [ServerInventoryView-0] com.metamx.druid.client.SingleServerInventoryView - Server[127.0.0.1:8081] added segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
```

When the segment has finished downloading and is ready for queries, you should see the following message in your historical node logs:

```
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
```

At this point, we can query the segment. For more information on querying, see this [link](Querying.html).
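
As a concrete starting point, here is a minimal sketch of a query against the newly loaded segment, issued through the broker we started on port 8080. It assumes the standard `/druid/v2` query endpoint for this release and simply sums the `count` and `added` metrics over the day covered by the segment; the aggregator output names (`edit_count`, `chars_added`) are arbitrary labels chosen for this example:

```bash
# A minimal timeseries query POSTed to the broker (port 8080 from the config above)
curl -X POST 'http://localhost:8080/druid/v2/?pretty' \
  -H 'content-type: application/json' \
  -d '{
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": ["2013-08-01/2013-08-02"],
    "granularity": "all",
    "aggregations": [
      {"type": "longSum",   "name": "edit_count",  "fieldName": "count"},
      {"type": "doubleSum", "name": "chars_added", "fieldName": "added"}
    ]
  }'
```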

### Bonus Round: Start a Realtime Node

To start the realtime node that was used in our first tutorial, you simply have to issue:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/wikipedia/wikipedia_realtime.spec -classpath lib/*:config/realtime io.druid.cli.Main server realtime
```

The configuration is located in `config/realtime/runtime.properties` and should contain the following:

```
druid.host=localhost
druid.service=realtime
druid.port=8083

druid.zk.service.host=localhost

druid.extensions.coordinates=["io.druid.extensions:druid-examples:0.6.121","io.druid.extensions:druid-kafka-seven:0.6.121"]

# Change this config to db to hand off to the rest of the Druid cluster
druid.publish.type=noop

# These configs are only required for real hand off
# druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
# druid.db.connector.user=druid
# druid.db.connector.password=diurd

druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1

druid.monitoring.monitors=["io.druid.segment.realtime.RealtimeMetricsMonitor"]
```

Next Steps
----------
If you are interested in how data flows through the different Druid components, check out the [Druid data flow architecture](Design.html). Now that you have an understanding of what the Druid cluster looks like, why not load some of your own data?
Check out the next [tutorial](Tutorial%3A-Loading-Your-Data-Part-1.html) section for more info!