---
layout: doc_page
---
Welcome back! In our first [tutorial](Tutorial%3A-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture naturally come to mind: What does the rest of the Druid cluster look like? How does Druid load static data that is already available?

This tutorial will hopefully answer these questions!

In this tutorial, we will set up the other types of Druid nodes, as well as the external dependencies, needed for a fully functional Druid cluster. The architecture of Druid is very much like the [Megazord](http://www.youtube.com/watch?v=7mQuHh1X4H4) from the popular 90s show Mighty Morphin' Power Rangers: each Druid node has a specific purpose, and the nodes come together to form a fully functional system.

## Downloading Druid

If you followed the first tutorial, you should already have Druid downloaded. If not, let's go back and do that first.

You can download the latest version of Druid [here](http://static.druid.io/artifacts/releases/druid-services-0.6.0-bin.tar.gz)

and untar the contents by issuing:

```bash
tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
```

You can also [Build From Source](Build-from-source.html).

## External Dependencies

Druid requires three external dependencies: a "deep" storage that acts as a backup data repository; a relational database, such as MySQL, to hold configuration and metadata information; and [Apache Zookeeper](http://zookeeper.apache.org/) for coordination among the different pieces of the cluster.

For deep storage, we have made a public S3 bucket (static.druid.io) available, from which the data for this tutorial can be downloaded. More on the data later.

#### Setting up MySQL

1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/)
2. Install MySQL
3. Create a druid user and database

```bash
mysql -u root
```

```sql
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE DATABASE druid;
```
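
To sanity-check the setup (an optional step; this assumes MySQL is running locally with the user and password created above), you can log back in as the new user and confirm the database is visible:

```bash
# Should list the newly created "druid" database
mysql -u druid -pdiurd -e "SHOW DATABASES LIKE 'druid';"
```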

#### Setting up Zookeeper

```bash
curl http://www.motorlogy.com/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
cd ..
```
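
If you want to verify that Zookeeper came up correctly (optional; this assumes the default client port 2181 from `zoo_sample.cfg`), you can ask it directly:

```bash
# "status" reports the server mode; the "ruok" four-letter command
# should be answered with "imok" by a healthy server
./zookeeper-3.4.5/bin/zkServer.sh status
echo ruok | nc localhost 2181
```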

## The Data

Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page on Wikipedia, metadata is generated about the editor and the edited page. Druid collects each individual event and packages events together in a container known as a [segment](Segments.html). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](Tutorial%3A-Loading-Your-Data-Part-1.html). The segment we are going to work with has the following format:

Dimensions (things to filter on):

```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```

Metrics (things to aggregate over):

```json
"count"
"added"
"delta"
"deleted"
```
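
To make this format concrete, here is a sample raw event of the kind that gets rolled up into this segment (the same shape of event used in the first tutorial; the values below are purely illustrative):

```json
{
  "timestamp": "2013-08-01T01:02:33Z",
  "page": "Gypsy Danger", "language": "en", "user": "nuclear",
  "unpatrolled": "true", "newPage": "true", "robot": "false",
  "anonymous": "false", "namespace": "article",
  "continent": "North America", "country": "United States",
  "region": "Bay Area", "city": "San Francisco",
  "added": 57, "deleted": 200, "delta": -143
}
```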

## The Cluster

Let's start up a few nodes and download our data. First things first, though: let's make sure we have a config directory where we will store the configs for our various nodes:

```
ls config
```
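
The tarball ships with a default set of configs, so you should see subdirectories for the node types used in these tutorials, such as `broker`, `coordinator`, and `historical` (the exact listing may vary by version).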

If you are interested in learning more about Druid configuration files, check out this [link](Configuration.html). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.

#### Start a Coordinator Node

Coordinator nodes are in charge of load assignment and distribution. They monitor the status of the cluster and command historical nodes to load and drop segments.
For more information about coordinator nodes, see [here](Coordinator.html).

The coordinator config file should already exist at:

```
config/coordinator
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=coordinator
druid.port=8082

druid.zk.service.host=localhost

druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
druid.s3.secretKey=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

druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd

druid.coordinator.startDelay=PT60s
```

To start the coordinator node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```

#### Start a Historical Node

Historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
For more information about historical nodes, see [here](Historical.html).

The historical config file should exist at:

```
config/historical
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=historical
druid.port=8081

druid.zk.service.host=localhost

druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ

druid.server.maxSize=100000000

druid.processing.buffer.sizeBytes=10000000

druid.segmentCache.infoPath=/tmp/druid/segmentInfoCache
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 100000000}]
```
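
The two `druid.segmentCache` properties tell the historical node where to keep its local copy of downloaded segments and their metadata. The node should create these directories on demand, but you can also create them up front (optional):

```bash
# Local segment cache locations referenced in runtime.properties above
mkdir -p /tmp/druid/indexCache /tmp/druid/segmentInfoCache
```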

To start the historical node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```

#### Start a Broker Node

Broker nodes are responsible for figuring out which historical and/or realtime nodes correspond to which queries. They also merge partial results from these nodes in a scatter/gather fashion.
For more information about broker nodes, see [here](Broker.html).

The broker config file should exist at:

```
config/broker
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=broker
druid.port=8080

druid.zk.service.host=localhost
```

To start the broker node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
```
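
At this point, all three nodes should be up and announced in Zookeeper. If you are curious, you can peek at the announcements with the Zookeeper CLI (a sketch; run it from the directory where you untarred Zookeeper, and this assumes Druid's default `/druid` base znode):

```bash
# Each running node registers itself under /druid/announcements
./zookeeper-3.4.5/bin/zkCli.sh -server localhost:2181 ls /druid/announcements
```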

## Loading the Data

The MySQL database we set up earlier contains a `segments` table with entries for the segments that should be loaded into our cluster. The Druid coordinator compares this table against the segments that already exist in the cluster to determine what should be loaded and dropped. To load our Wikipedia segment, we need to create an entry in this table.

Usually, these MySQL entries are created for you when new segments are created, so you never have to do this by hand. For this tutorial, though, we can do it manually by going back into MySQL and issuing:

```sql
use druid;
INSERT INTO segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
```
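
To double-check that the row made it in (optional), you can read it back using the column names from the INSERT above:

```bash
mysql -u druid -pdiurd druid -e "SELECT id, dataSource, used FROM segments\G"
```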

If you look at your coordinator node logs, within a minute or so you should see log lines of the following form:

```
2013-08-08 22:48:41,967 INFO [main-EventThread] com.metamx.druid.coordinator.LoadQueuePeon - Server[/druid/loadQueue/127.0.0.1:8081] done processing [/druid/loadQueue/127.0.0.1:8081/wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
2013-08-08 22:48:41,969 INFO [ServerInventoryView-0] com.metamx.druid.client.SingleServerInventoryView - Server[127.0.0.1:8081] added segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
```

When the segment has finished downloading and is ready for queries, you should see the following message in your historical node logs:

```
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
```

At this point, we can query the segment. For more information on querying, see this [link](Querying.html).
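
As a quick smoke test (a sketch, not part of the original walkthrough; the file name is arbitrary, and the broker is assumed to be on port 8080, per the config above), you can send a simple time-boundary query through the broker:

```bash
# Write a minimal timeBoundary query to a file...
cat > time_boundary_query.body <<'EOF'
{
  "queryType": "timeBoundary",
  "dataSource": "wikipedia"
}
EOF

# ...and POST it to the broker
curl -X POST 'http://localhost:8080/druid/v2/?pretty' \
  -H 'content-type: application/json' \
  -d @time_boundary_query.body
```

The response should report the minimum and maximum timestamps of the data we loaded, i.e. the 2013-08-01/2013-08-02 interval of the Wikipedia segment.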

Next Steps
----------

Now that you have an understanding of what the Druid cluster looks like, why not load some of your own data?
Check out the next [tutorial](Tutorial%3A-Loading-Your-Data-Part-1.html) section for more info!