---
layout: doc_page
---
Welcome back! In our first [tutorial](Tutorial%3A-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture naturally come to mind. What does the rest of the Druid cluster look like? How does Druid load available static data?

This tutorial will hopefully answer these questions!

In this tutorial, we will set up the other types of Druid nodes as well as the external dependencies needed for a fully functional Druid cluster. The architecture of Druid is very much like the [Megazord](http://www.youtube.com/watch?v=7mQuHh1X4H4) from the popular 90s show Mighty Morphin' Power Rangers. Each Druid node has a specific purpose and the nodes come together to form a fully functional system.

## Downloading Druid

If you followed the first tutorial, you should already have Druid downloaded. If not, let's go back and do that first.

You can download the latest version of Druid [here](http://static.druid.io/artifacts/releases/druid-services-0.6.0-bin.tar.gz)

and untar the contents by issuing:

```bash
tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
```

You can also [Build From Source](Build-from-source.html).

## External Dependencies

Druid requires three external dependencies: "deep" storage, which acts as a backup data repository; a relational database such as MySQL, which holds configuration and metadata information; and [Apache Zookeeper](http://zookeeper.apache.org/), which coordinates the different pieces of the cluster.

For deep storage, we have made a public S3 bucket (static.druid.io) available where data for this particular tutorial can be downloaded. More on the data later.

#### Setting up MySQL

1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/)
2. Install MySQL
3. Create a druid user and database

```bash
mysql -u root
```

```sql
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE DATABASE druid;
```
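
If you want to double-check that the user and database were created correctly, a quick optional sanity check is to connect as the new user and run a trivial query:

```bash
# Connect as the druid user (password: diurd) and run a no-op query;
# if this prints a "1", the user and database are set up correctly.
mysql -u druid -pdiurd druid -e "SELECT 1;"
```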

#### Setting up Zookeeper

```bash
curl http://www.motorlogy.com/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
cd ..
```
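
To sanity-check that Zookeeper came up, you can send it the `ruok` four-letter command (assuming `nc` is available on your machine); a healthy server replies with `imok`:

```bash
# Zookeeper answers "imok" on its client port (2181 by default) if it is up.
echo ruok | nc localhost 2181
```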

## The Data

Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and the edited page. Druid collects each individual event and packages them together in a container known as a [segment](Segments.html). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](Tutorial%3A-Loading-Your-Data-Part-1.html). The segment we are going to work with has the following format:

Dimensions (things to filter on):

```json
[
  "page",
  "language",
  "user",
  "unpatrolled",
  "newPage",
  "robot",
  "anonymous",
  "namespace",
  "continent",
  "country",
  "region",
  "city"
]
```

Metrics (things to aggregate over):

```json
[
  "count",
  "added",
  "delta",
  "deleted"
]
```

## The Cluster

Let's start up a few nodes and download our data. First things first, though: let's make sure we have a config directory where we will store the configs for our various nodes:

```
ls config
```
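
You should see a subdirectory for each type of node we are going to run: `coordinator`, `historical`, `broker`, and `realtime`.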

If you are interested in learning more about Druid configuration files, check out this [link](Configuration.html). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.

#### Start a Coordinator Node

Coordinator nodes are in charge of load assignment and distribution. Coordinator nodes monitor the status of the cluster and command historical nodes to load and drop segments.
For more information about coordinator nodes, see [here](Coordinator.html).

The coordinator config file should already exist at:

```
config/coordinator
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=coordinator
druid.port=8082

druid.zk.service.host=localhost

druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd

druid.coordinator.startDelay=PT60s
```

To start the coordinator node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```

#### Start a Historical Node

Historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
For more information about Historical nodes, see [here](Historical.html).

The historical config file should exist at:

```
config/historical
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=historical
druid.port=8081

druid.zk.service.host=localhost

# Dummy read only AWS account (used to download example data)
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ

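# Maximum total size (in bytes) of segments this node will serve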
druid.server.maxSize=100000000

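# Size (in bytes) of the buffer used for intermediate results during query processing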
druid.processing.buffer.sizeBytes=10000000

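# Where this node keeps segment metadata and its local cache of downloaded segments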
druid.segmentCache.infoPath=/tmp/druid/segmentInfoCache
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 100000000}]
```

To start the historical node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```

#### Start a Broker Node

Broker nodes are responsible for figuring out which historical and/or realtime nodes correspond to which queries. They also merge partial results from these nodes in a scatter/gather fashion.
For more information about Broker nodes, see [here](Broker.html).

The broker config file should exist at:

```
config/broker
```

In the directory, there should be a `runtime.properties` file with the following contents:

```
druid.host=localhost
druid.service=broker
druid.port=8080

druid.zk.service.host=localhost
```

To start the broker node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
```
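
With all three nodes running, a quick optional way to confirm that each process came up is to check that it is listening on its configured port. This is just a sketch (it assumes `nc` is available); watching the logs works just as well:

```bash
# Coordinator (8082), historical (8081), and broker (8080) should all be listening.
for port in 8082 8081 8080; do
  nc -z localhost $port && echo "port $port: up" || echo "port $port: down"
done
```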

## Loading the Data

The MySQL dependency we introduced earlier contains a 'segments' table with an entry for every segment that should be loaded into our cluster. The Druid coordinator compares this table against the segments that already exist in the cluster to determine what should be loaded and dropped. To load our Wikipedia segment, we need to create an entry in our MySQL segments table.

Usually, when new segments are created these MySQL entries are created automatically, so you never have to do this by hand. For this tutorial, we will do it manually by going back into MySQL and issuing:

```sql
use druid;
INSERT INTO segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
```
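
To confirm that the row landed, you can optionally query the table back (the columns below come from the INSERT statement above):

```bash
# List the segment entries the coordinator will see.
mysql -u druid -pdiurd druid -e "SELECT id, dataSource, used FROM segments;"
```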

If you look at your coordinator node logs, you should see lines of the following form within a minute or so:

```
2013-08-08 22:48:41,967 INFO [main-EventThread] com.metamx.druid.coordinator.LoadQueuePeon - Server[/druid/loadQueue/127.0.0.1:8081] done processing [/druid/loadQueue/127.0.0.1:8081/wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
2013-08-08 22:48:41,969 INFO [ServerInventoryView-0] com.metamx.druid.client.SingleServerInventoryView - Server[127.0.0.1:8081] added segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
```

When the segment has finished downloading and is ready for queries, you should see the following message in your historical node logs:

```
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
```

At this point, we can query the segment. For more information on querying, see this [link](Querying.html).
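
As a minimal sketch, assuming the broker is still running on port 8080 as configured above, you can POST a simple `timeBoundary` query to the broker; it should return the time extent of the data we just loaded (2013-08-01 to 2013-08-02):

```bash
# Ask Druid for the earliest and latest timestamps in the wikipedia datasource.
curl -X POST 'http://localhost:8080/druid/v2/?pretty' \
  -H 'content-type: application/json' \
  -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}'
```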

### Bonus Round: Start a Realtime Node

To start the realtime node that was used in our first tutorial, you simply have to issue:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/wikipedia/wikipedia_realtime.spec -classpath lib/*:config/realtime io.druid.cli.Main server realtime
```

The configuration is located in `config/realtime/runtime.properties` and should contain the following:

```
druid.host=localhost
druid.service=realtime
druid.port=8083

druid.zk.service.host=localhost

druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd

druid.processing.buffer.sizeBytes=10000000
```

## Next Steps

Now that you have an understanding of what the Druid cluster looks like, why not load some of your own data?
Check out the next [tutorial](Tutorial%3A-Loading-Your-Data-Part-1.html) for more info!