---
title:  "Scala API Examples"
---

The following example programs showcase different applications of Flink from simple word counting to graph algorithms.
The code samples illustrate the use of [Flink's Scala API](scala_api_guide.html). 

The full source code of the following and more examples can be found in the [flink-scala-examples](https://github.com/apache/incubator-flink/tree/master/flink-examples/flink-scala-examples) module.

# Word Count

WordCount is the "Hello World" of Big Data processing systems. It computes the frequency of words in a text collection. The algorithm works in two steps: First, the text is split into individual words. Second, the words are grouped and counted.

```scala
// read input data
val input = TextFile(textInput)

// tokenize words
val words = input.flatMap { _.split(" ") map { (_, 1) } }

// count by word
val counts = words.groupBy { case (word, _) => word }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

val output = counts.write(wordsOutput, CsvOutputFormat())
```
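
To see what the two steps compute, here is the same split/group/count logic on a plain Scala collection, as a local sketch with a made-up sample line, independent of the Flink API:

```scala
// the same two steps on a standard Scala collection
val lines = Seq("to be or not to be")

val pairs  = lines.flatMap { _.split(" ").map((_, 1)) }
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
// counts contains: to -> 2, be -> 2, or -> 1, not -> 1
```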

The {% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/wordcount/WordCount.scala "WordCount example" %} implements the above described algorithm with input parameters: `<degree of parallelism>, <text input path>, <output path>`. As test data, any text file will do.

# Page Rank

The PageRank algorithm computes the "importance" of pages in a graph defined by links, which point from one page to another. It is an iterative graph algorithm, which means that it repeatedly applies the same computation. In each iteration, each page distributes its current rank over all its neighbors, and computes its new rank as a taxed sum of the ranks it received from its neighbors. The PageRank algorithm was popularized by the Google search engine, which uses the importance of webpages to rank the results of search queries.
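
Written out, the per-page update rule is `newRank = (1 - d) / N + d * Σ receivedRanks`, where `d` is the damping factor and `N` the number of pages. A local sketch of a single update with made-up numbers:

```scala
// one local rank update for a single page (hypothetical 4-page graph)
val d = 0.85                                  // damping factor
val n = 4                                     // number of pages
val receivedRanks = Seq(0.125, 0.0625, 0.25)  // contributions from in-neighbors

val newRank = (1.0 - d) / n + d * receivedRanks.sum
```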

In this simple example, PageRank is implemented with a [bulk iteration](java_api_guide.html#iterations) and a fixed number of iterations.

```scala
// case classes so we have named fields
case class PageWithRank(pageId: Long, rank: Double)
case class Edge(from: Long, to: Long, transitionProbability: Double)

// constants for the page rank formula
val dampening = 0.85
val randomJump = (1.0 - dampening) / NUM_VERTICES
val initialRank = 1.0 / NUM_VERTICES
  
// read inputs
val pages = DataSource(verticesPath, CsvInputFormat[Long]())
val edges = DataSource(edgesPath, CsvInputFormat[Edge]())

// assign initial rank
val pagesWithRank = pages map { p => PageWithRank(p, initialRank) }

// the iterative computation
def computeRank(ranks: DataSet[PageWithRank]) = {

    // send rank to neighbors
    val ranksForNeighbors = ranks join edges
        where { _.pageId } isEqualTo { _.from }
        map { (p, e) => (e.to, p.rank * e.transitionProbability) }
    
    // gather ranks per vertex and apply page rank formula
    ranksForNeighbors.groupBy { case (node, rank) => node }
                     .reduce { (a, b) => (a._1, a._2 + b._2) }
                     .map { case (node, rank) => PageWithRank(node, rank * dampening + randomJump) }
}

// invoke iteratively
val finalRanks = pagesWithRank.iterate(numIterations, computeRank)
val output = finalRanks.write(outputPath, CsvOutputFormat())
```



The {% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/graph/PageRank.scala "PageRank program" %} implements the above example.
It requires the following parameters to run: `<pages input path>, <link input path>, <output path>, <num pages>, <num iterations>`.

Input files are plain text files and must be formatted as follows (a short parsing sketch follows the list):
- Pages are represented by a (long) ID, separated by new-line characters.
    * For example `"1\n2\n12\n42\n63\n"` gives five pages with IDs 1, 2, 12, 42, and 63.
- Links are represented as pairs of page IDs, separated by space characters. Links are separated by new-line characters:
    * For example `"1 2\n2 12\n1 12\n42 63\n"` gives four (directed) links (1)->(2), (2)->(12), (1)->(12), and (42)->(63).
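
A local sketch of how these two formats map to values (plain Scala, not part of the example program):

```scala
// parse the page and link formats described above
val pageIds = "1\n2\n12\n42\n63\n".split("\n").map(_.toLong)
// -> Array(1, 2, 12, 42, 63)

val links = "1 2\n2 12\n1 12\n42 63\n".split("\n").map { line =>
  val Array(from, to) = line.split(" ")
  (from.toLong, to.toLong)   // directed link from -> to
}
// -> Array((1,2), (2,12), (1,12), (42,63))
```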

For this simple implementation it is required that each page has at least one incoming and one outgoing link (a page can point to itself).

# Connected Components

The Connected Components algorithm identifies the connected parts of a larger graph by assigning all vertices in the same connected part the same component ID. Like PageRank, Connected Components is an iterative algorithm. In each step, each vertex propagates its current component ID to all its neighbors. A vertex accepts the component ID of a neighbor if it is smaller than its own component ID.
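
To make one propagation step concrete, here is a local sketch on plain Scala collections, using a made-up path graph 1 - 2 - 3:

```scala
// one propagation step; every vertex initially holds its own ID as component ID
val components = Map(1L -> 1L, 2L -> 2L, 3L -> 3L)
val undirected = Seq((1L, 2L), (2L, 1L), (2L, 3L), (3L, 2L))

// each vertex keeps the smallest of its own and its neighbors' component IDs
val updated = components.map { case (v, c) =>
  val neighborIds = undirected.collect { case (from, to) if from == v => components(to) }
  v -> (c +: neighborIds).min
}
// updated: 1 -> 1, 2 -> 1, 3 -> 2; one more step also turns 3 -> 1
```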

This implementation uses a [delta iteration](iterations.html): vertices that have not changed their component ID do not participate in the next step. This yields much better performance, because later iterations typically deal only with a few outlier vertices.
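
Conceptually, a delta iteration threads two data sets through the loop: the solution set and a (typically shrinking) workset. A minimal local sketch of that control flow, with hypothetical names and plain Scala maps rather than the Flink API:

```scala
// hypothetical local model of delta-iteration control flow: step maps
// (solution, workset) to (delta, next workset); the loop ends when the
// workset is empty or maxIterations is reached
def iterateWithDeltaLocally[K, V](
    initialSolution: Map[K, V],
    initialWorkset: Map[K, V],
    step: (Map[K, V], Map[K, V]) => (Map[K, V], Map[K, V]),
    maxIterations: Int): Map[K, V] = {
  var solution = initialSolution
  var workset = initialWorkset
  var i = 0
  while (workset.nonEmpty && i < maxIterations) {
    val (delta, nextWorkset) = step(solution, workset)
    solution = solution ++ delta   // delta entries replace same-keyed entries
    workset = nextWorkset
    i += 1
  }
  solution
}
```

The Flink program below expresses the same pattern with `iterateWithDelta`: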

```scala
// define case classes
case class VertexWithComponent(vertex: Long, componentId: Long)
case class Edge(from: Long, to: Long)

// get input data
val vertices = DataSource(verticesPath, CsvInputFormat[Long]())
val directedEdges = DataSource(edgesPath, CsvInputFormat[Edge]())

// assign each vertex its own ID as component ID
val initialComponents = vertices map { v => VertexWithComponent(v, v) }
val undirectedEdges = directedEdges flatMap { e => Seq(e, Edge(e.to, e.from)) }

def propagateComponent(s: DataSet[VertexWithComponent], ws: DataSet[VertexWithComponent]) = {
    val allNeighbors = ws join undirectedEdges
        where { _.vertex } isEqualTo { _.from }
        map { (v, e) => VertexWithComponent(e.to, v.componentId) }

    val minNeighbors = allNeighbors groupBy { _.vertex } reduceGroup { cs => cs minBy { _.componentId } }

    // updated solution elements == new workset
    val s1 = s join minNeighbors
        where { _.vertex } isEqualTo { _.vertex }
        flatMap { (curr, candidate) =>
            if (candidate.componentId < curr.componentId) Some(candidate) else None
        }

    (s1, s1)
}

val components = initialComponents.iterateWithDelta(initialComponents, { _.vertex }, propagateComponent,
                    maxIterations)
val output = components.write(componentsOutput, CsvOutputFormat())
```

The {% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/graph/ConnectedComponents.scala "ConnectedComponents program" %} implements the above example. It requires the following parameters to run: `<vertex input path>, <edge input path>, <output path>, <max num iterations>`.

Input files are plain text files and must be formatted as follows:
- Vertices are represented by IDs, separated by new-line characters.
    * For example `"1\n2\n12\n42\n63\n"` gives five vertices (1), (2), (12), (42), and (63).
- Edges are represented as pairs of vertex IDs, separated by space characters. Edges are separated by new-line characters:
    * For example `"1 2\n2 12\n1 12\n42 63\n"` gives four (undirected) links (1)-(2), (2)-(12), (1)-(12), and (42)-(63).

# Relational Query

The Relational Query example assumes two tables, one with `orders` and the other with `lineitems`, as specified by the [TPC-H decision support benchmark](http://www.tpc.org/tpch/). TPC-H is a standard benchmark in the database industry. See below for instructions on how to generate the input data.

The example implements the following SQL query.

```sql
SELECT l_orderkey, o_shippriority, sum(l_extendedprice) as revenue
    FROM orders, lineitem
WHERE l_orderkey = o_orderkey
    AND o_orderstatus = 'F'
    AND YEAR(o_orderdate) > 1993
    AND o_orderpriority LIKE '5%'
GROUP BY l_orderkey, o_shippriority;
```

The Flink Scala program that implements the above query looks as follows.

```scala
// --- define some custom classes to address fields by name ---
case class Order(orderId: Int, status: Char, date: String, orderPriority: String, shipPriority: Int)
case class LineItem(orderId: Int, extendedPrice: Double)
case class PrioritizedOrder(orderId: Int, shipPriority: Int, revenue: Double)

val orders = DataSource(ordersInputPath, DelimitedInputFormat(parseOrder))
val lineItems = DataSource(lineItemsInput, DelimitedInputFormat(parseLineItem))

val filteredOrders = orders filter { o => o.status == 'F' && o.date.substring(0, 4).toInt > 1993 && o.orderPriority.startsWith("5") }

val prioritizedItems = filteredOrders join lineItems
    where { _.orderId } isEqualTo { _.orderId } // join on the orderIds
    map { (o, li) => PrioritizedOrder(o.orderId, o.shipPriority, li.extendedPrice) }

val prioritizedOrders = prioritizedItems
    groupBy { pi => (pi.orderId, pi.shipPriority) } 
    reduce { (po1, po2) => po1.copy(revenue = po1.revenue + po2.revenue) }

val output = prioritizedOrders.write(ordersOutput, CsvOutputFormat(formatOutput))
```
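
The parser functions `parseOrder` and `parseLineItem` referenced by the `DelimitedInputFormat`s are not shown above. A possible sketch, assuming pipe-delimited TPC-H `.tbl` rows as produced by DBGEN (column positions follow the TPC-H schema):

```scala
// hypothetical parsers for pipe-delimited TPC-H rows
def parseOrder(line: String): Order = {
  val f = line.split('|')
  // orders columns: orderkey, custkey, orderstatus, totalprice,
  // orderdate, orderpriority, clerk, shippriority, comment
  Order(f(0).toInt, f(2).charAt(0), f(4), f(5), f(7).toInt)
}

def parseLineItem(line: String): LineItem = {
  val f = line.split('|')
  // lineitem columns: orderkey, partkey, suppkey, linenumber,
  // quantity, extendedprice, ...
  LineItem(f(0).toInt, f(5).toDouble)
}
```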

The {% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/relational/RelationalQuery.scala "Relational Query program" %} implements the above query. It requires the following parameters to run: `<orders input path>, <lineitem input path>, <output path>, <degree of parallelism>`.

The orders and lineitem files can be generated using the [TPC-H benchmark](http://www.tpc.org/tpch/) suite's data generator tool (DBGEN). 
Take the following steps to generate arbitrarily large input files for the provided Flink programs:

1.  Download and unpack DBGEN
2.  Make a copy of *makefile.suite* called *Makefile* and perform the following changes:

```makefile
DATABASE = DB2
MACHINE  = LINUX
WORKLOAD = TPCH
CC       = gcc
```

3.  Build DBGEN using *make*
4.  Generate lineitem and orders relations using dbgen. A scale factor
    (-s) of 1 results in a generated data set of about 1 GB in size.

```bash
./dbgen -T o -s 1
```