1. July 23, 2014 (2 commits)
    • A
      SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage · 85d3596e
      Committed by Aaron Davidson
      ### Why and what?
      Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which carry a nontrivial overhead.
      
      This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way that introduces no more than one virtual function invocation of overhead at each abstraction point.
      
      Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl
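
      The KV-array idea can be sketched in plain Python: sort the flat [key, value, key, value, ...] array in place, moving each key and its value slot together, so no (key, value) tuples are ever allocated. (A minimal insertion sort for illustration only; the actual Sorter ports Timsort and lives in Java.)

      ```python
      def sort_kv_array(a):
          """Sort a flat [k0, v0, k1, v1, ...] list by key in place,
          swapping key and value slots together so that no (key, value)
          tuples are ever allocated. Illustrative insertion sort, not
          Spark's Timsort-based Java implementation."""
          n = len(a) // 2  # number of kv pairs
          for i in range(1, n):
              j = i
              # compare keys only (even slots); move the pair as a unit
              while j > 0 and a[2 * (j - 1)] > a[2 * j]:
                  a[2 * (j - 1)], a[2 * j] = a[2 * j], a[2 * (j - 1)]
                  a[2 * (j - 1) + 1], a[2 * j + 1] = a[2 * j + 1], a[2 * (j - 1) + 1]
                  j -= 1
          return a
      ```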
      
      ### Memory implications
      An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
      
      Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
      This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).
      
      The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
      This results in a worst-case sorting overhead of 4N bytes.
      
      Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
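
      The arithmetic above can be checked directly (assuming the stated compressed-OOPS figures: 4-byte references and roughly 24 bytes per Tuple2):

      ```python
      def worst_case_sort_overhead_bytes(n):
          """Worst-case extra bytes for sorting n kv pairs, per the
          analysis above: n Tuple2 objects plus the Java 7 half-array
          copy, versus a Timsort half-copy of the 2n-slot KV array."""
          tuple_sort = 24 * n + 2 * n   # 24n for tuples, 2n for half-array copy
          kv_array_sort = 4 * n         # half of the 8n-byte flat array
          return tuple_sort, kv_array_sort
      ```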
      
      ### Performance implications
      As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than to improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.
      
      Here are the results of a microbenchmark that sorted 25 million randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".
      
      <table>
      <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
      <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
      <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
      <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
      <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
      </table>
      
      The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.
      
      In short, this patch should significantly improve performance for users running either Java 6 or 7.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1502 from aarondav/sort and squashes the following commits:
      
      652d936 [Aaron Davidson] Update license, move Sorter to java src
      a7b5b1c [Aaron Davidson] fix licenses
      5c0efaf [Aaron Davidson] Update tmpLength
      ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
      034bf10 [Aaron Davidson] Change to Apache v2 Timsort
      b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
      6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
    • X
      [MLLIB] make Mima ignore updateFeatures (private) in ALS · 14078717
      Committed by Xiangrui Meng
      Fix Mima issues in #1521.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1533 from mengxr/mima-als and squashes the following commits:
      
      78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS
  2. July 22, 2014 (9 commits)
    • P
      [SPARK-2612] [mllib] Fix data skew in ALS · 75db1742
      Committed by peng.zhang
      Author: peng.zhang <peng.zhang@xiaomi.com>
      
      Closes #1521 from renozhang/fix-als and squashes the following commits:
      
      b5727a4 [peng.zhang] Remove no need argument
      1a4f7a0 [peng.zhang] Fix data skew in ALS
    • P
      [SPARK-2452] Create a new valid for each instead of using lineId. · 81fec992
      Committed by Prashant Sharma
      Author: Prashant Sharma <prashant@apache.org>
      
      Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits:
      
      26c5c72 [Prashant Sharma] Added a test case.
      7e8d28d [Prashant Sharma] SPARK-2452, create a new valid for each  instead of using lineId, because Line ids can be same sometimes.
    • N
      [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Committed by Nicholas Chammas
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
    • G
      [SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear · c3462c65
      Committed by Gregory Owen
      Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly
      
      New output:
      
      ```
      (3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
       |  MappedValuesRDD[324] at apply at Transformer.scala:22
       |  CoGroupedRDD[323] at apply at Transformer.scala:22
       +-(5) MappedRDD[320] at apply at Transformer.scala:22
       |  |  MappedRDD[319] at apply at Transformer.scala:22
       |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
       |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
       |  |  ShuffledRDD[316] at apply at Transformer.scala:22
       |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
       |     |   ParallelCollectionRDD[314] at apply at Transformer.scala:22
       +-(100) MappedRDD[322] at apply at Transformer.scala:22
           |   ParallelCollectionRDD[321] at apply at Transformer.scala:22
      ```
      
      Author: Gregory Owen <greowen@gmail.com>
      
      Closes #1364 from GregOwen/to-debug-string and squashes the following commits:
      
      08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
      1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
    • M
      [SPARK-2561][SQL] Fix apply schema · 511a7314
      Committed by Michael Armbrust
      We need to use the analyzed attributes; otherwise we end up with a tree that will never resolve.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
      
      f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
      4969015 [Michael Armbrust] Add test case.
    • B
      [SPARK-2434][MLlib]: Warning messages that point users to original MLlib... · a4d60208
      Committed by Burak
      [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples
      
      [SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms have been added, both in the comments and in the code. The following examples have been modified:
      Scala:
      * LocalALS
      * LocalFileLR
      * LocalKMeans
      * LocalLR
      * SparkALS
      * SparkHdfsLR
      * SparkKMeans
      * SparkLR
      Python:
       * kmeans.py
       * als.py
       * logistic_regression.py
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:
      
      7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
      b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
      4762f39 [Burak] [SPARK-2434]: Warning messages added
      17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
      2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementaions added.
    • A
      Fix flakey HiveQuerySuite test · abeacffb
      Committed by Aaron Davidson
      Results may not be returned in the expected order, so relax that constraint.
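
      The relaxation amounts to comparing query results as multisets rather than as ordered sequences; a minimal sketch (the helper name is ours, not the HiveQuerySuite API):

      ```python
      from collections import Counter

      def same_rows_any_order(expected, actual):
          """Compare two result sets as multisets of rows, ignoring row
          order -- the kind of relaxation used to de-flake tests whose
          queries carry no ORDER BY guarantee. Hypothetical helper."""
          return Counter(map(tuple, expected)) == Counter(map(tuple, actual))
      ```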
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1514 from aarondav/flakey and squashes the following commits:
      
      e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
    • D
      [SPARK-2494] [PySpark] make hash of None consistant cross machines · 872538c6
      Committed by Davies Liu
      In CPython, the hash of None differs across machines, which can cause wrong results during a shuffle. This PR fixes that.
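
      The fix boils down to partitioning with a portable hash instead of the builtin one. A minimal sketch of the idea (the helper name is ours; CPython 2 derives `hash(None)` from a memory address, so it varies between interpreter processes):

      ```python
      def portable_hash(x):
          """Hash that is stable for None across processes and machines,
          for use when assigning keys to partitions. Illustrative sketch
          of the partitionBy-only override described above."""
          if x is None:
              return 0  # fixed, machine-independent value
          return hash(x)

      def partition_for(key, num_partitions):
          """Pick a partition deterministically across machines."""
          return portable_hash(key) % num_partitions
      ```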
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1371 from davies/hash_of_none and squashes the following commits:
      
      d01745f [Davies Liu] add comments, remove outdated unit tests
      5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
      b7118aa [Davies Liu] use __builtin__ instead of __builtins__
      839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
    • S
      SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler · f89cf65d
      Committed by Sandy Ryza
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #634 from sryza/sandy-spark-1707 and squashes the following commits:
      
      2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN
      354c630 [Sandy Ryza] Remove outdated comments
      c744ef3 [Sandy Ryza] Take out waitForInitialAllocations
      2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler
  3. July 21, 2014 (7 commits)
    • C
      [SPARK-2190][SQL] Specialized ColumnType for Timestamp · cd273a23
      Committed by Cheng Lian
      JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190)
      
      Added specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1440 from liancheng/timestamp-column-type and squashes the following commits:
      
      e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests.
      53a358f [Cheng Lian] Fixed failed test suites
      01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue
      2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests
      45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation
    • M
      [SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0 · db56f2df
      Committed by Michael Giannakopoulos
      Standalone application examples written in Java are added to the 'mllib-linear-methods.md' file.
      This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945).
      Also I changed the name of the sigmoid function from 'logit' to 'f'. This is because the logit function
      is the inverse of sigmoid.
      
      Thanks,
      Michael
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1311 from miccagiann/master and squashes the following commits:
      
      8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
      f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible.
      cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards.
      b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
      837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
    • K
      Improve scheduler delay tooltip. · f6e7302c
      Committed by Kay Ousterhout
      As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1488 from kayousterhout/better_tooltips and squashes the following commits:
      
      22176fd [Kay Ousterhout] Improve scheduler delay tooltip.
    • X
      [SPARK-2552][MLLIB] stabilize logistic function in pyspark · b86db517
      Committed by Xiangrui Meng
      to avoid overflow in `exp(x)` if `x` is large.
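
      The standard stabilization trick, sketched here in plain Python (PySpark's actual code may differ), is to evaluate `exp` only on a non-positive argument so it can never overflow:

      ```python
      import math

      def stable_logistic(x):
          """Numerically stable 1 / (1 + exp(-x)). For x < 0, rewrite as
          exp(x) / (1 + exp(x)) so exp always sees a non-positive
          argument. Illustrative sketch of the overflow-avoidance trick."""
          if x >= 0:
              return 1.0 / (1.0 + math.exp(-x))
          z = math.exp(x)  # x < 0, so z <= 1: no overflow possible
          return z / (1.0 + z)
      ```

      A naive `1 / (1 + math.exp(-x))` raises OverflowError for large negative `x`; this version does not.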
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1493 from mengxr/py-logistic and squashes the following commits:
      
      259e863 [Xiangrui Meng] stabilize logistic function in pyspark
    • S
      SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant · 9564f854
      Committed by Sandy Ryza
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1474 from sryza/sandy-spark-2564 and squashes the following commits:
      
      35b8388 [Sandy Ryza] Fix compile error on upmerge
      7b985fb [Sandy Ryza] Fix test compile error
      43f79e6 [Sandy Ryza] SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant
    • X
      [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors · 1b10b811
      Committed by Xiangrui Meng
      This is part of SPARK-2495, allowing users to construct linear models manually.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1492 from mengxr/public-constructor and squashes the following commits:
      
      a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
    • R
      [SPARK-2598] RangePartitioner's binary search does not use the given Ordering · fa51b0fb
      Committed by Reynold Xin
      We should fix this in branch-1.0 as well.
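
      The essence of the bug: a binary search over a partitioner's range bounds must use the caller-supplied Ordering, not the elements' natural order. A language-agnostic sketch of an ordering-aware lower-bound search (not the Scala implementation):

      ```python
      def binary_search(arr, key, less_than):
          """Lower-bound binary search over arr, sorted according to the
          supplied less_than comparison. The fix here is that less_than
          is honored rather than falling back to the natural '<'."""
          lo, hi = 0, len(arr)
          while lo < hi:
              mid = (lo + hi) // 2
              if less_than(arr[mid], key):
                  lo = mid + 1
              else:
                  hi = mid
          return lo
      ```

      With a reversed ordering (descending array, `less_than` meaning "greater"), the same routine still finds the right insertion point.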
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1500 from rxin/rangePartitioner and squashes the following commits:
      
      c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
  4. July 20, 2014 (8 commits)
  5. July 19, 2014 (9 commits)
    • L
      put 'curRequestSize = 0' after 'logDebug' it · 805f329b
      Committed by Lijie Xu
      This is a minor change. We should first logDebug($curRequestSize) and then set it to 0.
      
      Author: Lijie Xu <csxulijie@gmail.com>
      
      Closes #1477 from JerryLead/patch-1 and squashes the following commits:
      
      aed722d [Lijie Xu] put 'curRequestSize = 0' after 'logDebug' it
    • R
      [SPARK-2521] Broadcast RDD object (instead of sending it along with every task). · 7b8cd175
      Committed by Reynold Xin
      Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send it to the executors multiple times. This is especially bad when a closure references a variable that is very large. The current design led to users having to explicitly broadcast large variables.
      
      The patch uses broadcast to send RDD objects and closures to executors, and uses Akka to send only a reference to the broadcast RDD/closure along with the partition-specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses a Hadoop input, because the JobConf is large.
      
      The user-facing impacts of the change include:
      
      1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations.
      2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput.
      
      In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently).
      
      A simple way to test this:
      ```scala
      val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
      sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
      ```
      
      Numbers on 3 r3.8xlarge instances on EC2
      ```
      master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
      with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
      ```
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1452 from rxin/broadcast-task and squashes the following commits:
      
      762e0be [Reynold Xin] Warn large broadcasts.
      ade6eac [Reynold Xin] Log broadcast size.
      c3b6f11 [Reynold Xin] Added a unit test for clean up.
      754085f [Reynold Xin] Explain why broadcasting serialized copy of the task.
      04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet (instead of sending it for every task).
    • D
      [SPARK-2359][MLlib] Correlations · a243364b
      Committed by Doris Xin
      Implementation for Pearson and Spearman's correlation.
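
      For reference, the two statistics on plain in-memory sequences (MLlib computes them in a distributed fashion over RDDs; this sketch ignores rank ties for brevity):

      ```python
      import math

      def pearson(x, y):
          """Pearson correlation of two equal-length sequences."""
          n = len(x)
          mx, my = sum(x) / n, sum(y) / n
          cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sx = math.sqrt(sum((a - mx) ** 2 for a in x))
          sy = math.sqrt(sum((b - my) ** 2 for b in y))
          return cov / (sx * sy)

      def spearman(x, y):
          """Spearman correlation: Pearson applied to the ranks of the
          values. Ties would normally get averaged ranks; omitted here."""
          def ranks(v):
              order = sorted(range(len(v)), key=lambda i: v[i])
              r = [0.0] * len(v)
              for rank, i in enumerate(order):
                  r[i] = float(rank)
              return r
          return pearson(ranks(x), ranks(y))
      ```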
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1367 from dorx/correlation and squashes the following commits:
      
      c0dd7dc [Doris Xin] here we go
      32d83a3 [Doris Xin] Reviewer comments
      4db0da1 [Doris Xin] added private[stat] to Spearman
      b716f70 [Doris Xin] minor fixes
      6e1b42a [Doris Xin] More comments addressed. Still some open questions
      8104f44 [Doris Xin] addressed comments. some open questions still
      39387c2 [Doris Xin] added missing header
      bd3cf19 [Doris Xin] Merge branch 'master' into correlation
      6341884 [Doris Xin] race condition bug squished
      bd2bacf [Doris Xin] Race condition bug
      b775ff9 [Doris Xin] old wrong impl
      534ebf2 [Doris Xin] Merge branch 'master' into correlation
      818fa31 [Doris Xin] wip units
      9d808ee [Doris Xin] wip units
      b843a13 [Doris Xin] revert change in stat counter
      28561b6 [Doris Xin] wip
      bb2e977 [Doris Xin] minor fix
      8e02c63 [Doris Xin] Merge branch 'master' into correlation
      2a40aa1 [Doris Xin] initial, untested implementation of Pearson
      dfc4854 [Doris Xin] WIP
    • K
      [SPARK-2571] Correctly report shuffle read metrics. · 7b971b91
      Committed by Kay Ousterhout
      Currently, shuffle read metrics are incorrectly reported when stages have multiple shuffle dependencies (they are set to be the metrics from just one of the shuffle dependencies, rather than the accumulated metrics from all of the shuffle dependencies).  This fixes that problem, and should probably be back-ported to the 0.9 branch.
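
      The fix amounts to summing the metrics across all of a stage's shuffle dependencies instead of keeping just one dependency's values. A toy sketch with illustrative field names (not Spark's actual classes):

      ```python
      class ShuffleReadMetrics:
          """Toy metrics holder; the field names are illustrative."""
          def __init__(self, remote_bytes_read=0, fetch_wait_time=0):
              self.remote_bytes_read = remote_bytes_read
              self.fetch_wait_time = fetch_wait_time

      def merge_shuffle_metrics(per_dependency):
          """Accumulate metrics over every shuffle dependency, rather
          than overwriting with the last one's (the bug described above)."""
          total = ShuffleReadMetrics()
          for m in per_dependency:
              total.remote_bytes_read += m.remote_bytes_read
              total.fetch_wait_time += m.fetch_wait_time
          return total
      ```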
      
      Thanks ryanra for discovering this problem!
      
      cc rxin andrewor14
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1476 from kayousterhout/join_bug and squashes the following commits:
      
      0203a16 [Kay Ousterhout] Fix broken unit tests.
      f463c2e [Kay Ousterhout] [SPARK-2571] Correctly report shuffle read metrics.
    • C
      [SPARK-2540] [SQL] Add HiveDecimal & HiveVarchar support in unwrapping data · 7f172081
      Committed by Cheng Hao
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1436 from chenghao-intel/unwrapdata and squashes the following commits:
      
      34cc21a [Cheng Hao] update the table scan accodringly since the unwrapData function changed
      afc39da [Cheng Hao] Polish the code
      39d6475 [Cheng Hao] Add HiveDecimal & HiveVarchar support in unwrap data
    • T
      [SPARK-2535][SQL] Add StringComparison case to NullPropagation. · 3a1709fa
      Committed by Takuya UESHIN
      `StringComparison` expressions including `null` literal cases could be added to `NullPropagation`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1451 from ueshin/issues/SPARK-2535 and squashes the following commits:
      
      e99c237 [Takuya UESHIN] Add some tests.
      8f9b984 [Takuya UESHIN] Add StringComparison case to NullPropagation.
    • M
      [MLlib] SPARK-1536: multiclass classification support for decision tree · d88f6be4
      Committed by Manish Amde
      The ability to perform multiclass classification is a big advantage of using decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using the WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost, which require instances to be weighted.
      
      It handles the special case where the categorical variables cannot be ordered for multiclass classification, and thus the optimizations used for speeding up binary classification cannot be directly reused for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyzes all the ```2^(m-1) - 1``` categorical splits, provided that the number of splits is less than the maxBins provided in the input. This condition will not be met for features with a large number of categories -- using decision trees is not recommended for such datasets in general, since categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithm, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints.
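
      For illustration, the ```2^(m-1) - 1``` distinct binary partitions of m unordered categories can be enumerated by fixing one category on the left side, since each subset and its complement describe the same split (a generic sketch, not MLlib's code):

      ```python
      from itertools import combinations

      def categorical_splits(categories):
          """Enumerate the 2^(m-1) - 1 distinct binary partitions of m
          unordered categories. Fixing the first category on the left
          halves the enumeration and drops the empty/full partitions."""
          m = len(categories)
          first, rest = categories[0], categories[1:]
          splits = []
          for k in range(m - 1):  # left side sizes 1 .. m-1
              for combo in combinations(rest, k):
                  splits.append({first, *combo})
          return splits
      ```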
      
      The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.
      
      cc: mengxr, etrain, hirakendu, atalwalkar, srowen
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #886 from manishamde/multiclass and squashes the following commits:
      
      26f8acc [Manish Amde] another attempt at fixing mima
      c5b2d04 [Manish Amde] more MIMA fixes
      1ce7212 [Manish Amde] change problem filter for mima
      10fdd82 [Manish Amde] fixing MIMA excludes
      e1c970d [Manish Amde] merged master
      abf2901 [Manish Amde] adding classes to MimaExcludes.scala
      45e767a [Manish Amde] adding developer api annotation for overriden methods
      c8428c4 [Manish Amde] fixing weird multiline bug
      afced16 [Manish Amde] removed label weights support
      2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
      4e85f2c [Manish Amde] minor: fixed scalastyle issues
      b2ae41f [Manish Amde] minor: scalastyle
      e4c1321 [Manish Amde] using while loop for regression histograms
      d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
      0fecd38 [Manish Amde] minor: add newline to EOF
      2061cf5 [Manish Amde] merged from master
      06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
      9cc3e31 [Manish Amde] added implicit conversion import
      5c1b2ca [Manish Amde] doc for PointConverter class
      485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
      3d7f911 [Manish Amde] updated doc
      8e44ab8 [Manish Amde] updated doc
      adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
      e3e8843 [Manish Amde] minor code formatting
      23d4268 [Manish Amde] minor: another minor code style
      34ee7b9 [Manish Amde] minor: code style
      237762d [Manish Amde] renaming functions
      12e6d0a [Manish Amde] minor: removing line in doc
      9a90c93 [Manish Amde] Merge branch 'master' into multiclass
      1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
      f5f6b83 [Manish Amde] multiclass for continous variables
      8cfd3b6 [Manish Amde] working for categorical multiclass classification
      828ff16 [Manish Amde] added categorical variable test
      bce835f [Manish Amde] code cleanup
      7e5f08c [Manish Amde] minor doc
      1dd2735 [Manish Amde] bin search logic for multiclass
      f16a9bb [Manish Amde] fixing while loop
      d811425 [Manish Amde] multiclass bin aggregate logic
      ab5cb21 [Manish Amde] multiclass logic
      d8e4a11 [Manish Amde] sample weights
      ed5a2df [Manish Amde] fixed classification requirements
      d012be7 [Manish Amde] fixed while loop
      18d2835 [Manish Amde] changing default values for num classes
      6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
      75f2bfc [Manish Amde] minor code style fix
      e547151 [Manish Amde] minor modifications
      34549d0 [Manish Amde] fixing error during merge
      098e8c5 [Manish Amde] merged master
      e006f9d [Manish Amde] changing variable names
      5c78e1a [Manish Amde] added multiclass support
      6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
      46e06ee [Manish Amde] minor mods
      3f85a17 [Manish Amde] tests for multiclass classification
      4d5f70c [Manish Amde] added multiclass support for find splits bins
      46f909c [Manish Amde] todo for multiclass support
      455bea9 [Manish Amde] fixed tests
      14aea48 [Manish Amde] changing instance format to weighted labeled point
      a1a6e09 [Manish Amde] added weighted point class
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
    • R
      Reservoir sampling implementation. · 586e716e
      Committed by Reynold Xin
      This is going to be used in https://issues.apache.org/jira/browse/SPARK-2568
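
      For reference, the classic one-pass reservoir sampling scheme (Algorithm R) looks like this; the sketch is generic Python, not Spark's Scala utility:

      ```python
      import random

      def reservoir_sample(iterable, k, seed=None):
          """Uniform k-sample from a stream of unknown length in one
          pass and O(k) memory: keep the first k items, then replace a
          random reservoir slot with decreasing probability k/(i+1)."""
          rng = random.Random(seed)
          reservoir = []
          for i, item in enumerate(iterable):
              if i < k:
                  reservoir.append(item)
              else:
                  j = rng.randint(0, i)  # inclusive upper bound
                  if j < k:
                      reservoir[j] = item
          return reservoir
      ```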
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1478 from rxin/reservoirSample and squashes the following commits:
      
      17bcbf3 [Reynold Xin] Added seed.
      badf20d [Reynold Xin] Renamed the method.
      6940010 [Reynold Xin] Reservoir sampling implementation.
    • B
      Added t2 instance types · 7f87ab98
      Committed by Basit Mustafa
      New t2 instance types require HVM AMIs; the previous bailout assumption of PVM caused failures when using t2 instance types.
      
      Author: Basit Mustafa <basitmustafa@computes-things-for-basit.local>
      
      Closes #1446 from 24601/master and squashes the following commits:
      
      01fe128 [Basit Mustafa] Makin' it pretty
      392a95e [Basit Mustafa] Added t2 instance types
  6. July 18, 2014 (5 commits)
    • S
      SPARK-2553. Fix compile error · 30b8d369
      Committed by Sandy Ryza
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1479 from sryza/sandy-spark-2553 and squashes the following commits:
      
      2cb5ed8 [Sandy Ryza] SPARK-2553. Fix compile error
    • S
      SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency... · e52b8719
      Committed by Sandy Ryza
      ... per key
      
      My humble opinion is that avoiding allocations in this performance-critical section is worth the extra code.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1461 from sryza/sandy-spark-2553 and squashes the following commits:
      
      7eaf7f2 [Sandy Ryza] SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key
      e52b8719
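      To illustrate the kind of allocation being avoided, here is a hypothetical sketch (not the actual CoGroupedRDD code, and with assumed names): each key maps to one fixed-size list of buffers, one buffer per dependency, so no per-key, per-dependency pair object is built during grouping:

      ```java
      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Hypothetical sketch of cogrouping without a tuple allocation per key
      // per dependency: values are appended into per-dependency buffers.
      public class CoGroupSketch {
          public static Map<String, List<List<String>>> cogroup(List<List<String[]>> deps) {
              Map<String, List<List<String>>> grouped = new HashMap<>();
              int numDeps = deps.size();
              for (int d = 0; d < numDeps; d++) {
                  for (String[] kv : deps.get(d)) {
                      // One map lookup per record; no (buffers, depIndex) pair is created.
                      List<List<String>> bufs = grouped.computeIfAbsent(kv[0], key -> {
                          List<List<String>> fresh = new ArrayList<>();
                          for (int i = 0; i < numDeps; i++) fresh.add(new ArrayList<>());
                          return fresh;
                      });
                      bufs.get(d).add(kv[1]);
                  }
              }
              return grouped;
          }
      }
      ```
      
      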
    • C
      [SPARK-2570] [SQL] Fix the bug of ClassCastException · 29809a6d
      Cheng Hao authored
      An exception is thrown when running the HiveFromSpark example:
      Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
      	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
      	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
      	at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45)
      	at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits:
      
      d4c0500 [Cheng Hao] Fix the bug of ClassCastException
      29809a6d
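      The failure mode is plain Java boxing: a value boxed as java.lang.Long can never be cast to java.lang.Integer, so a row accessor must match the column's actual runtime type. A minimal reproduction, independent of Spark (class and method names are assumptions):

      ```java
      // Minimal reproduction of the boxing mismatch behind the stack trace above.
      public class BoxingMismatch {
          // Returns true iff casting the boxed value to Integer throws
          // ClassCastException, mirroring GenericRow.getInt on a Long column.
          public static boolean intCastThrows(Object boxed) {
              try {
                  Integer i = (Integer) boxed;
                  return false;
              } catch (ClassCastException e) {
                  // e.g. "java.lang.Long cannot be cast to java.lang.Integer"
                  return true;
              }
          }
      }
      ```

      Casting to the type the value was actually boxed as (here, Long) succeeds; the fix in the commit above amounts to using the accessor that matches the column type.
      
      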
    • A
      [SPARK-2411] Add a history-not-found page to standalone Master · 6afca2d1
      Andrew Or authored
      **Problem.** Right now, if you click on an application after it has finished, the page simply refreshes if there are no event logs for the application. This is not intuitive, especially because event logging is not enabled by default. We should direct users to enable it if they attempt to view a SparkUI after the fact without event logs.
      
      **Fix.** The new page conveys different messages in each of the following scenarios:
      (1) Application did not enable event logging,
      (2) Event logs are not found in the specified directory, and
      (3) Exception is thrown while replaying the logs
      
      Here are screenshots of what the page looks like in each of the above scenarios:
      
      (1)
      <img src="https://issues.apache.org/jira/secure/attachment/12656204/Event%20logging%20not%20enabled.png" width="75%">
      
      (2)
      <img src="https://issues.apache.org/jira/secure/attachment/12656203/Application%20history%20not%20found.png">
      
      (3)
      <img src="https://issues.apache.org/jira/secure/attachment/12656202/Application%20history%20load%20error.png" width="95%">
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1336 from andrewor14/master-link and squashes the following commits:
      
      2f06206 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
      97cddc0 [Andrew Or] Add different severity levels
      832b687 [Andrew Or] Mention spark.eventLog.dir in error message
      51980c3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
      ded208c [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
      89d6405 [Andrew Or] Reword message
      e7df7ed [Andrew Or] Add a history not found page to standalone Master
      6afca2d1
    • R
      [SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener · 72e9021e
      Reynold Xin authored
      This should reduce memory usage for the web UI and slightly speed up draining of the UI event queue.
      
      @andrewor14
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1262 from rxin/ui-consolidate-hashtables and squashes the following commits:
      
      1ac3f97 [Reynold Xin] Oops. Properly handle description.
      f5736ad [Reynold Xin] Code review comments.
      b8828dc [Reynold Xin] Merge branch 'master' into ui-consolidate-hashtables
      7a7b6c4 [Reynold Xin] Revert css change.
      f959bb8 [Reynold Xin] [SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener to speed it up.
      63256f5 [Reynold Xin] [SPARK-2320] Reduce <pre> block font size.
      72e9021e
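      The consolidation pattern can be sketched as follows. All names here are hypothetical (the real JobProgressListener differs): rather than several parallel HashMaps each keyed by stage ID, keep one map from stage ID to a single mutable per-stage record, so each event does one lookup and the per-entry HashMap overhead is paid once per stage instead of once per map:

      ```java
      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical sketch of the consolidation: one map from stage ID to a
      // mutable per-stage record, replacing several parallel stageIdTo* maps.
      public class ProgressListenerSketch {
          public static class StageUIData {
              public long executorRunTime = 0;
              public int numCompleteTasks = 0;
              public int numFailedTasks = 0;
          }

          private final Map<Integer, StageUIData> stageIdToData = new HashMap<>();

          public void onTaskEnd(int stageId, long runTime, boolean succeeded) {
              // A single lookup updates every per-stage counter in place.
              StageUIData data = stageIdToData.computeIfAbsent(stageId, id -> new StageUIData());
              data.executorRunTime += runTime;
              if (succeeded) data.numCompleteTasks++; else data.numFailedTasks++;
          }

          public StageUIData get(int stageId) {
              return stageIdToData.get(stageId);
          }
      }
      ```
      
      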