1. 24 July 2014, 14 commits
    • [SPARK-2661][Bagel] Unpersist old processed RDD · 42dfab7d
      Committed by Daoyuan
      Unpersist the no-longer-needed RDD during Bagel iterations to make full use of memory.
      
      Author: Daoyuan <daoyuan.wang@intel.com>
      
      Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:
      
      182c9dd [Daoyuan] rename var nextUseless to lastRDD
      87fd3a4 [Daoyuan] bagel unpersist old processed rdd
      42dfab7d
    • SPARK-2310. Support arbitrary Spark properties on the command line with ... · e34922a2
      Committed by Sandy Ryza
      ...spark-submit
      
      The PR allows invocations like
        spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:
      
      1dc9855 [Sandy Ryza] More doc and cleanup
      00edfb9 [Sandy Ryza] Review comments
      91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
      8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
      e34922a2
    • [SPARK-2658][SQL] Add rule for true = 1. · 78d18fdb
      Committed by Michael Armbrust
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
      
      ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
      78d18fdb
    • SPARK-2662: Fix NPE for JsonProtocol · 9e7725c8
      Committed by GuoQiang Li
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1511 from witgo/JsonProtocol and squashes the following commits:
      
      2b6227f [GuoQiang Li] Fix NPE for JsonProtocol
      9e7725c8
    • Replace RoutingTableMessage with pair · 2d25e348
      Committed by Ankur Dave
      RoutingTableMessage was used to construct routing tables to enable
      joining VertexRDDs with partitioned edges. It stored three elements: the
      destination vertex ID, the source edge partition, and a byte specifying
      the position in which the edge partition referenced the vertex to enable
      join elimination.
      
      However, this was incompatible with sort-based shuffle (SPARK-2045). It
      was also slightly wasteful, because partition IDs are usually much
      smaller than 2^32, though this was mitigated by a custom serializer that
      used variable-length encoding.
      
      This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
      where the Int encodes both the source partition ID (in the lower 30
      bits) and the position (in the top 2 bits).
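      The 30-bit/2-bit split can be sketched in a few lines. This is an illustrative Python sketch of the packing scheme, not GraphX's actual code; the helper names are made up:

      ```python
      POSITION_BITS = 2
      PARTITION_BITS = 30  # partition IDs must fit in 30 bits

      def pack(partition_id: int, position: int) -> int:
          """Pack a source partition ID (lower 30 bits) and a 2-bit
          position flag (top 2 bits) into one 32-bit integer."""
          assert 0 <= partition_id < (1 << PARTITION_BITS)
          assert 0 <= position < (1 << POSITION_BITS)
          return (position << PARTITION_BITS) | partition_id

      def unpack(packed: int):
          """Recover (partition_id, position) from the packed integer."""
          return packed & ((1 << PARTITION_BITS) - 1), packed >> PARTITION_BITS

      assert unpack(pack(12345, 3)) == (12345, 3)
      ```

      Thirty bits are enough here precisely because, as noted above, partition IDs are usually much smaller than 2^32.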
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:
      
      697e17b [Ankur Dave] Replace RoutingTableMessage with pair
      2d25e348
    • [SPARK-2484][SQL] Build should not run hivecompatibility tests by default. · 60f0ae3d
      Committed by witgo
      Author: witgo <witgo@qq.com>
      
      Closes #1403 from witgo/hive_compatibility and squashes the following commits:
      
      4e5ecdb [witgo] The default does not run hive compatibility tests
      60f0ae3d
    • [SPARK-2549] Functions defined inside of other functions trigger failures · 9b763329
      Committed by Prashant Sharma
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:
      
      9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
      bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures
      9b763329
    • [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a... · efdaeb11
      Committed by Ian O Connell
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
      
      Author: Ian O Connell <ioconnell@twitter.com>
      
      Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
      
      5498566 [Ian O Connell] Docs update suggested by Patrick
      20e8555 [Ian O Connell] Slight style change
      f92c294 [Ian O Connell] Add docs for new KryoSerializer option
      f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
      4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
      665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
      efdaeb11
    • [SPARK-2569][SQL] Fix shipping of TEMPORARY hive UDFs. · 1871574a
      Committed by Michael Armbrust
      Instead of shipping just the name and then looking up the info on the workers, we now ship the whole class name. Also, since the file was getting pretty large, I refactored it to move the type conversion code into its own file.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:
      
      b695904 [Michael Armbrust] Make add jar execute with Hive.  Ship the whole function class name since sometimes we cannot lookup temporary functions on the workers.
      1871574a
    • SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that... · e060d3ee
      Committed by William Benton
      SPARK-2226:  [SQL] transform HAVING clauses with aggregate expressions that aren't in the aggregation list
      
      This change adds an analyzer rule to
        1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
        2. push these expressions down to the underlying aggregates, and then
        3. project them away above the filter.
      
      It also enables the `HAVING` queries in the Hive compatibility suite.
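      The rewrite can be illustrated with plain SQL, using Python's built-in sqlite3 purely to show the semantics (the table and column names are made up; this is not Spark code):

      ```python
      import sqlite3

      # The rule computes the HAVING aggregate alongside the grouping,
      # filters on it, then projects it away from the final output.
      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
      conn.executemany("INSERT INTO t VALUES (?, ?)",
                       [("a", 5), ("a", 7), ("b", 1), ("b", 2)])

      # Original query: sum(v) appears only in the HAVING clause.
      original = conn.execute(
          "SELECT k FROM t GROUP BY k HAVING sum(v) > 10").fetchall()

      # Equivalent rewritten plan: aggregate -> filter -> project away sum(v).
      rewritten = conn.execute(
          "SELECT k FROM (SELECT k, sum(v) AS s FROM t GROUP BY k) WHERE s > 10"
      ).fetchall()

      assert original == rewritten == [("a",)]
      ```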
      
      Author: William Benton <willb@redhat.com>
      
      Closes #1497 from willb/spark-2226 and squashes the following commits:
      
      92c9a93 [William Benton] Removed unnecessary import
      f1d4f34 [William Benton] Cleanups missed in prior commit
      0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
      541d4ee [William Benton] Cleanups from review
      5a12647 [William Benton] Explanatory comments and stylistic cleanups.
      c7f2b2c [William Benton] Whitelist HAVING queries.
      29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
      e060d3ee
    • SPARK-2277: clear host->rack info properly · 91903e0a
      Committed by Rui Li
      Hi mridulm, I just thought of this issue with [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't interfere with test cases others may add later.
      Really sorry about this.
      
      Author: Rui Li <rui.li@intel.com>
      
      Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:
      
      f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly
      91903e0a
    • [SPARK-2588][SQL] Add some more DSLs. · 1b790cf7
      Committed by Takuya UESHIN
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:
      
      43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
      1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
      2310bf1 [Takuya UESHIN] Add some more DSLs.
      1b790cf7
    • [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be... · f776bc98
      Committed by woshilaiceshide
      [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
      
      Make Spark's "local[N]" better.
      In our company, we use "local[N]" in production. It works excellently. It's our best choice.
      
      Author: woshilaiceshide <woshilaiceshide@qq.com>
      
      Closes #1544 from woshilaiceshide/localX and squashes the following commits:
      
      6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
      f776bc98
    • [SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap · 25921110
      Committed by Andrew Or
      It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:
      
      ```
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
      ```
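      The idea of the fix can be sketched with Python's logging module (Spark itself uses log4j; this only illustrates tagging each record with the emitting thread so interleaved spill lines become attributable):

      ```python
      import io
      import logging
      import threading

      # Route records through a formatter that prefixes the thread name.
      stream = io.StringIO()
      handler = logging.StreamHandler(stream)
      handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
      log = logging.getLogger("spill-demo")
      log.addHandler(handler)
      log.setLevel(logging.INFO)

      def spill(times):
          log.info("Spilling in-memory map of 4 MB to disk (%d times so far)", times)

      t = threading.Thread(target=spill, args=(198,),
                           name="Executor task launch worker-1")
      t.start()
      t.join()

      # Each line now begins with the emitting thread's name.
      print(stream.getvalue().strip())
      ```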
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1517 from andrewor14/external-log and squashes the following commits:
      
      90e48bb [Andrew Or] Log thread ID when spilling
      25921110
  2. 23 July 2014, 8 commits
    • [SPARK-2617] Correct doc and usages of preservesPartitioning · 4c7243e1
      Committed by Xiangrui Meng
      The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the docs and fix wrong usages.
      
      This PR
      
      1. adds notes in `mapPartitions*`,
      2. makes `RDD.sample` preserve partitioner,
      3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
      4. fixes some wrong usages in MLlib.
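      The partitioner-preservation distinction can be sketched without Spark (plain Python; the names are made up for illustration):

      ```python
      import zlib

      # A toy hash partitioner: deterministic stand-in for Spark's.
      NUM_PARTITIONS = 4

      def partition_of(key):
          return zlib.crc32(repr(key).encode()) % NUM_PARTITIONS

      data = [("k%d" % i, i) for i in range(8)]
      placement = {k: partition_of(k) for k, _ in data}

      # mapValues-style transform: keys untouched, so the old placement
      # still describes every record -- the partitioner is preserved.
      doubled = [(k, v * 2) for k, v in data]
      assert all(partition_of(k) == placement[k] for k, _ in doubled)

      # zip-style transform: each result record is keyed by a whole (k, v)
      # element of the first dataset, not by k, so partition_of(element)
      # generally differs from partition_of(k) and the old partitioner can
      # no longer be trusted -- hence preservesPartitioning = false for zip.
      zipped = [((k, v), (k, v)) for k, v in data]
      ```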
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:
      
      b361e65 [Xiangrui Meng] update doc based on pwendell's comments
      3b1ba19 [Xiangrui Meng] update doc
      357575c [Xiangrui Meng] fix unit test
      20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
      d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
      4c7243e1
    • Remove GraphX MessageToPartition for compatibility with sort-based shuffle · 6c2be93f
      Committed by Ankur Dave
      MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:
      
      f9d0054 [Ankur Dave] Remove MessageToPartition
      ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
      6c2be93f
    • [YARN] SPARK-2577: File upload to viewfs is broken due to mount point re... · 02e45729
      Committed by Gera Shegalov
      Opting for option 2 as defined in SPARK-2577, i.e., retrieving and passing the correct file system object to addResource.
      
      Author: Gera Shegalov <gera@twitter.com>
      
      Closes #1483 from gerashegalov/master and squashes the following commits:
      
      90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution
      02e45729
    • [YARN][SPARK-2606]: In some cases, the Spark UI pages display incorrectly · ddadf1b0
      Committed by GuoQiang Li
      The issue is caused by #1112.
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1501 from witgo/webui_style and squashes the following commits:
      
      4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI
      ddadf1b0
    • GraphX example · 5f7b9916
      Committed by CrazyJvm
      fix examples
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:
      
      663457a [CrazyJvm] outDegrees does not take parameters
      7cfff1d [CrazyJvm] fix example for joinVertices
      5f7b9916
    • [SPARK-2615] [SQL] Add Equal Sign "==" Support for HiveQl · 79fe7634
      Committed by Cheng Hao
      Currently, "==" in a HiveQL expression causes an exception to be thrown; this patch fixes it.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1522 from chenghao-intel/equal and squashes the following commits:
      
      f62a0ff [Cheng Hao] Add == Support for HiveQl
      79fe7634
    • SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage · 85d3596e
      Committed by Aaron Davidson
      ### Why and what?
      Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which carry a nontrivial overhead.
      
      This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way that introduces at most one virtual function invocation of overhead at each abstraction point.
      
      Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl
      
      ### Memory implications
      An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
      
      Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
      This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).
      
      The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
      This results in a worst-case sorting overhead of 4N bytes.
      
      Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
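      The accounting above works out as follows, as a back-of-the-envelope check using the per-object sizes quoted in the description:

      ```python
      # Worst-case sorting overhead, per the figures above
      # (4-byte compressed oops, 24-byte Tuple2 objects).
      N = 25_000_000                 # kv pairs, as in the microbenchmark

      tuple_objects   = 24 * N       # N Tuple2 allocations at 24 bytes each
      java7_sort_copy = 2 * N        # half of the 4N-byte Tuple2-reference array
      old_overhead    = tuple_objects + java7_sort_copy   # 26N bytes total

      # New Sorter: no tuples; Timsort may copy half of the 2N-element,
      # 8N-byte kv array in the worst case.
      new_overhead = 4 * N

      print((old_overhead - new_overhead) // N)  # 22 bytes saved per element
      ```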
      
      ### Performance implications
      As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than to improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.
      
      Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".
      
      <table>
      <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
      <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
      <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
      <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
      <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
      </table>
      
      The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.
      
      In short, this patch should significantly improve performance for users running either Java 6 or 7.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1502 from aarondav/sort and squashes the following commits:
      
      652d936 [Aaron Davidson] Update license, move Sorter to java src
      a7b5b1c [Aaron Davidson] fix licenses
      5c0efaf [Aaron Davidson] Update tmpLength
      ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
      034bf10 [Aaron Davidson] Change to Apache v2 Timsort
      b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
      6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
      85d3596e
    • [MLLIB] make Mima ignore updateFeatures (private) in ALS · 14078717
      Committed by Xiangrui Meng
      Fix Mima issues in #1521.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1533 from mengxr/mima-als and squashes the following commits:
      
      78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS
      14078717
  3. 22 July 2014, 9 commits
    • [SPARK-2612] [mllib] Fix data skew in ALS · 75db1742
      Committed by peng.zhang
      Author: peng.zhang <peng.zhang@xiaomi.com>
      
      Closes #1521 from renozhang/fix-als and squashes the following commits:
      
      b5727a4 [peng.zhang] Remove no need argument
      1a4f7a0 [peng.zhang] Fix data skew in ALS
      75db1742
    • [SPARK-2452] Create a new valid for each instead of using lineId. · 81fec992
      Committed by Prashant Sharma
      Author: Prashant Sharma <prashant@apache.org>
      
      Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits:
      
      26c5c72 [Prashant Sharma] Added a test case.
      7e8d28d [Prashant Sharma] SPARK-2452, create a new valid for each  instead of using lineId, because Line ids can be same sometimes.
      81fec992
    • [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Committed by Nicholas Chammas
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
    • [SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear · c3462c65
      Committed by Gregory Owen
      Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly
      
      New output:
      
      ```
      (3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
       |  MappedValuesRDD[324] at apply at Transformer.scala:22
       |  CoGroupedRDD[323] at apply at Transformer.scala:22
       +-(5) MappedRDD[320] at apply at Transformer.scala:22
       |  |  MappedRDD[319] at apply at Transformer.scala:22
       |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
       |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
       |  |  ShuffledRDD[316] at apply at Transformer.scala:22
       |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
       |     |   ParallelCollectionRDD[314] at apply at Transformer.scala:22
       +-(100) MappedRDD[322] at apply at Transformer.scala:22
           |   ParallelCollectionRDD[321] at apply at Transformer.scala:22
      ```
      
      Author: Gregory Owen <greowen@gmail.com>
      
      Closes #1364 from GregOwen/to-debug-string and squashes the following commits:
      
      08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
      1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
      c3462c65
    • [SPARK-2561][SQL] Fix apply schema · 511a7314
      Committed by Michael Armbrust
      We need to use the analyzed attributes otherwise we end up with a tree that will never resolve.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
      
      f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
      4969015 [Michael Armbrust] Add test case.
      511a7314
    • [SPARK-2434][MLlib]: Warning messages that point users to original MLlib... · a4d60208
      Committed by Burak
      [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples
      
      [SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms added both in the comments and the code. The following examples have been modified:
      Scala:
      * LocalALS
      * LocalFileLR
      * LocalKMeans
       * LocalLR
      * SparkALS
      * SparkHdfsLR
      * SparkKMeans
      * SparkLR
      Python:
       * kmeans.py
       * als.py
       * logistic_regression.py
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:
      
      7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
      b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
      4762f39 [Burak] [SPARK-2434]: Warning messages added
      17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
      2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementaions added.
      a4d60208
    • Fix flakey HiveQuerySuite test · abeacffb
      Committed by Aaron Davidson
      Results may not be returned in the expected order, so relax that constraint.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1514 from aarondav/flakey and squashes the following commits:
      
      e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
      abeacffb
    • [SPARK-2494] [PySpark] make hash of None consistent across machines · 872538c6
      Committed by Davies Liu
      In CPython, the hash of None differs across machines, which can cause wrong results during shuffle. This PR fixes that.
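      A minimal sketch of the idea (illustrative only, not PySpark's exact code; the function name is made up): give None a fixed, machine-independent hash so records keyed by None always land in the same partition on every worker.

      ```python
      def portable_hash(x):
          # In older CPython versions, hash(None) was derived from the
          # object's address, so it varied between interpreter processes.
          # Pinning it makes partition placement reproducible everywhere.
          if x is None:
              return 0
          return hash(x)

      NUM_PARTITIONS = 8

      def partition_for(key):
          return portable_hash(key) % NUM_PARTITIONS

      # None now maps to the same partition in every interpreter process.
      assert partition_for(None) == 0
      ```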
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1371 from davies/hash_of_none and squashes the following commits:
      
      d01745f [Davies Liu] add comments, remove outdated unit tests
      5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
      b7118aa [Davies Liu] use __builtin__ instead of __builtins__
      839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
      872538c6
    • SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler · f89cf65d
      Committed by Sandy Ryza
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #634 from sryza/sandy-spark-1707 and squashes the following commits:
      
      2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN
      354c630 [Sandy Ryza] Remove outdated comments
      c744ef3 [Sandy Ryza] Take out waitForInitialAllocations
      2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler
      f89cf65d
  4. 21 July 2014, 7 commits
    • [SPARK-2190][SQL] Specialized ColumnType for Timestamp · cd273a23
      Committed by Cheng Lian
      JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190)
      
      Added specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1440 from liancheng/timestamp-column-type and squashes the following commits:
      
      e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests.
      53a358f [Cheng Lian] Fixed failed test suites
      01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue
      2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests
      45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation
      cd273a23
    • [SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0 · db56f2df
      Committed by Michael Giannakopoulos
      Standalone application examples written in Java are added to the 'mllib-linear-methods.md' file.
      This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945).
      Also, I changed the name of the sigmoid function from 'logit' to 'f', because the logit function
      is the inverse of the sigmoid.
      
      Thanks,
      Michael
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1311 from miccagiann/master and squashes the following commits:
      
      8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
      f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible.
      cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards.
      b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
      837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
      db56f2df
    • Improve scheduler delay tooltip. · f6e7302c
      Committed by Kay Ousterhout
      As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1488 from kayousterhout/better_tooltips and squashes the following commits:
      
      22176fd [Kay Ousterhout] Improve scheduler delay tooltip.
      f6e7302c
    • [SPARK-2552][MLLIB] stabilize logistic function in pyspark · b86db517
      Committed by Xiangrui Meng
      to avoid overflow in `exp(x)` if `x` is large.
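      The standard rearrangement this kind of fix applies (the exact PySpark code may differ) is to only ever exponentiate a non-positive value:

      ```python
      import math

      def naive_logistic(x):
          # Overflows for large negative x: math.exp(1000.0) raises.
          return 1.0 / (1.0 + math.exp(-x))

      def stable_logistic(x):
          # Branch so that exp() only ever sees a non-positive argument,
          # which can underflow to 0.0 but never overflow.
          if x >= 0:
              return 1.0 / (1.0 + math.exp(-x))
          z = math.exp(x)
          return z / (1.0 + z)

      try:
          naive_logistic(-1000.0)
      except OverflowError:
          pass  # the failure mode the commit avoids

      assert stable_logistic(-1000.0) == 0.0
      assert stable_logistic(0.0) == 0.5
      ```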
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1493 from mengxr/py-logistic and squashes the following commits:
      
      259e863 [Xiangrui Meng] stabilize logistic function in pyspark
      b86db517
    • SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant · 9564f854
      Committed by Sandy Ryza
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1474 from sryza/sandy-spark-2564 and squashes the following commits:
      
      35b8388 [Sandy Ryza] Fix compile error on upmerge
      7b985fb [Sandy Ryza] Fix test compile error
      43f79e6 [Sandy Ryza] SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant
      9564f854
    • [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors · 1b10b811
      Committed by Xiangrui Meng
      This is part of SPARK-2495, to allow users to construct linear models manually.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1492 from mengxr/public-constructor and squashes the following commits:
      
      a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
      1b10b811
    • [SPARK-2598] RangePartitioner's binary search does not use the given Ordering · fa51b0fb
      Committed by Reynold Xin
      We should fix this in branch-1.0 as well.
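      A sketch of the bug class (not Spark's code; names are illustrative): a binary search over range bounds must use the caller-supplied ordering, not the values' natural "<", or it returns wrong partitions whenever the two disagree.

      ```python
      def binary_search(bounds, key, lt):
          """Index of the first bound not less-than key, under ordering `lt`.
          `bounds` must be sorted under the same `lt`."""
          lo, hi = 0, len(bounds)
          while lo < hi:
              mid = (lo + hi) // 2
              if lt(bounds[mid], key):
                  lo = mid + 1
              else:
                  hi = mid
          return lo

      # Descending ordering: "less than" means numerically greater.
      desc = lambda a, b: a > b
      bounds = [9, 7, 5, 3, 1]   # range bounds, sorted under `desc`

      # Using the given ordering finds the right slot; the natural "<"
      # would misplace the key because `bounds` is not ascending.
      assert binary_search(bounds, 6, desc) == 2   # first bound not > 6 is 5
      ```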
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1500 from rxin/rangePartitioner and squashes the following commits:
      
      c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
      fa51b0fb
  5. 20 July 2014, 2 commits
    • SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section... · 98ab4112
      Committed by Sandy Ryza
      ...s of CoGroupedRDD and PairRDDFunctions
      
      This also removes an unnecessary tuple creation in cogroup.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1447 from sryza/sandy-spark-2519-2 and squashes the following commits:
      
      b6d9699 [Sandy Ryza] Remove missed Tuple2 match in CoGroupedRDD
      a109828 [Sandy Ryza] Remove another pattern matching in MappedValuesRDD and revert some changes in PairRDDFunctions
      be10f8a [Sandy Ryza] SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions
      98ab4112
    • [SPARK-2524] missing document about spark.deploy.retainedDrivers · 4da01e38
      Committed by lianhuiwang
      https://issues.apache.org/jira/browse/SPARK-2524
      The spark.deploy.retainedDrivers configuration is undocumented but actually used:
      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      Author: Wang Lianhui <lianhuiwang09@gmail.com>
      Author: unknown <Administrator@taguswang-PC1.tencent.com>
      
      Closes #1443 from lianhuiwang/SPARK-2524 and squashes the following commits:
      
      64660fd [Wang Lianhui] address pwendell's comments
      5f6bbb7 [Wang Lianhui] missing document about spark.deploy.retainedDrivers
      44a3f50 [unknown] Merge remote-tracking branch 'upstream/master'
      eacf933 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
      8bbfe76 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
      480ce94 [lianhuiwang] address aarondav comments
      f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
      4da01e38