1. Jul 26, 2014 (8 commits)
    • M
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Committed by Michael Armbrust
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail. We should investigate and put this back once it is passing tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
      afd757a2
    • K
      [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI. · 37ad3b72
      Committed by Kay Ousterhout
      Due to problems with when we update runningStages (in DAGScheduler.scala)
      and how we decide to send a SparkListenerStageCompleted message to
      SparkListeners, sometimes stages can be shown as "running" in the UI forever
      (even after they have failed).  This issue can manifest when stages are
      resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable
      tasks. The problem also resulted in a (small) memory leak in the DAGScheduler,
      where stages can stay in runningStages forever. This commit fixes
      that problem and adds a unit test.
      
      Thanks tsudukim for helping to look into this issue!
      
      cc markhamstra rxin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1566 from kayousterhout/dag_fix and squashes the following commits:
      
      217d74b [Kay Ousterhout] [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.
      37ad3b72
    • J
      [SPARK-2125] Add sort flag and move sort into shuffle implementations · 47b6b38c
      Committed by jerryshao
      This patch adds a sort flag to ShuffleDependency and moves sorting into the hash shuffle implementation.
      
      Moving sorting into the shuffle implementation leaves room for other shuffle implementations (like sort-based shuffle) to better optimize sorting as part of the shuffle.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1210 from jerryshao/SPARK-2125 and squashes the following commits:
      
      2feaf7b [jerryshao] revert MimaExcludes
      ceddf75 [jerryshao] add MimaExeclude
      f674ff4 [jerryshao] Add missing Scope restriction
      b9fe0dd [jerryshao] Fix some style issues according to comments
      ef6b729 [jerryshao] Change sort flag into Option
      3f6eeed [jerryshao] Fix issues related to unit test
      2f552a5 [jerryshao] Minor changes about naming and order
      c92a281 [jerryshao] Move sort into shuffle implementations
      47b6b38c
    • B
      [SQL]Update HiveMetastoreCatalog.scala · ab3c6a45
      Committed by baishuo(白硕)
      I think it's better to define hiveQlTable as a val
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #1569 from baishuo/patch-1 and squashes the following commits:
      
      dc2f895 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
      a7b32a2 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
      ab3c6a45
    • Y
      [SPARK-2682] Javadoc generated from Scala source code is not in javadoc's index · a19d8c89
      Committed by Yin Huai
      Add genjavadocSettings back to SparkBuild. It requires #1585 .
      
      https://issues.apache.org/jira/browse/SPARK-2682
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1584 from yhuai/SPARK-2682 and squashes the following commits:
      
      2e89461 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2682
      54e3b66 [Yin Huai] Add genjavadocSettings back.
      a19d8c89
    • C
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Committed by Cheng Lian
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes related to this bug since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      06dc0d2c
    • Y
      [SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java keywords as value names · 32bcf9af
      Committed by Yin Huai
      Renaming `short` to `shortForm` and `long` to `longForm`.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2683
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1585 from yhuai/SPARK-2683 and squashes the following commits:
      
      5ddb843 [Yin Huai] "short" and "long" are Java keywords. In order to generate javadoc, renaming "short" to "shortForm" and "long" to "longForm".
      32bcf9af
    • X
      replace println to log4j · a2715ccd
      Committed by xingsensen
      Our program needs to receive a large amount of data and run for a long
      time.
      We set the log level to WARN, but messages such as "Storing iterator" and
      "received single" were still written to the log file (when running on YARN).
      
      Author: fireflyc <fireflyc@126.com>
      
      Closes #1372 from fireflyc/fix-replace-stdout-log and squashes the following commits:
      
      e684140 [fireflyc] 'info' modified into the 'debug'
      fa22a38 [fireflyc] replace println to log4j
      a2715ccd
  2. Jul 25, 2014 (12 commits)
    • C
      [SPARK-2665] [SQL] Add EqualNS & Unit Tests · 184aa1c6
      Committed by Cheng Hao
      Hive supports the operator "<=>", which returns the same result as the EQUAL (=) operator for non-null operands, but returns TRUE if both are NULL and FALSE if one of them is NULL.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1570 from chenghao-intel/equalns and squashes the following commits:
      
      8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
      5b2ca88 [Cheng Hao] Add cases into whitelist
      8e66cdd [Cheng Hao] Rename the EqualNSTo ==> EqualNullSafe
      7af4b0b [Cheng Hao] Add EqualNS & Unit Tests
      184aa1c6
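The `<=>` semantics described above can be sketched as a small standalone function. This is only an illustration of the operator's truth table, not Spark's Catalyst implementation; `equal_null_safe` is a hypothetical name, and Python's `None` stands in for SQL NULL.

```python
def equal_null_safe(left, right):
    """Null-safe equality, mirroring Hive's <=> operator."""
    if left is None and right is None:
        return True           # both NULL => TRUE
    if left is None or right is None:
        return False          # exactly one NULL => FALSE
    return left == right      # non-null operands => ordinary equality
```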
    • R
      [SPARK-2529] Clean closures in foreach and foreachPartition. · eb82abd8
      Committed by Reynold Xin
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1583 from rxin/closureClean and squashes the following commits:
      
      8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
      eb82abd8
    • M
      SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup · 8529ced3
      Committed by Matei Zaharia
      JIRA: https://issues.apache.org/jira/browse/SPARK-2657
      
      Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
      
      There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1555 from mateiz/compact-groupby and squashes the following commits:
      
      845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
      07621a7 [Matei Zaharia] Review comments
      0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
      bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
      f61f040 [Matei Zaharia] Fix line lengths
      59da88b0 [Matei Zaharia] Fix line lengths
      197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
      775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
      9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
      ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
      10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
      8529ced3
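The space-saving idea behind CompactBuffer can be sketched in a few lines. This toy class is a made-up illustration (Spark's real class is in Scala): it keeps the first two elements in fixed slots and only allocates a backing list once a third element arrives, avoiding per-group array overhead for small groups.

```python
class MiniCompactBuffer:
    """Toy sketch of the CompactBuffer idea (not Spark's implementation)."""
    __slots__ = ("_e0", "_e1", "_rest", "_n")

    def __init__(self):
        self._e0 = None
        self._e1 = None
        self._rest = None   # backing list, allocated lazily
        self._n = 0

    def append(self, x):
        if self._n == 0:
            self._e0 = x
        elif self._n == 1:
            self._e1 = x
        else:
            if self._rest is None:
                self._rest = []   # only now pay for a growable buffer
            self._rest.append(x)
        self._n += 1

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if i < 0 or i >= self._n:
            raise IndexError(i)
        if i == 0:
            return self._e0
        if i == 1:
            return self._e1
        return self._rest[i - 2]
```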
    • D
      [SPARK-2656] Python version of stratified sampling · 2f75a4a3
      Committed by Doris Xin
      Exact sample size is not supported for now.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1554 from dorx/pystratified and squashes the following commits:
      
      4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
      bdc3f8b [Doris Xin] updated unit to check sample holistically
      7713c7b [Doris Xin] Python version of stratified sampling
      2f75a4a3
    • D
      [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Committed by Davies Liu
      During aggregation in a Python worker, if memory usage goes above spark.executor.memory, it performs disk-spilling aggregation.
      
      It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data are aggregated, it merges all the stages together (partition by partition).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
    • P
      [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default · eff9714e
      Committed by Prashant Sharma
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1051 from ScrapCodes/SPARK-2014/pyspark-cache and squashes the following commits:
      
      f192df7 [Prashant Sharma] Code Review
      2a2f43f [Prashant Sharma] [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
      eff9714e
    • T
      [SPARK-2464][Streaming] Fixed Twitter stream stopping bug · a45d5480
      Committed by Tathagata Das
      Stopping the Twitter receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an exception to be thrown to the listener. This exception caused the receiver to be restarted. This patch checks whether the receiver was stopped or not, and restarts on exception accordingly.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #1577 from tdas/twitter-stop and squashes the following commits:
      
      011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
      a45d5480
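The control flow of the fix can be sketched as follows; the class and method names here are hypothetical, purely to illustrate the stopped-check that now guards the restart:

```python
class ReceiverGuard:
    """Toy model of the fix: restart on exception only if not deliberately stopped."""
    def __init__(self):
        self.stopped = False
        self.restarts = 0

    def stop(self):
        # Deliberate shutdown (which itself makes the stream raise an exception).
        self.stopped = True

    def on_stream_exception(self):
        # Before the fix, every exception triggered a restart. Now we check
        # whether the receiver was intentionally stopped first.
        if not self.stopped:
            self.restarts += 1
```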
    • N
      SPARK-2250: show stage RDDs in UI · fec641b8
      Committed by Neville Li
      Author: Neville Li <neville@spotify.com>
      
      Closes #1188 from nevillelyh/neville/ui and squashes the following commits:
      
      d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
      f075db9 [Neville Li] SPARK-2035: show call stack even when description is available
      fec641b8
    • G
      [SPARK-2037]: yarn client mode doesn't support spark.yarn.max.executor.failures · 323a83c5
      Committed by GuoQiang Li
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1180 from witgo/SPARK-2037 and squashes the following commits:
      
      3d52411 [GuoQiang Li] review commit
      7058f4d [GuoQiang Li] Correctly stop SparkContext
      6d0561f [GuoQiang Li] Fix: yarn client mode doesn't support spark.yarn.max.executor.failures
      323a83c5
    • X
      [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests · c960b505
      Committed by Xiangrui Meng
      Allow small errors in comparison.
      
      @dbtsai , this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:
      
      5076a7f [Xiangrui Meng] fix binary metrics unit tests
      c960b505
    • Y
      [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java... · b352ef17
      Committed by Yin Huai
      [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala
      
      In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from the Java Map/List and then load them into a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.
      
      I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
      ```scala
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext._
      
      val jsonData = sc.textFile("...")
      jsonData.cache.count
      
      val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
      jsonSchemaRDD.registerAsTable("jt")
      
      sqlContext.sql("select count(*) from jt").collect
      ```
      Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
      
      From the result, there was no significant difference on running `jsonRDD`. For the simple aggregation query, results are attached below.
      ```
      Original:
      Run 1: 26.1s
      Run 2: 27.03s
      Run 3: 27.035s
      
      With this change:
      Run 1: 21.086s
      Run 2: 21.035s
      Run 3: 21.029s
      ```
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2603
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
      
      6831b77 [Yin Huai] Fix failed tests.
      09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
      d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
      b352ef17
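The wrapper-versus-copy distinction at the heart of this change has a direct Python analogue: `types.MappingProxyType` plays the role of the Scala wrapper (an O(1) view that stays in sync with the underlying collection), while `dict(...)` behaves like `toMap` (an eager element-by-element copy). This is an illustration of the idea, not the Scala code itself.

```python
from types import MappingProxyType

backing = {"a": 1, "b": 2}

wrapped = MappingProxyType(backing)  # cheap view: no element copying
copied = dict(backing)               # eager copy, analogous to toMap

backing["c"] = 3                     # the view sees this; the copy does not
```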
    • T
      [Build] SPARK-2619: Configurable filemode for the spark/bin folder in debian package · 9fd14147
      Committed by tzolov
      Add a `<deb.bin.filemode>744</deb.bin.filemode>` property to the `assembly/pom.xml` that defaults to `744`.
      Use this property for the ../bin folder's <filemode>.
      
      This patch doesn't change the current default modes but allows one to override them at build time:
      `-Ddeb.bin.filemode=<new mode>`
      
      Author: tzolov <christian.tzolov@gmail.com>
      
      Closes #1531 from tzolov/SPARK-2619 and squashes the following commits:
      
      6d95343 [tzolov] [Build] SPARK-2619: Configurable filemode for the spark/bin folder in the .deb package
      9fd14147
  3. Jul 24, 2014 (15 commits)
    • R
      SPARK-2150: Provide direct link to finished application UI in yarn resou... · 46e224aa
      Committed by Rahul Singhal
      ...rce manager UI
      
      Use the event logger directory to provide a direct link to finished
      application UI in yarn resourcemanager UI.
      
      Author: Rahul Singhal <rahul.singhal@guavus.com>
      
      Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:
      
      95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI
      46e224aa
    • D
      [SPARK-2661][bagel]unpersist old processed rdd · 42dfab7d
      Committed by Daoyuan
      Unpersist the no-longer-needed RDD during Bagel iterations to make full use of memory.
      
      Author: Daoyuan <daoyuan.wang@intel.com>
      
      Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:
      
      182c9dd [Daoyuan] rename var nextUseless to lastRDD
      87fd3a4 [Daoyuan] bagel unpersist old processed rdd
      42dfab7d
    • S
      SPARK-2310. Support arbitrary Spark properties on the command line with ... · e34922a2
      Committed by Sandy Ryza
      ...spark-submit
      
      The PR allows invocations like
        spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:
      
      1dc9855 [Sandy Ryza] More doc and cleanup
      00edfb9 [Sandy Ryza] Review comments
      91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
      8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
      e34922a2
    • M
      [SPARK-2658][SQL] Add rule for true = 1. · 78d18fdb
      Committed by Michael Armbrust
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
      
      ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
      78d18fdb
    • G
      SPARK-2662: Fix NPE for JsonProtocol · 9e7725c8
      Committed by GuoQiang Li
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1511 from witgo/JsonProtocol and squashes the following commits:
      
      2b6227f [GuoQiang Li] Fix NPE for JsonProtocol
      9e7725c8
    • A
      Replace RoutingTableMessage with pair · 2d25e348
      Committed by Ankur Dave
      RoutingTableMessage was used to construct routing tables to enable
      joining VertexRDDs with partitioned edges. It stored three elements: the
      destination vertex ID, the source edge partition, and a byte specifying
      the position in which the edge partition referenced the vertex to enable
      join elimination.
      
      However, this was incompatible with sort-based shuffle (SPARK-2045). It
      was also slightly wasteful, because partition IDs are usually much
      smaller than 2^32, though this was mitigated by a custom serializer that
      used variable-length encoding.
      
      This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
      where the Int encodes both the source partition ID (in the lower 30
      bits) and the position (in the top 2 bits).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:
      
      697e17b [Ankur Dave] Replace RoutingTableMessage with pair
      2d25e348
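The bit layout described in the commit above can be sketched with short pack/unpack helpers (the function names here are mine, not GraphX's): the source partition ID lives in the lower 30 bits of the Int and the 2-bit position in the top 2 bits.

```python
def pack(partition_id, position):
    """Pack a partition ID (< 2^30) and a 2-bit position into one 32-bit int."""
    assert 0 <= partition_id < (1 << 30)
    assert 0 <= position < 4
    return (position << 30) | partition_id

def partition_id_of(packed):
    return packed & ((1 << 30) - 1)   # mask off the lower 30 bits

def position_of(packed):
    return (packed >> 30) & 0x3       # top 2 bits
```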
    • W
      [SPARK-2484][SQL] Build should not run hivecompatibility tests by default. · 60f0ae3d
      Committed by witgo
      Author: witgo <witgo@qq.com>
      
      Closes #1403 from witgo/hive_compatibility and squashes the following commits:
      
      4e5ecdb [witgo] The default does not run hive compatibility tests
      60f0ae3d
    • P
      [SPARK-2549] Functions defined inside of other functions trigger failures · 9b763329
      Committed by Prashant Sharma
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:
      
      9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
      bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures
      9b763329
    • I
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a... · efdaeb11
      Committed by Ian O Connell
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
      
      Author: Ian O Connell <ioconnell@twitter.com>
      
      Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
      
      5498566 [Ian O Connell] Docs update suggested by Patrick
      20e8555 [Ian O Connell] Slight style change
      f92c294 [Ian O Connell] Add docs for new KryoSerializer option
      f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
      4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
      665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
      efdaeb11
    • M
      [SPARK-2569][SQL] Fix shipping of TEMPORARY hive UDFs. · 1871574a
      Committed by Michael Armbrust
      Instead of shipping just the name and then looking up the info on the workers, we now ship the whole class name. Also, since the file was getting pretty large, I refactored it to move the type conversion code out into its own file.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:
      
      b695904 [Michael Armbrust] Make add jar execute with Hive.  Ship the whole function class name since sometimes we cannot lookup temporary functions on the workers.
      1871574a
    • W
      SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that... · e060d3ee
      Committed by William Benton
      SPARK-2226:  [SQL] transform HAVING clauses with aggregate expressions that aren't in the aggregation list
      
      This change adds an analyzer rule to
        1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
        2. push these expressions down to the underlying aggregates, and then
        3. project them away above the filter.
      
      It also enables the `HAVING` queries in the Hive compatibility suite.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #1497 from willb/spark-2226 and squashes the following commits:
      
      92c9a93 [William Benton] Removed unnecessary import
      f1d4f34 [William Benton] Cleanups missed in prior commit
      0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
      541d4ee [William Benton] Cleanups from review
      5a12647 [William Benton] Explanatory comments and stylistic cleanups.
      c7f2b2c [William Benton] Whitelist HAVING queries.
      29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
      e060d3ee
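The three-step rule (push the aggregate expression down, filter on it, project it away) can be modeled with plain collections. This toy evaluates the moral equivalent of `SELECT word FROM t GROUP BY word HAVING count(*) > 1`; it is an illustration of the transformation's effect, not Catalyst code.

```python
from collections import Counter

rows = ["a", "b", "a", "c", "a"]

# 1. push count(*) down into the aggregate (GROUP BY word)
aggregated = Counter(rows)

# 2. apply the HAVING filter against the aggregate column
filtered = {word: c for word, c in aggregated.items() if c > 1}

# 3. project the aggregate column away, keeping only `word`
projected = sorted(filtered)
```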
    • R
      SPARK-2277: clear host->rack info properly · 91903e0a
      Committed by Rui Li
      Hi mridulm, I just thought of this issue with [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't mess up test cases others may add later.
      Really sorry about this.
      
      Author: Rui Li <rui.li@intel.com>
      
      Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:
      
      f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly
      91903e0a
    • T
      [SPARK-2588][SQL] Add some more DSLs. · 1b790cf7
      Committed by Takuya UESHIN
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:
      
      43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
      1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
      2310bf1 [Takuya UESHIN] Add some more DSLs.
      1b790cf7
    • W
      [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be... · f776bc98
      Committed by woshilaiceshide
      [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
      
      Make spark's "local[N]" better.
      In our company, we use "local[N]" in production. It works excellently. It's our best choice.
      
      Author: woshilaiceshide <woshilaiceshide@qq.com>
      
      Closes #1544 from woshilaiceshide/localX and squashes the following commits:
      
      6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
      f776bc98
    • A
      [SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap · 25921110
      Committed by Andrew Or
      It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:
      
      ```
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
      06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
      06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
      06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1517 from andrewor14/external-log and squashes the following commits:
      
      90e48bb [Andrew Or] Log thread ID when spilling
      25921110
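A hedged sketch of the improvement: prefix each spill message with the current thread's ID so that interleaved lines like those above can be grouped by spilling thread. The wording below is illustrative, not Spark's exact log format.

```python
import threading

def spill_message(size_mb, spill_count):
    """Build a spill log line tagged with the current thread's ID."""
    tid = threading.get_ident()
    return (f"Thread {tid} spilling in-memory map of {size_mb} MB "
            f"to disk ({spill_count} times so far)")
```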
  4. Jul 23, 2014 (5 commits)
    • X
      [SPARK-2617] Correct doc and usages of preservesPartitioning · 4c7243e1
      Committed by Xiangrui Meng
      The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, it is already part of the API and we cannot change it. We should be clear in the docs and fix wrong usages.
      
      This PR
      
      1. adds notes in `mapPartitions*`,
      2. makes `RDD.sample` preserve the partitioner,
      3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
      4. fixes some wrong usages in MLlib.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:
      
      b361e65 [Xiangrui Meng] update doc based on pwendell's comments
      3b1ba19 [Xiangrui Meng] update doc
      357575c [Xiangrui Meng] fix unit test
      20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
      d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning; fix wrong usage of preservesPartitioning; make sample preserve partitioning
      4c7243e1
    • A
      Remove GraphX MessageToPartition for compatibility with sort-based shuffle · 6c2be93f
      Committed by Ankur Dave
      MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:
      
      f9d0054 [Ankur Dave] Remove MessageToPartition
      ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
      6c2be93f
    • G
      [YARN] SPARK-2577: File upload to viewfs is broken due to mount point re... · 02e45729
      Committed by Gera Shegalov
      Opting for option 2 defined in SPARK-2577, i.e., retrieve and pass the correct file system object to addResource.
      
      Author: Gera Shegalov <gera@twitter.com>
      
      Closes #1483 from gerashegalov/master and squashes the following commits:
      
      90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution
      02e45729
    • G
      [YARN][SPARK-2606]: In some cases, the Spark UI pages display incorrectly · ddadf1b0
      Committed by GuoQiang Li
      The issue is caused by #1112 .
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1501 from witgo/webui_style and squashes the following commits:
      
      4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI
      ddadf1b0
    • C
      Graphx example · 5f7b9916
      Committed by CrazyJvm
      fix examples
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:
      
      663457a [CrazyJvm] outDegrees does not take parameters
      7cfff1d [CrazyJvm] fix example for joinVertices
      5f7b9916