1. 11 3月, 2016 3 次提交
    • S
      [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality · 19f4ac6d
      Sameer Agarwal 提交于
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.
      
      ## How was this patch tested?
      
      1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
      2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input.
      
      cc nongli
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11594 from sameeragarwal/isnotequal-constraints.
      19f4ac6d
    • B
      [SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys · 235f4ac6
      bomeng 提交于
      The contains() method does not return consistently with get() if the key is deprecated. For example,
      import org.apache.spark.SparkConf
      val conf = new SparkConf()
      conf.set("spark.io.compression.lz4.block.size", "12345")  # display some deprecated warning message
      conf.get("spark.io.compression.lz4.block.size") # return 12345
      conf.get("spark.io.compression.lz4.blockSize") # return 12345
      conf.contains("spark.io.compression.lz4.block.size") # return true
      conf.contains("spark.io.compression.lz4.blockSize") # return false
      
      The fix will make the contains() and get() more consistent.
      
      I've added a test case for this.
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      Unit tests should be sufficient.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #11568 from bomeng/SPARK-13727.
      235f4ac6
    • L
      [SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen plans · d24801ad
      Liang-Chi Hsieh 提交于
      JIRA: https://issues.apache.org/jira/browse/SPARK-13636
      
      ## What changes were proposed in this pull request?
      
      As shown in the wholestage codegen verion of Sort operator, when Sort is top of Exchange (or other operator that produce UnsafeRow), we will create variables from UnsafeRow, than create another UnsafeRow using these variables. We should avoid the unnecessary unpack and pack variables from UnsafeRows.
      
      ## How was this patch tested?
      
      All existing wholestage codegen tests should be passed.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11484 from viirya/direct-consume-unsaferow.
      d24801ad
  2. 10 3月, 2016 25 次提交
    • M
      [SPARK-13758][STREAMING][CORE] enhance exception message to avoid misleading · 74267beb
      mwws 提交于
      We have a recoverable Spark streaming job with checkpoint enabled, it could be executed correctly at first time, but throw following exception when restarted and recovered from checkpoint.
      ```
      org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
       	at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
       	at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
       	at org.apache.spark.rdd.RDD.union(RDD.scala:565)
       	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
       	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
       	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
      ```
      
      According to exception, it shows I invoked transformations and actions in other transformations, but I did not. The real reason is that I used external RDD in DStream operation. External RDD data is not stored in checkpoint, so that during recovering, the initial value of _sc in this RDD is assigned to null and hit above exception. But you can find the error message is misleading, it indicates nothing about the real issue
      Here is the code to reproduce it.
      
      ```scala
      object Repo {
      
        def createContext(ip: String, port: Int, checkpointDirectory: String):StreamingContext = {
      
          println("Creating new context")
          val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
          val ssc = new StreamingContext(sparkConf, Seconds(2))
          ssc.checkpoint(checkpointDirectory)
      
          var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
      
          val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
          words.foreachRDD((rdd: RDD[String]) => {
            val res = rdd.map(word => (word, word.length)).collect()
            println("words: " + res.mkString(", "))
      
            cached = cached.union(rdd)
            cached.checkpoint()
            println("cached words: " + cached.collect.mkString(", "))
          })
          ssc
        }
      
        def main(args: Array[String]) {
      
          val ip = "localhost"
          val port = 9999
          val dir = "/home/maowei/tmp"
      
          val ssc = StreamingContext.getOrCreate(dir,
            () => {
              createContext(ip, port, dir)
            })
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```
      
      Author: mwws <wei.mao@intel.com>
      
      Closes #11595 from mwws/SPARK-MissleadingLog.
      74267beb
    • S
      [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1 · 927e22ef
      Sean Owen 提交于
      ## What changes were proposed in this pull request?
      
      Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
      Supersedes https://github.com/apache/spark/pull/11524
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11631 from srowen/SPARK-13663.
      927e22ef
    • S
      [SPARK-11108][ML] OneHotEncoder should support other numeric types · 9fe38aba
      sethah 提交于
      Adding support for other numeric types:
      
      * Integer
      * Short
      * Long
      * Float
      * Decimal
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9777 from sethah/SPARK-11108.
      9fe38aba
    • D
      [MINOR][SQL] Replace DataFrameWriter.stream() with startStream() in comments. · 9525c563
      Dongjoon Hyun 提交于
      ## What changes were proposed in this pull request?
      
      According to #11627 , this PR replace `DataFrameWriter.stream()` with `startStream()` in comments of `ContinuousQueryListener.java`.
      
      ## How was this patch tested?
      
      Manual. (It changes on comments.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11629 from dongjoon-hyun/minor_rename.
      9525c563
    • J
      [SPARK-13706][ML] Add Python Example for Train Validation Split · 3e3c3d58
      JeremyNixon 提交于
      ## What changes were proposed in this pull request?
      
      This pull request adds a python example for train validation split.
      
      ## How was this patch tested?
      
      This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.
      
      This contribution is my original work and I license it to Spark under its open source license.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11547 from JeremyNixon/tvs_example.
      3e3c3d58
    • P
      [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGeneratorSuite... · 8bcad28a
      proflin 提交于
      [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGeneratorSuite "Do not clear received…
      
      ## How was this patch tested?
      
      unit test
      
      Author: proflin <proflin.me@gmail.com>
      
      Closes #11626 from lw-lin/SPARK-7420.
      8bcad28a
    • R
      [SPARK-13794][SQL] Rename DataFrameWriter.stream() DataFrameWriter.startStream() · 8a3acb79
      Reynold Xin 提交于
      ## What changes were proposed in this pull request?
      The new name makes it more obvious with the verb "start" that we are actually starting some execution.
      
      ## How was this patch tested?
      This is just a rename. Existing unit tests should cover it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11627 from rxin/SPARK-13794.
      8a3acb79
    • H
      [SPARK-13766][SQL] Consistent file extensions for files written by internal data sources · aa0eba2c
      hyukjinkwon 提交于
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13766
      This PR makes the file extensions (written by internal datasource) consistent.
      
      **Before**
      
      - TEXT, CSV and JSON
      ```
      [.COMPRESSION_CODEC_NAME]
      ```
      
      - Parquet
      ```
      [.COMPRESSION_CODEC_NAME].parquet
      ```
      
      - ORC
      ```
      .orc
      ```
      
      **After**
      
      - TEXT, CSV and JSON
      ```
      .txt[.COMPRESSION_CODEC_NAME]
      .csv[.COMPRESSION_CODEC_NAME]
      .json[.COMPRESSION_CODEC_NAME]
      ```
      
      - Parquet
      ```
      [.COMPRESSION_CODEC_NAME].parquet
      ```
      
      - ORC
      ```
      [.COMPRESSION_CODEC_NAME].orc
      ```
      
      When the compression codec is set,
      - For Parquet and ORC, each still stays in Parquet and ORC format but just have compressed data internally. So, I think it is okay to name `.parquet` and `.orc` at the end.
      
      - For Text, CSV and JSON, each does not stays in each format but it has different data format according to compression codec. So, each has the names `.json`, `.csv` and `.txt` before the compression extension.
      
      ## How was this patch tested?
      
      Unit tests are used and `./dev/run_tests` for coding style tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11604 from HyukjinKwon/SPARK-13766.
      aa0eba2c
    • Y
      Revert "[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType" · 79064612
      Yin Huai 提交于
      This reverts commit 926e9c45.
      79064612
    • S
      [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType · 926e9c45
      Sameer Agarwal 提交于
      ## What changes were proposed in this pull request?
      
      A very minor change for using `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.
      
      ## How was this patch tested?
      
      N/A
      
      cc yhuai
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11597 from sameeragarwal/bigdecimal.
      926e9c45
    • S
      [SPARK-13492][MESOS] Configurable Mesos framework webui URL. · a4a0addc
      Sergiusz Urbaniak 提交于
      ## What changes were proposed in this pull request?
      
      Previously the Mesos framework webui URL was being derived only from the Spark UI address leaving no possibility to configure it. This commit makes it configurable. If unset it falls back to the previous behavior.
      
      Motivation:
      This change is necessary in order to be able to install Spark on DCOS and to be able to give it a custom service link. The configured `webui_url` is configured to point to a reverse proxy in the DCOS environment.
      
      ## How was this patch tested?
      
      Locally, using unit tests and on DCOS testing and stable revision.
      
      Author: Sergiusz Urbaniak <sur@mesosphere.io>
      
      Closes #11369 from s-urbaniak/sur-webui-url.
      a4a0addc
    • T
      [MINOR] Fix typo in 'hypot' docstring · 5f7dbdba
      Tristan Reid 提交于
      Minor typo:  docstring for pyspark.sql.functions: hypot has extra characters
      
      N/A
      
      Author: Tristan Reid <treid@netflix.com>
      
      Closes #11616 from tristanreid/master.
      5f7dbdba
    • Z
      [SPARK-13775] History page sorted by completed time desc by default. · 238447db
      zhuol 提交于
      ## What changes were proposed in this pull request?
      Originally the page is sorted by AppID by default.
      After tests with users' feedback, we think it might be best to sort by completed time (desc).
      
      ## How was this patch tested?
      Manually test, with screenshot as follows.
      ![sorted-by-complete-time-desc](https://cloud.githubusercontent.com/assets/11683054/13647686/d6dea924-e5fa-11e5-8fc5-68e039b74b6f.png)
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #11608 from zhuoliu/13775.
      238447db
    • S
      [SPARK-13778][CORE] Set the executor state for a worker when removing it · 40e06767
      Shixiong Zhu 提交于
      ## What changes were proposed in this pull request?
      
      When a worker is lost, the executors on this worker are also lost. But Master's ApplicationPage still displays their states as running.
      
      This patch just sets the executor state to `LOST` when a worker is lost.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11609 from zsxwing/SPARK-13778.
      40e06767
    • A
      [SPARK-13747][SQL] Fix concurrent query with fork-join pool · 37fcda3e
      Andrew Or 提交于
      ## What changes were proposed in this pull request?
      
      Fix this use case, which was already fixed in SPARK-10548 in 1.6 but was broken in master due to #9264:
      
      ```
      (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() }
      ```
      
      This threw `IllegalArgumentException` consistently before this patch. For more detail, see the JIRA.
      
      ## How was this patch tested?
      
      New test in `SQLExecutionSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11586 from andrewor14/fix-concurrent-sql.
      37fcda3e
    • S
      [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite · dbf2a7cf
      Sameer Agarwal 提交于
      ## What changes were proposed in this pull request?
      
      This PR is a small follow up on https://github.com/apache/spark/pull/11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
      ## How was this patch tested?
      
      No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11611 from sameeragarwal/expression-set.
      dbf2a7cf
    • S
      [SPARK-11861][ML] Add feature importances for decision trees · e1772d3f
      sethah 提交于
      This patch adds an API entry point for single decision tree feature importances.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9912 from sethah/SPARK-11861.
      e1772d3f
    • G
      [SPARK-13527][SQL] Prune Filters based on Constraints · c6aa356c
      gatorsmile 提交于
      #### What changes were proposed in this pull request?
      
      Remove all the deterministic conditions in a [[Filter]] that are contained in the Child's Constraints.
      
      For example, the first query can be simplified to the second one.
      
      ```scala
          val queryWithUselessFilter = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
            .where(
              ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
              'd.attr < 100 &&
              "tr2.a".attr === "tr1.a".attr)
      ```
      ```scala
          val query = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
      ```
      #### How was this patch tested?
      
      Six test cases are added.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11406 from gatorsmile/FilterRemoval.
      c6aa356c
    • D
      [SPARK-13523] [SQL] Reuse exchanges in a query · 3dc9ae2e
      Davies Liu 提交于
      ## What changes were proposed in this pull request?
      
      It’s possible to have common parts in a query, for example, self join, it will be good to avoid the duplicated part to same CPUs and memory (Broadcast or cache).
      
      Exchange will materialize the underlying RDD by shuffle or collect, it’s a great point to check duplicates and reuse them. Duplicated exchanges means they generate exactly the same result inside a query.
      
      In order to find out the duplicated exchanges, we should be able to compare SparkPlan to check that they have same results or not. We already have that for LogicalPlan, so we should move that into QueryPlan to make it available for SparkPlan.
      
      Once we can find the duplicated exchanges, we should replace all of them with same SparkPlan object (could be wrapped by ReusedExchage for explain), then the plan tree become a DAG. Since all the planner only work with tree, so this rule should be the last one for the entire planning.
      
      After the rule, the plan will looks like:
      
      ```
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
      :        :- Project [id#0L]
      :        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
      :        :     :- Range 0, 1, 4, 1024, [id#0L]
      :        :     +- INPUT
      :        +- INPUT
      :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      :  +- WholeStageCodegen
      :     :  +- Range 0, 1, 4, 1024, [id#1L]
      +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      ```
      
      ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
      
      For three ways SortMergeJoin,
      ```
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- SortMergeJoin [id#0L], [id#4L], None
      :        :- INPUT
      :        +- INPUT
      :- WholeStageCodegen
      :  :  +- Project [id#0L]
      :  :     +- SortMergeJoin [id#0L], [id#3L], None
      :  :        :- INPUT
      :  :        +- INPUT
      :  :- WholeStageCodegen
      :  :  :  +- Sort [id#0L ASC], false, 0
      :  :  :     +- INPUT
      :  :  +- Exchange hashpartitioning(id#0L, 200), None
      :  :     +- WholeStageCodegen
      :  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
      :  +- WholeStageCodegen
      :     :  +- Sort [id#3L ASC], false, 0
      :     :     +- INPUT
      :     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
      +- WholeStageCodegen
         :  +- Sort [id#4L ASC], false, 0
         :     +- INPUT
         +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
      ```
      ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
      
      If the same ShuffleExchange or BroadcastExchange, execute()/executeBroadcast() will be called by different parents, they should cached the RDD/Broadcast, return the same one for all the parents.
      
      ## How was this patch tested?
      
      Added some unit tests for this.  Had done some manual tests on TPCDS query Q59 and Q64, we can see some exchanges are re-used (this requires a change in PhysicalRDD to for sameResult, is be done in #11514 ).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11403 from davies/dedup.
      3dc9ae2e
    • Y
      [SPARK-13615][ML] GeneralizedLinearRegression supports save/load · 0dd06485
      Yanbo Liang 提交于
      ## What changes were proposed in this pull request?
      ```GeneralizedLinearRegression``` supports ```save/load```.
      cc mengxr
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11465 from yanboliang/spark-13615.
      0dd06485
    • H
      [SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be checked. · cad29a40
      hyukjinkwon 提交于
      ## What changes were proposed in this pull request?
      https://issues.apache.org/jira/browse/SPARK-13728
      
      https://github.com/apache/spark/pull/11509 makes the output only single ORC file.
      It was 10 files but this PR writes only single file. So, this could not skip stripes in ORC by the pushed down filters.
      So, this PR simply repartitions data into 10 so that the test could pass.
      ## How was this patch tested?
      
      unittest and `./dev/run_tests` for code style test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11593 from HyukjinKwon/SPARK-13728.
      cad29a40
    • G
      [SPARK-13763][SQL] Remove Project when its Child's Output is Nil · 23369c3b
      gatorsmile 提交于
      #### What changes were proposed in this pull request?
      
      As shown in another PR: https://github.com/apache/spark/pull/11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,
      
      ```SQL
      SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
      ```
      Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below:
      
      ```
      == Analyzed Logical Plan ==
      value: int
      Project [value#22]
      +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
         +- SubqueryAlias dummyTable
            +- Project [1 AS 1#21]
               +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
      +- Project
         +- OneRowRelation$
      ```
      
      After the fix, the optimized plan removed the useless `Project`, as shown below:
      ```
      == Optimized Logical Plan ==
      Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
      +- OneRowRelation$
      ```
      
      This PR is to remove `Project` when its Child's output is Nil
      
      #### How was this patch tested?
      
      Added a new unit test case into the suite `ColumnPruningSuite.scala`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11599 from gatorsmile/projectOneRowRelation.
      23369c3b
    • S
      [SPARK-13595][BUILD] Move docker, extras modules into external · 256704c7
      Sean Owen 提交于
      ## What changes were proposed in this pull request?
      
      Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
      
      ## How was this patch tested?
      
      This is tested with Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11523 from srowen/SPARK-13595.
      256704c7
    • D
      7791d0c3
    • D
      [SPARK-13242] [SQL] codegen fallback in case-when if there many branches · 9634e17d
      Davies Liu 提交于
      ## What changes were proposed in this pull request?
      
      If there are many branches in a CaseWhen expression, the generated code could go above the 64K limit for single java method, will fail to compile. This PR change it to fallback to interpret mode if there are more than 20 branches.
      
      This PR is based on #11243 and #11221, thanks to joehalliwell
      
      Closes #11243
      Closes #11221
      
      ## How was this patch tested?
      
      Add a test with 50 branches.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11592 from davies/fix_when.
      9634e17d
  3. 09 3月, 2016 12 次提交
    • D
      [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate · 53ba6d6e
      Dilip Biswal 提交于
      ## What changes were proposed in this pull request?
      Analysis exception occurs while running the following query.
      ```
      SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
      ```
      ```
      Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
      'Project ['ints]
      +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
         +- SubqueryAlias nestedarray
            +- LocalRelation [a#0], [[[[1,2,3]]]]
      ```
      
      ## How was this patch tested?
      
      Added new unit tests in SQLQuerySuite and HiveQlSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #11538 from dilipbiswal/SPARK-13698.
      53ba6d6e
    • A
      [SPARK-13769][CORE] Update Java Doc in Spark Submit · 8e8633e0
      Ahmed Kamal 提交于
      JIRA : https://issues.apache.org/jira/browse/SPARK-13769
      
      The java doc here (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
      needs to be updated from "The latter two operations are currently supported only for standalone cluster mode." to "The latter two operations are currently supported only for standalone and mesos cluster modes."
      
      Author: Ahmed Kamal <ahmed.kamal@badrit.com>
      
      Closes #11600 from AhmedKamal/SPARK-13769.
      8e8633e0
    • D
      [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun 提交于
      ## What changes were proposed in this pull request?
      
      In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
      Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.
      
      ## How was this patch tested?
      
      Manual.
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
      c3689bc2
    • A
      [SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs · cbff2803
      Andy Sloane 提交于
      ## What changes were proposed in this pull request?
      
      If a job is being scheduled in one thread which has a dependency on an
      RDD currently executing a shuffle in another thread, Spark would throw a
      NullPointerException. This patch synchronizes access to `mapStatuses` and
      skips null status entries (which are in-progress shuffle tasks).
      
      ## How was this patch tested?
      
      Our client code unit test suite, which was reliably reproducing the race
      condition with 10 threads, shows that this fixes it. I have not found a minimal
      test case to add to Spark, but I will attempt to do so if desired.
      
      The same test case was tripping up on SPARK-4454, which was fixed by
      making other DAGScheduler code thread-safe.
      
      shivaram srowen
      
      Author: Andy Sloane <asloane@tetrationanalytics.com>
      
      Closes #11505 from a1k0n/SPARK-13631.
      cbff2803
    • T
      [SPARK-13640][SQL] Synchronize ScalaReflection.mirror method. · 2c5af7d4
      Takuya UESHIN 提交于
      ## What changes were proposed in this pull request?
      
      `ScalaReflection.mirror` method should be synchronized when scala version is `2.10` because `universe.runtimeMirror` is not thread safe.
      
      ## How was this patch tested?
      
      I added a test to check thread safety of `ScalaRefection.mirror` method in `ScalaReflectionSuite`, which will throw the following Exception in Scala `2.10` without this patch:
      
      ```
      [info] - thread safety of mirror *** FAILED *** (49 milliseconds)
      [info]   java.lang.UnsupportedOperationException: tail of empty list
      [info]   at scala.collection.immutable.Nil$.tail(List.scala:339)
      [info]   at scala.collection.immutable.Nil$.tail(List.scala:334)
      [info]   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
      [info]   at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
      [info]   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
      [info]   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
      [info]   at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
      [info]   at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
      [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
      [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
      [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
      [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
      [info]   at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
      [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      ```
      
      Notice that the test will pass when Scala version is `2.11`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #11487 from ueshin/issues/SPARK-13640.
      2c5af7d4
    • D
      [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects · f3201aee
      Dongjoon Hyun 提交于
      ## What changes were proposed in this pull request?
      
      This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle.
      
      - Implement both null and type checking in equals functions.
      - Fix wrong type casting logic in SimpleJavaBean2.equals.
      - Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
      - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
      - Fix coding style: Add '{}' to single `for` statement in mllib examples.
      - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
      - Remove unused fields in `ChunkFetchIntegrationSuite`.
      - Add `stop()` to prevent resource leak.
      
      Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
      
      ## How was this patch tested?
      
      manual via `./dev/lint-java` and Coverity site.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11530 from dongjoon-hyun/SPARK-13692.
      f3201aee
    • J
      [SPARK-7286][SQL] Deprecate !== in favour of =!= · 035d3acd
      Jakob Odersky 提交于
      This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
      
      ## What changes were proposed in this pull request?
      Deprecate the SparkSQL column operator !== and use =!= as an alternative.
      Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).
      
      ## How was this patch tested?
      All currently existing tests.
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #11588 from jodersky/SPARK-7286.
      035d3acd
    • H
      [SPARK-13754] Keep old data source name for backwards compatibility · cc4ab37e
      Hossein 提交于
      ## Motivation
      CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result there are many tables created on older versions of spark with that name as the source. For backwards compatibility we should keep the old name.
      
      ## Proposed changes
      `com.databricks.spark.csv` was added to list of `backwardCompatibilityMap` in `ResolvedDataSource.scala`
      
      ## Tests
      A unit test was added to `CSVSuite` to parse a csv file using the old name.
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #11589 from falaki/SPARK-13754.
      cc4ab37e
    • D
      [SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelation · 982ef2b8
      Davies Liu 提交于
      ## What changes were proposed in this pull request?
      
      This PR fix the sizeInBytes of HadoopFsRelation.
      
      ## How was this patch tested?
      
      Added regression test for that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11590 from davies/fix_sizeInBytes.
      982ef2b8
    • B
      [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property... · d8813fa0
      Bryan Cutler 提交于
      [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property when getting param list
      
      ## What changes were proposed in this pull request?
      
      Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
      
      ## How was this patch tested?
      
      Added a test case with a class has a property that will raise an error when invoked and then call`Param.params` to verify that the property is not invoked, but still able to find another property in the class.  Also ran pyspark-ml test before fix that will trigger an error, and again after the fix to verify that the error was resolved and the method was working properly.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
      d8813fa0
    • J
      [SPARK-13755] Escape quotes in SQL plan visualization node labels · 81f54acc
      Josh Rosen 提交于
      When generating Graphviz DOT files in the SQL query visualization we need to escape double-quotes inside node labels. This is a followup to #11309, which fixed a similar graph in Spark Core's DAG visualization.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11587 from JoshRosen/graphviz-escaping.
      81f54acc
    • S
      [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull checks · e430614e
      Sameer Agarwal 提交于
      ## What changes were proposed in this pull request?
      
      If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
      
      For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.
      
      ## How was this patch tested?
      
      new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11511 from sameeragarwal/reorder-isnotnull.
      e430614e