1. 14 Mar, 2016 3 commits
    • D
      [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x. · 473263f9
      Dongjoon Hyun committed
      ## What changes were proposed in this pull request?
      
      For 2.0.0, we should bring **sbt** and the **sbt plugins** up to date. This PR checks the status of each plugin and bumps the following:
      
      * sbt: 0.13.9 --> 0.13.11
      * sbteclipse-plugin: 2.2.0 --> 4.0.0
      * sbt-dependency-graph: 0.7.4 --> 0.8.2
      * sbt-mima-plugin: 0.1.6 --> 0.1.9
      * sbt-revolver: 0.7.2 --> 0.8.0
      
      All other plugins are up to date. (Note that `sbt-avro` seems to have changed from 0.3.2 to 1.0.1, but it's not published in the repository.)
      
      During the upgrade, this PR also updates the following MiMa exclusion. Note that the related exclusion filter was already registered correctly; the difference seems to come from a change in the problem type that MiMa reports.
      ```
       // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
       ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
      -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
      +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins build.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11669 from dongjoon-hyun/update_mima.
      473263f9
    • J
      [SQL] fix typo in DataSourceRegister · f3daa099
      Jacky Li committed
      ## What changes were proposed in this pull request?
      fix typo in DataSourceRegister
      
      ## How was this patch tested?
      
      found when going through latest code
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #11686 from jackylk/patch-12.
      f3daa099
    • S
      [SPARK-13812][SPARKR] Fix SparkR lint-r test errors. · c7e68c39
      Sun Rui committed
      ## What changes were proposed in this pull request?
      
      This PR fixes all SparkR lint-r errors newly reported after the lintr package was updated from GitHub.
      
      ## How was this patch tested?
      
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11652 from sun-rui/SPARK-13812.
      c7e68c39
  2. 13 Mar, 2016 4 commits
    • B
      [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Exceptions · 515e4afb
      Bjorn Jonsson committed
      ## What changes were proposed in this pull request?
      Currently, when a java.net.BindException is thrown, it displays the following message:
      
      java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!
      
      This change adds port configuration suggestions to the BindException, for example, for the UI, it now displays
      
      java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
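
The message format quoted above can be sketched as follows. This is a minimal Python illustration of how such a message might be assembled, not Spark's actual Scala code; `bind_failure_message` and its parameters are hypothetical names.

```python
def bind_failure_message(service_name, max_retries, port_config_key):
    """Build a bind-failure message in the style quoted above.

    All names here are hypothetical; Spark's real implementation is in Scala.
    """
    base = (
        f"Address already in use: Service '{service_name}' "
        f"failed after {max_retries} retries!"
    )
    hint = (
        f" Consider explicitly setting the appropriate port for '{service_name}' "
        f"(for example {port_config_key} for {service_name}) to an available port "
        f"or increasing spark.port.maxRetries."
    )
    return base + hint


message = bind_failure_message("SparkUI", 16, "spark.ui.port")
```

Keeping the hint separate from the base text makes it easy to see which part of the message is new in this change.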
      
      ## How was this patch tested?
      Manual tests
      
      Author: Bjorn Jonsson <bjornjon@gmail.com>
      
      Closes #11644 from bjornjon/master.
      515e4afb
    • D
      [MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc. · db88d020
      Dongjoon Hyun committed
      ## What changes were proposed in this pull request?
      
      SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.
      
      * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
      * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
      db88d020
    • C
      [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows() · c079420d
      Cheng Lian committed
      ## What changes were proposed in this pull request?
      
      This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
      
      ## How was this patch tested?
      
      Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
      c079420d
    • C
      [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown from... · 4eace4d3
      Cheng Lian committed
      [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown from QueryExecution.assertAnalyzed
      
      PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached the partially analyzed plan to the `AnalysisException` thrown in `QueryExecution.assertAnalyzed()`. However, the original stack trace wasn't properly inherited. This PR fixes the issue by inheriting the stack trace.
      
      A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11677 from liancheng/analysis-exception-stacktrace.
      4eace4d3
  3. 12 Mar, 2016 8 commits
    • D
      [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources · ba8c86d0
      Davies Liu committed
      ## What changes were proposed in this pull request?
      
      This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that are created from an existing RDD. PhysicalScan is used for DataFrames that are created from data sources. This enables us to apply different optimizations to each of them.
      
      Also fixes the problem with sameResult() on two DataSourceScan instances.
      
      Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).
      
      ## How was this patch tested?
      
      Existing tests. Manually tested with TPCDS queries Q59 and Q64: all the duplicated exchanges can now be reused, and performance improved by more than 40% (half of the scans are saved).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11514 from davies/existing_rdd.
      ba8c86d0
    • D
      [SPARK-13830] prefer block manager than direct result for large result · 2ef4c596
      Davies Liu committed
      ## What changes were proposed in this pull request?
      
      The current RPC can't handle large blocks well: fetching a 100M block is very slow (about 1 minute). After switching to the block manager for the fetch, it took about 10 seconds (which could still be improved).
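
The size-based choice described above might look like the sketch below. It is an illustration only: the function name and the threshold parameter are hypothetical, and the commit message does not state which threshold Spark actually uses.

```python
def choose_result_channel(result_size_bytes, max_direct_bytes):
    """Decide how a task result travels back to the driver.

    Both the function and the threshold parameter are hypothetical.
    """
    if result_size_bytes > max_direct_bytes:
        return "block_manager"   # large result: fetched via the block manager
    return "direct"              # small result: sent directly over RPC


ONE_MB = 1024 * 1024
small = choose_result_channel(10 * 1024, ONE_MB)
large = choose_result_channel(100 * ONE_MB, ONE_MB)
```

Using the figures quoted in the description, a 100M block going from about 60 seconds to about 10 seconds is roughly a sixfold speedup.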
      
      ## How was this patch tested?
      
      existing unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11659 from davies/direct_result.
      2ef4c596
    • A
      [SPARK-13139][SQL] Parse Hive DDL commands ourselves · 66d9d0ed
      Andrew Or committed
      ## What changes were proposed in this pull request?
      
      This patch is ported over from viirya's changes in #11048. Currently, for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and, in the future (not part of this patch), use the `HiveCatalog` to process these DDLs. This is a precursor to merging `SQLContext` and `HiveContext`.
      
      Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
      
      ## How was this patch tested?
      
      Jenkins, plus a new `DDLCommandSuite`, which makes up about 40% of the changes here.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11573 from andrewor14/parser-plus-plus.
      66d9d0ed
    • Z
      [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples files · 42afd72c
      Zheng RuiFeng committed
      JIRA:  https://issues.apache.org/jira/browse/SPARK-13814
      
      ## What changes were proposed in this pull request?
      
      Delete unnecessary imports in the Python example files.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11651 from zhengruifeng/del_import_pe.
      42afd72c
    • J
      [SPARK-13807] De-duplicate `Python*Helper` instantiation code in PySpark streaming · 073bf9d4
      Josh Rosen committed
      This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.
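
The point about bare `raise` can be illustrated with a small sketch. The function names are hypothetical. Under Python 2, which PySpark supported at the time, `raise e` restarted the traceback at the re-raise site, whereas a bare `raise` keeps the original traceback; the code below only demonstrates the bare `raise` behavior.

```python
import traceback


def failing_helper():
    raise ValueError("boom")


def rethrow_with_bare_raise():
    try:
        failing_helper()
    except ValueError:
        # A bare `raise` re-raises the active exception with its traceback intact.
        raise


def innermost_frame_name():
    """Return the name of the function where the traceback ends."""
    try:
        rethrow_with_bare_raise()
    except ValueError as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        return frames[-1].name


origin = innermost_frame_name()
```

Here `origin` names the function that originally raised, which is the information a reader of the stack trace usually needs.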
      
      Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
      073bf9d4
    • N
      [SPARK-13328][CORE] Poor read performance for broadcast variables with dynamic resource allocation · ff776b2f
      Nezih Yigitbasi committed
      When dynamic resource allocation is enabled, fetching broadcast variables from removed executors caused job failures; SPARK-9591 fixed this by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and entries in this list can be stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. We have observed that, with the default settings of 3 max retries and 5s between retries (that's 15s per location), reading a broadcast variable can take as long as ~17m (70 failed attempts * 15s/attempt).
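
The arithmetic behind the ~17m figure can be checked directly. The constants below are taken from the description above; the variable names are only for illustration.

```python
MAX_RETRIES = 3            # default retry count quoted above
RETRY_WAIT_SECONDS = 5     # default wait between retries quoted above
STALE_LOCATIONS = 70       # failed attempts in the example above

# Each stale location costs all retries before the next location is tried.
seconds_per_location = MAX_RETRIES * RETRY_WAIT_SECONDS
total_seconds = STALE_LOCATIONS * seconds_per_location
total_minutes = total_seconds / 60
```

This gives 15 seconds per stale location and 1050 seconds in total, i.e. 17.5 minutes.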
      
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #11241 from nezihyigitbasi/SPARK-13328.
      ff776b2f
    • L
      [STREAMING][MINOR] Fix a duplicate "be" in comments · eb650a81
      Liwei Lin committed
      Author: Liwei Lin <proflin.me@gmail.com>
      
      Closes #11650 from lw-lin/typo.
      eb650a81
    • M
      [SPARK-13780][SQL] Add missing dependency to build. · 99b7187c
      Marcelo Vanzin committed
      This is needed to avoid odd compiler errors when building just the
      sql package with maven, because of odd interactions between scalac
      and shaded classes.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11640 from vanzin/SPARK-13780.
      99b7187c
  4. 11 Mar, 2016 25 commits
    • C
      [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame · 6d37e1eb
      Cheng Lian committed
      ## What changes were proposed in this pull request?
      
      PR #11443 temporarily disabled the MiMA check; this PR re-enables it.
      
      One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes.
      
      ## How was this patch tested?
      
      Tested by MiMA check triggered by Jenkins.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11656 from liancheng/re-enable-mima.
      6d37e1eb
    • M
      [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive. · 07f1c544
      Marcelo Vanzin committed
      In preparation for the demise of assemblies, this change allows the
      YARN backend to use multiple jars and globs as the "Spark jar". The
      config option has been renamed to "spark.yarn.jars" to reflect that.
      
      A second option "spark.yarn.archive" was also added; if set, this
      takes precedence and uploads an archive expected to contain the jar
      files with the Spark code and its dependencies.
      
      Existing deployments should keep working, mostly. This change drops
      support for the "SPARK_JAR" environment variable, and also does not
      fall back to using "jarOfClass" if no configuration is set, falling
      back to finding files under SPARK_HOME instead. This should be fine
      since "jarOfClass" probably wouldn't work unless you were using
      spark-submit anyway.
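
The precedence described above can be sketched as follows. This is a simplified illustration, not the YARN client code: `resolve_spark_jars` is a hypothetical helper, and splitting the `spark.yarn.jars` value on commas is an assumption.

```python
def resolve_spark_jars(conf, spark_home):
    """Pick the source of Spark jars using the precedence described above."""
    archive = conf.get("spark.yarn.archive")
    if archive:
        # If set, the archive takes precedence over everything else.
        return ("archive", archive)
    jars = conf.get("spark.yarn.jars")
    if jars:
        # The option accepts multiple entries; comma separation is assumed here.
        return ("jars", jars.split(","))
    # Neither option set: fall back to files under SPARK_HOME.
    return ("spark_home", spark_home)


both = resolve_spark_jars(
    {"spark.yarn.archive": "hdfs:///spark.zip", "spark.yarn.jars": "a.jar,b.jar"},
    "/opt/spark",
)
only_jars = resolve_spark_jars({"spark.yarn.jars": "a.jar,b.jar"}, "/opt/spark")
neither = resolve_spark_jars({}, "/opt/spark")
```

The paths and file names in the usage lines are made up for the example.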
      
      Tested with the unit tests, and trying the different config options
      on a YARN cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11500 from vanzin/SPARK-13577.
      07f1c544
    • N
      [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java MaxAbsScaler example · 8fff0f92
      Nick Pentreath committed
      ## What changes were proposed in this pull request?
      
      Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).
      
      ## How was this patch tested?
      
      Existing build/unit tests
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #11653 from MLnick/java-maxabs-example-fix.
      8fff0f92
    • S
      [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and random forest · 234f781a
      sethah committed
      ## What changes were proposed in this pull request?
      
      This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.
      
      ## How was this patch tested?
      
      Python doc tests for the affected classes were updated to check feature importances.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11622 from sethah/SPARK-13787.
      234f781a
    • Y
      [SPARK-13512][ML] add example and doc for MaxAbsScaler · 0b713e04
      Yuhao Yang committed
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-13512
      Add example and doc for ml.feature.MaxAbsScaler.
      
      ## How was this patch tested?
       unit tests
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11392 from hhbyyh/maxabsdoc.
      0b713e04
    • J
      [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class / Spark assembly · 6ca990fb
      Josh Rosen committed
      This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
      
      - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
         - This required me to delete two classes full of dead code that we don't use anymore
      - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
      - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
      6ca990fb
    • Z
      [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB · d18276cb
      Zheng RuiFeng committed
      JIRA: https://issues.apache.org/jira/browse/SPARK-13672
      
      ## What changes were proposed in this pull request?
      
      add two python examples of BisectingKMeans for ml and mllib
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11515 from zhengruifeng/mllib_bkm_pe.
      d18276cb
    • M
      [MINOR][CORE] Fix a duplicate "and" in a log message. · e33bc67c
      Marcelo Vanzin committed
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11642 from vanzin/spark-conf-typo.
      e33bc67c
    • W
      [HOT-FIX] fix compile · 74c4e265
      Wenchen Fan committed
      Fix the compilation failure introduced by https://github.com/apache/spark/pull/11555 because of a merge conflict.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11648 from cloud-fan/hotbug.
      74c4e265
    • W
      [SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions · 6871cc8f
      Wenchen Fan committed
      ## What changes were proposed in this pull request?
      
      Add SQL generation support for window functions. The idea is simple: treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
      
      This PR also fixes SPARK-13720 by improving the process of adding the extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we need to update qualifiers in all ancestors that refer to the attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery if necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom-up.
      
      Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.
      
      Many thanks to gatorsmile for the initial discussion and test cases!
      
      ## How was this patch tested?
      
      new tests in `LogicalPlanToSQLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11555 from cloud-fan/window.
      6871cc8f
    • G
      [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate useless Window · 560489f4
      gatorsmile committed
      #### What changes were proposed in this pull request?
      
      `projectList` is useless: its value is always the same as `child.output`. This PR removes it from the class `Window`, which simplifies the code in Analyzer and Optimizer.
      
      This PR is based on the discussion started by cloud-fan in a separate PR:
      https://github.com/apache/spark/pull/5604#discussion_r55140466
      
      This PR also eliminates useless `Window`.
      
      cloud-fan yhuai
      
      #### How was this patch tested?
      
      Existing test cases cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11565 from gatorsmile/removeProjListWindow.
      560489f4
    • Y
      [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs · 4d535d1f
      Yanbo Liang committed
      ## What changes were proposed in this pull request?
      
      SparkR support first/last with ignore NAs
      
      cc sun-rui felixcheung shivaram
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11267 from yanboliang/spark-13389.
      4d535d1f
    • S
      [SPARK-13789] Infer additional constraints from attribute equality · c3a6269c
      Sameer Agarwal committed
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
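
The inference rule can be illustrated with a small sketch. This is not Catalyst's implementation: the function and its data representation are hypothetical, and propagating through chains of equalities (a fixed-point loop) is an assumption that goes beyond the single-step example in the description.

```python
def infer_equality_constraints(literal_eq, attr_eq):
    """Infer extra `attr = literal` constraints from attribute equalities.

    literal_eq: dict mapping attribute name -> literal, e.g. {"a": 5}
    attr_eq: iterable of (left, right) attribute pairs, e.g. [("a", "b")]
    Returns only the newly inferred constraints.
    """
    known = dict(literal_eq)
    changed = True
    while changed:  # iterate to a fixed point so chains like a = b, b = c work
        changed = False
        for left, right in attr_eq:
            for src, dst in ((left, right), (right, left)):
                if src in known and dst not in known:
                    known[dst] = known[src]
                    changed = True
    return {k: v for k, v in known.items() if k not in literal_eq}


inferred = infer_equality_constraints({"a": 5}, [("a", "b")])
```

With the constraints from the description (`a = 5`, `a = b`), `inferred` holds the single new constraint `b = 5`.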
      
      ## How was this patch tested?
      
      Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11618 from sameeragarwal/infer-isequal-constraints.
      c3a6269c
    • O
      [SPARK-13327][SPARKR] Added parameter validations for colnames<- · 416e71af
      Oscar D. Lara Yejas committed
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      
      Closes #11220 from olarayej/SPARK-13312-3.
      416e71af
    • D
      [MINOR][DOC] Fix supported hive version in doc · 88fa8666
      Dongjoon Hyun committed
      ## What changes were proposed in this pull request?
      
      Today, Spark 1.6.1 and the updated docs were released. Unfortunately, the docs contain obsolete Hive version information: [Building Spark](http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support). This PR fixes the following two lines.
      ```
      -By default Spark will build with Hive 0.13.1 bindings.
      +By default Spark will build with Hive 1.2.1 bindings.
      -# Apache Hadoop 2.4.X with Hive 13 support
      +# Apache Hadoop 2.4.X with Hive 1.2.1 support
      ```
      `sql/README.md` file also describe
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11639 from dongjoon-hyun/fix_doc_hive_version.
      88fa8666
    • C
      [SPARK-13244][SQL] Migrates DataFrame to Dataset · 1d542785
      Cheng Lian committed
      ## What changes were proposed in this pull request?
      
      This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.
      
      Most Scala code changes are source compatible, but the Java API is broken because Java knows nothing about Scala type aliases (the fix is mostly replacing `DataFrame` with `Dataset<Row>`).
      
      There are several noticeable API changes related to those returning arrays:
      
      1.  `collect`/`take`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def collect(): Array[Row]
              def take(n: Int): Array[Row]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def collect(): Array[T]
              def take(n: Int): Array[T]
      
              def collectRows(): Array[Row]
              def takeRows(n: Int): Array[Row]
              ```
      
          Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.
      
          Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
      
      1.  `randomSplit`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
              def randomSplit(weights: Array[Double]): Array[DataFrame]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
              def randomSplit(weights: Array[Double]): Array[Dataset[T]]
              ```
      
          Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.
      
      1.  `groupBy`
      
          Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
      
      Other noticeable changes:
      
      1.  Dataset always does eager analysis now
      
          We used to support disabling DataFrame eager analysis to help report the partially analyzed, malformed logical plan on analysis failure. However, Dataset encoders require eager analysis during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`.
      
      ## How was this patch tested?
      
      Existing tests do the work.
      
      ## TODO
      
      - [ ] Fix all tests
      - [ ] Re-enable MiMA check
      - [ ] Update ScalaDoc (`since`, `group`, and example code)
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Cheng Lian <liancheng@users.noreply.github.com>
      
      Closes #11443 from liancheng/ds-to-df.
      1d542785
    • S
      [SPARK-13604][CORE] Sync worker's state after registering with master · 27fe6bac
      Shixiong Zhu committed
      ## What changes were proposed in this pull request?
      
      The following lists all cases in which Master cannot talk to Worker for a while and the network then recovers.
      
      1. Master doesn't know the network issue (not yet timeout)
      
        a. Worker doesn't know the network issue (onDisconnected is not called)
          - Worker keeps sending Heartbeat. Both Worker and Master don't know the network issue. Nothing to do. (Finally, Master will notice the heartbeat timeout if network is not recovered)
      
        b. Worker knows the network issue (onDisconnected is called)
          - Worker stops sending Heartbeat and sends `RegisterWorker` to master. Master will reply `RegisterWorkerFailed("Duplicate worker ID")`. Worker calls "System.exit(1)" (Finally, Master will notice the heartbeat timeout if network is not recovered) (May leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602))
      
      2. Worker timeout (Master knows the network issue). In this case, Master removes Worker and its executors and drivers.
      
        a. Worker doesn't know the network issue (onDisconnected is not called)
          - Worker keeps sending Heartbeat.
          - If the network is back, say Master receives Heartbeat, Master sends `ReconnectWorker` to Worker
          - Worker sends `RegisterWorker` to master.
          - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
      
        b. Worker knows the network issue (onDisconnected is called)
          - Worker stops sending `Heartbeat`. Worker will send `RegisterWorker` to master.
          - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
      
      This PR fixes the executor and driver leaks in 2.a and 2.b when Worker re-registers with Master. The approach is to make Worker send `WorkerLatestState` to sync its state after registering with Master successfully. Master then asks Worker to kill the unknown executors and drivers.
      
      Note: Worker cannot just kill executors after registering with Master because, in the worker, `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` happens before `RegisteredWorker`, Worker's executor list will contain new executors after Master accepts `RegisterWorker`, and we should not kill these executors. So Worker sends the list to Master and lets Master tell Worker which executors should be killed.
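
The reconciliation described above amounts to a set difference computed on the Master side. The sketch below is a minimal illustration with hypothetical names and IDs, not the actual Master code.

```python
def executors_to_kill(master_known, worker_reported):
    """Return executor IDs the worker runs but the master does not know about."""
    return sorted(set(worker_reported) - set(master_known))


# Hypothetical IDs: the master removed exec-1 after the worker timed out,
# then launched exec-3 around the time the worker re-registered.
master_known = {"exec-3"}
worker_reported = ["exec-1", "exec-3"]
to_kill = executors_to_kill(master_known, worker_reported)
```

Because the decision is made by Master, a newly launched executor such as `exec-3` is never killed by mistake, which is the race the note above describes.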
      
      ## How was this patch tested?
      
      test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11455 from zsxwing/orphan-executors.
      27fe6bac
    • D
      [SPARK-13751] [SQL] generate better code for Filter · 020ff8cd
      Davies Liu committed
      ## What changes were proposed in this pull request?
      
      This PR improves the codegen of Filter by:
      
      1. Filtering out rows early if they contain a null value that would cause the condition to evaluate to null or false. After this, the condition can be simplified because the inputs are no longer nullable.
      
      2. Splitting the condition into conjunctive predicates and checking them one by one.
      
      Here is a piece of generated code for Filter in TPCDS Q55:
      ```java
      /* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
      /* 110 */       /* input[0, int] */
      /* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
      /* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
      /* 113 */       /* input[1, int] */
      /* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
      /* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
      /* 116 */       /* input[2, int] */
      /* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
      /* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
      /* 119 */
      /* 120 */       if (project_isNull3) continue;
      /* 121 */       if (project_isNull4) continue;
      /* 122 */       if (project_isNull2) continue;
      /* 123 */
      /* 124 */       /* (input[1, int] = 11) */
      /* 125 */       boolean filter_value6 = false;
      /* 126 */       filter_value6 = project_value3 == 11;
      /* 127 */       if (!filter_value6) continue;
      /* 128 */
      /* 129 */       /* (input[2, int] = 1999) */
      /* 130 */       boolean filter_value9 = false;
      /* 131 */       filter_value9 = project_value4 == 1999;
      /* 132 */       if (!filter_value9) continue;
      /* 133 */
      /* 134 */       filter_metricValue1.add(1);
      /* 135 */
      /* 136 */       /*** CONSUME: Project [d_date_sk#141] */
      /* 137 */
      /* 138 */       project_rowWriter1.write(0, project_value2);
      /* 139 */       append(project_result1.copy());
      ```
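
The same two steps (null checks first, then each conjunct in turn) can be read as the interpreted sketch below. It is a Python illustration of the control flow only, with hypothetical names; it is not how Spark generates or runs the Java code.

```python
def filter_rows(rows, not_null_columns, predicates):
    """Keep rows that pass null checks and every conjunctive predicate.

    rows: iterable of dicts; not_null_columns: column names that must be non-null;
    predicates: list of functions row -> bool, evaluated in order.
    """
    kept = []
    for row in rows:
        # Step 1: drop the row early if any required column is null.
        if any(row[col] is None for col in not_null_columns):
            continue
        # Step 2: check each conjunct in turn; `all` stops at the first failure.
        if all(pred(row) for pred in predicates):
            kept.append(row)
    return kept


rows = [
    {"d_date_sk": 1, "d_moy": 11, "d_year": 1999},
    {"d_date_sk": 2, "d_moy": None, "d_year": 1999},
    {"d_date_sk": 3, "d_moy": 11, "d_year": 2000},
]
result = filter_rows(
    rows,
    ["d_moy", "d_year", "d_date_sk"],
    [lambda r: r["d_moy"] == 11, lambda r: r["d_year"] == 1999],
)
```

Only the first sample row survives: the second is dropped by the null check and the third by the `d_year` predicate.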
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11585 from davies/gen_filter.
      020ff8cd
    • D
      [SPARK-3854][BUILD] Scala style: require spaces before `{`. · 91fed8e9
      Dongjoon Hyun committed
      ## What changes were proposed in this pull request?
      
      Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that forbids the '){' pattern in favor of the prevailing style shown below, and fixes the code accordingly. Enforcing this in ScalaStyle from now on will improve Scala code quality and reduce review time.
      ```
      // Correct:
      if (true) {
        println("Wow!")
      }
      
      // Incorrect:
      if (true){
         println("Wow!")
      }
      ```
      IntelliJ also shows new warnings based on this.
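
A check of this kind can be approximated with a regular expression. The sketch below is a standalone Python illustration with a hypothetical pattern and helper; it is not the ScalaStyle rule added by the PR, whose exact configuration is not shown here.

```python
import re

# Hypothetical pattern: a closing parenthesis immediately followed by `{`.
NO_SPACE_BEFORE_BRACE = re.compile(r"\)\{")


def violations(source):
    """Return 1-based line numbers that contain the `){` pattern."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if NO_SPACE_BEFORE_BRACE.search(line)
    ]


sample = 'if (true) {\n  println("Wow!")\n}\nif (true){\n  println("Wow!")\n}\n'
bad_lines = violations(sample)
```

Run on the two snippets from the description, only the line with `if (true){` is flagged.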
      
      ## How was this patch tested?
      
      Pass the Jenkins ScalaStyle test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11637 from dongjoon-hyun/SPARK-3854.
      91fed8e9
    • J
      [SPARK-13696] Remove BlockStore class & simplify interfaces of mem. & disk stores · 81d48532
      Josh Rosen committed
      Today, both the MemoryStore and DiskStore implement a common `BlockStore` API, but I feel that this API is inappropriate because it abstracts away important distinctions between the behavior of these two stores.
      
      For instance, the disk store doesn't have a notion of storing deserialized objects, so it's confusing for it to expose object-based APIs like putIterator() and getValues() instead of only exposing binary APIs and pushing the responsibilities of serialization and deserialization to the client. Similarly, the DiskStore put() methods accepted a `StorageLevel` parameter even though the disk store can only store blocks in one form.
      
      As part of a larger BlockManager interface cleanup, this patch removes the BlockStore interface and refines the MemoryStore and DiskStore interfaces to reflect narrower sets of responsibilities for those components. Some of the benefits of this interface cleanup are reflected in simplifications to several unit tests to eliminate now-unnecessary mocking, significant simplification of the BlockManager's `getLocal()` and `doPut()` methods, and a narrower API between the MemoryStore and DiskStore.
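      
      To illustrate the narrower split described above, here is a minimal sketch. It is not Spark's code: `SketchDiskStore`, `SketchMemoryStore`, and their method signatures are hypothetical, and an in-memory map stands in for real files. The point is only that the disk-side store exposes bytes alone and takes no storage level, while the memory-side store can hold either deserialized values or bytes.
      
      ```scala
      import scala.collection.mutable
      
      object StoreSketch {
        // Disk side: bytes only, and no StorageLevel parameter.
        // Serialization and deserialization are left to the caller.
        final class SketchDiskStore {
          // An in-memory map stands in for files on disk.
          private val files = mutable.Map.empty[String, Array[Byte]]
          def putBytes(blockId: String, bytes: Array[Byte]): Unit = files.update(blockId, bytes.clone())
          def getBytes(blockId: String): Option[Array[Byte]] = files.get(blockId).map(_.clone())
          def contains(blockId: String): Boolean = files.contains(blockId)
        }
      
        // Memory side: can hold a block as deserialized values or as serialized bytes.
        final class SketchMemoryStore {
          private val values = mutable.Map.empty[String, Vector[Any]]
          private val bytes = mutable.Map.empty[String, Array[Byte]]
          def putValues(blockId: String, it: Iterator[Any]): Unit = values.update(blockId, it.toVector)
          def putBytes(blockId: String, data: Array[Byte]): Unit = bytes.update(blockId, data.clone())
          def getValues(blockId: String): Option[Iterator[Any]] = values.get(blockId).map(_.iterator)
          def getBytes(blockId: String): Option[Array[Byte]] = bytes.get(blockId).map(_.clone())
        }
      }
      ```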
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11534 from JoshRosen/remove-blockstore-interface.
      81d48532
    • T
      [SQL][TEST] Increased timeouts to reduce flakiness in ContinuousQueryManagerSuite · 3d2b6f56
      Tathagata Das committed
      ## What changes were proposed in this pull request?
      
      ContinuousQueryManager is sometimes flaky on Jenkins. I could not reproduce it on my machine, so I guess it is about the waiting times, which cause problems when Jenkins is loaded. I have increased the wait time in the hope that it will be less flaky.
      
      ## How was this patch tested?
      
      I reran the unit test many times in a loop on my machine. I am going to run it a few times on Jenkins; that's the real test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11638 from tdas/cqm-flaky-test.
      3d2b6f56
    • N
      [SPARK-13790] Speed up ColumnVector's getDecimal · 747d2f53
      Nong Li committed
      ## What changes were proposed in this pull request?
      
      We should reuse an object, as the other non-primitive type getters do. For
      a query that computes averages over decimal columns, this shows a 10% speedup
      on overall query times.
      
      ## How was this patch tested?
      
      Existing tests and this benchmark
      
      ```
      TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
      --------------------------------------------------------------------------------
      q27-agg (master)                       10627 / 11057         10.8          92.3
      q27-agg (this patch)                     9722 / 9832         11.8          84.4
      ```
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11624 from nongli/spark-13790.
      747d2f53
    • S
      [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality · 19f4ac6d
      Sameer Agarwal committed
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.
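      
      The reasoning is that under SQL's three-valued logic `a !== b` evaluates to null when either side is null, so any row that passes the condition has both sides non-null. The rule can be sketched on a toy expression tree. This is not Catalyst code: `Expr`, `Attr`, `NotEqual`, `IsNotNull`, and `inferNotNull` below are hypothetical stand-ins, and only the single case described above (an `a !== b` over two attributes) is handled.
      
      ```scala
      object NotNullInference {
        sealed trait Expr
        final case class Attr(name: String) extends Expr
        final case class NotEqual(left: Expr, right: Expr) extends Expr
        final case class IsNotNull(child: Expr) extends Expr
      
        /** For each `a !== b` constraint over attributes, also emit IsNotNull(a) and IsNotNull(b). */
        def inferNotNull(constraints: Set[Expr]): Set[Expr] =
          constraints ++ constraints.flatMap {
            case NotEqual(a: Attr, b: Attr) => Set[Expr](IsNotNull(a), IsNotNull(b))
            case _ => Set.empty[Expr]
          }
      }
      ```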
      
      ## How was this patch tested?
      
      1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
      2. Added a test in `NullFilteringSuite` to make sure an inner join with a "non-equal" condition appropriately filters out nulls from its inputs.
      
      cc nongli
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11594 from sameeragarwal/isnotequal-constraints.
      19f4ac6d
    • B
      [SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys · 235f4ac6
      bomeng committed
      The contains() method is not consistent with get() when the key is deprecated. For example:
      import org.apache.spark.SparkConf
      val conf = new SparkConf()
      conf.set("spark.io.compression.lz4.block.size", "12345")  // logs a deprecation warning
      conf.get("spark.io.compression.lz4.block.size") // returns 12345
      conf.get("spark.io.compression.lz4.blockSize") // returns 12345
      conf.contains("spark.io.compression.lz4.block.size") // returns true
      conf.contains("spark.io.compression.lz4.blockSize") // returns false
      
      The fix makes contains() and get() consistent for deprecated keys.
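      
      One way to picture the intended behavior is a lookup that falls back to deprecated alternates for both `get` and `contains`. The sketch below is not SparkConf's implementation: `MiniConf` and its `alternates` table are hypothetical, and only the one key pair from the example above is registered.
      
      ```scala
      import scala.collection.mutable
      
      object MiniConfSketch {
        // Hypothetical table: current key -> deprecated keys that are still honoured.
        private val alternates: Map[String, Seq[String]] = Map(
          "spark.io.compression.lz4.blockSize" -> Seq("spark.io.compression.lz4.block.size")
        )
      
        final class MiniConf {
          private val settings = mutable.Map.empty[String, String]
      
          def set(key: String, value: String): MiniConf = { settings.update(key, value); this }
      
          /** Looks up the key itself, then any deprecated alternate registered for it. */
          def getOption(key: String): Option[String] =
            settings.get(key).orElse {
              alternates.getOrElse(key, Nil).collectFirst {
                case old if settings.contains(old) => settings(old)
              }
            }
      
          // Defined via getOption, so contains and get cannot disagree.
          def contains(key: String): Boolean = getOption(key).isDefined
        }
      }
      ```
      
      Because `contains` is derived from the same lookup as `getOption`, a value set under the deprecated key is visible through the new key in both methods.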
      
      I've added a test case for this.
      
      Unit tests should be sufficient.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #11568 from bomeng/SPARK-13727.
      235f4ac6
    • L
      [SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen plans · d24801ad
      Liang-Chi Hsieh committed
      JIRA: https://issues.apache.org/jira/browse/SPARK-13636
      
      ## What changes were proposed in this pull request?
      
      As shown in the wholestage codegen version of the Sort operator, when Sort is on top of Exchange (or another operator that produces UnsafeRow), we create variables from the UnsafeRow and then create another UnsafeRow from these variables. We should avoid this unnecessary unpacking and repacking of variables from UnsafeRows.
      
      ## How was this patch tested?
      
      All existing wholestage codegen tests should pass.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11484 from viirya/direct-consume-unsaferow.
      d24801ad