1. 02 May, 2018 1 commit
    • FTS detects when primary is in recovery avoiding config change · d453a4aa
      Ashwin Agrawal committed
      Previously, when the primary was in crash recovery, the FTS probe failed and
      the primary was marked down. This change provides a recovery progress metric so
      that FTS can detect progress: the last replayed LSN is now included in the
      error message. This allows FTS to distinguish between recovery in progress and
      a recovery hang or rolling panics. FTS marks the primary down only when it
      detects that recovery is not making progress.
      
      For testing, a new fault injector is added to allow simulation of a recovery
      hang and of recovery in progress.
      
      FYI: this re-applies the previously reverted commit 7b7219a4.
      Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Co-authored-by: David Kimura <dkimura@pivotal.io>
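
The progress check described in this commit can be sketched as follows. This is a minimal illustration, not the Greenplum implementation: the `RecoveryTracker` class, the `parse_lsn` helper, and the rule of comparing each probe against the previous one are assumptions made for the example. It also assumes the LSN arrives in the PostgreSQL-style `hi/lo` hexadecimal text form.

```python
def parse_lsn(text):
    """Convert a 'hi/lo' hexadecimal LSN string into a single integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


class RecoveryTracker:
    """Hypothetical helper: remembers the last replayed LSN seen per primary."""

    def __init__(self):
        self.last_lsn = {}

    def observe(self, segment_id, lsn_text):
        """Return True if recovery advanced since the previous probe.

        The first observation counts as progress because there is
        nothing to compare against yet.
        """
        lsn = parse_lsn(lsn_text)
        previous = self.last_lsn.get(segment_id)
        self.last_lsn[segment_id] = lsn
        return previous is None or lsn > previous
```

With this sketch, a prober would keep the primary up while `observe()` returns True and treat a repeated, unchanged LSN as a hang.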
  2. 01 May, 2018 2 commits
  3. 10 Mar, 2018 2 commits
    • Move probe related functions to ftsprobe.c · 1e8fd668
      Asim R P committed
      Also incorporate the following review comments to 5a7aaf43:
      
       * Break the processing of response and updating of cluster
         configuration into two separate functions.
       * A probe timeout is treated as a failure and retried.
       * Added a new state to distinguish whether a response is processed or
         not.
       * Reduce logging volume by making use of the gp_log_fts GUC.
       * And a bunch of renames.
      Co-authored-by: David Kimura <dkimura@pivotal.io>
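
The review comment about timeouts can be illustrated with a small sketch. It is not taken from ftsprobe.c: the function name `probe_with_retries`, the `max_retries` parameter, and the use of Python's `TimeoutError` to stand in for a probe timeout are all assumptions for the example.

```python
def probe_with_retries(probe, max_retries):
    """Call probe() until it succeeds or the retries are exhausted.

    probe is any callable returning True on success. Raising TimeoutError
    is handled the same way as returning False: the attempt failed.
    Returns a (succeeded, attempts) tuple.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            if probe():
                return True, attempts
        except TimeoutError:
            pass  # a timeout counts as a failed attempt and is retried
    return False, attempts
```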
    • Make FTS single process, using async libpq API, remove threads · 5a7aaf43
      Asim R P committed
      A couple of problems with the threaded implementation were:
      
      1. The implementation was divided into two steps - spin up threads and
      send PROBE message to all primary segments in the first step.  Only
      after receiving response from all segments, in the second  step, spin up
      threads and send subsequent messages (e.g. syncrep off, promote), if
      necessary.  A single stubborn segment that wouldn't respond in the first
      step would delay the second step for everyone else.
      
      2. The standard elog.c interface cannot be used for error handling as it is
      not thread safe.  As a result, any assertion failures within threads
      could be missed or cause weird behavior, such as garbage characters
      in the process list entry of the FTS process.
      
      With the single process implementation, both these problems go away.  We
      introduce internal states for a connection from FTS to a segment (primary
      or mirror).  The states are transitioned until a final state is reached.
      Once all connections have reached final state (success or failure), the
      probe cycle ends.
      Co-authored-by: David Kimura <dkimura@pivotal.io>
      Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
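
The per-connection state machine described in this commit can be sketched roughly as below. The state names, the `NEXT_STATE` table, and the `step` callback are hypothetical and chosen only for illustration; the real code drives the asynchronous libpq API rather than a callback.

```python
from enum import Enum, auto


class ProbeState(Enum):
    # Hypothetical state names, chosen for this sketch only.
    CONNECTING = auto()
    SENDING = auto()
    RECEIVING = auto()
    SUCCESS = auto()
    FAILED = auto()


FINAL_STATES = {ProbeState.SUCCESS, ProbeState.FAILED}

NEXT_STATE = {
    ProbeState.CONNECTING: ProbeState.SENDING,
    ProbeState.SENDING: ProbeState.RECEIVING,
    ProbeState.RECEIVING: ProbeState.SUCCESS,
}


def run_probe_cycle(segments, step):
    """Advance every connection one step per pass until all are final.

    step(segment, state) stands in for one non-blocking I/O attempt and
    returns True (step done), False (step failed), or None (not ready).
    A real prober would also need a timeout so that a segment that is
    never ready cannot keep the cycle running forever.
    """
    states = {seg: ProbeState.CONNECTING for seg in segments}
    while any(s not in FINAL_STATES for s in states.values()):
        for seg, state in states.items():
            if state in FINAL_STATES:
                continue
            result = step(seg, state)
            if result is True:
                states[seg] = NEXT_STATE[state]
            elif result is False:
                states[seg] = ProbeState.FAILED
            # None: leave the state unchanged and try again next pass
    return states
```

Because each pass visits every connection, a slow segment only delays its own transitions, which is the property the threaded two-step design lacked.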
  4. 13 Jan, 2018 1 commit
  5. 16 Dec, 2017 2 commits
    • FTS turn off syncrep for primary when mirror is detected down · 9fbfec7a
      Jimmy Yih committed
      Synchronized replication is on by default for Greenplum. When the
      primary's mirror is detected down, the primary will continue to block
      all commits until the mirror comes back up.
      
      This commit introduces a new FTS message that will be sent to the
      primary if an FTS probe detects a mirror is down and primary is stuck
      in synchronous replication state. The new message will allow the
      primary to turn off synchronous replication by setting the GUC
      synchronous_standby_names to empty string (persisted in
      gp_replication.conf) and setting the WalSndCtl shared-memory syncrep
      state. As a result, backends that may be waiting for the mirror to
      receive a commit will be unblocked.
      
      FTS must record the mirror's status as down in the configuration before
      syncrep can be turned off by the primary.
      
      Author: Jimmy Yih <jyih@pivotal.io>
      Author: Asim R P <apraveen@pivotal.io>
      Author: Jacob Champion <pchampion@pivotal.io>
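
The ordering constraint in this commit (mark the mirror down first, then ask the primary to turn off syncrep) can be sketched as a small decision function. The function name, its boolean inputs, and the returned action strings are hypothetical; they are not the FTS message names used in the code.

```python
def next_fts_action(mirror_up, mirror_marked_down, primary_in_syncrep):
    """Pick what FTS does after a probe, following the commit's ordering.

    Returns one of the hypothetical action names used in this sketch:
    "none", "mark_mirror_down", or "send_syncrep_off".
    """
    if mirror_up:
        return "none"
    if not mirror_marked_down:
        # The configuration must record the mirror as down first.
        return "mark_mirror_down"
    if primary_in_syncrep:
        # Only now may the primary be told to stop waiting for the mirror.
        return "send_syncrep_off"
    return "none"
```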
    • Rename fts probe variables to be more generic · 9a18c185
      Jimmy Yih committed
      Along with the rename, fix the related unit tests.
      
      Author: Jimmy Yih <jyih@pivotal.io>
      Author: Asim R P <apraveen@pivotal.io>
  6. 08 Nov, 2017 2 commits
    • Cleanup code to reduce number of #ifdef USE_SEGWALREP. · cfacd5e2
      Ashwin Agrawal committed
      Moved a lot of filerep code to ftsprobefilerep.c, so ftsprobe.c is now
      specific to walrep.
      
      Also, along the way, removed the calls to `probePollOut()` and `probePollIn()`
      from `probeSegmentHelper()` as they are not needed.
    • Handle FTS probe message in a backend process · f86af5ec
      Xin Zhang committed
      Use a special connection parameter "gpconntype" to distinguish a normal libpq
      connection from a connection initiated by the FTS prober process for probing a
      primary segment. It's similar to "replication=true" for the WalSender
      process. Currently we support `gpconntype="fts"`; this can be extended to
      support other backend connection types.
      
      Currently, the FTS handler backend supports the `FTS_MSG_TYPE_PROBE` message
      type. The rest of the message handling and process creation is similar to
      the WalSender process.
      Signed-off-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Signed-off-by: Asim R P <apraveen@pivotal.io>
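
The idea of routing a connection by a startup parameter can be sketched as follows. This only illustrates the dispatch described in the commit message; the function name and the returned labels are hypothetical, and the real check lives in the backend's C startup code. The `"replication": "true"` comparison mirrors the wording of the commit message and is a simplification.

```python
def classify_connection(startup_params):
    """Decide which kind of backend should serve a new connection.

    startup_params is a dict of the key/value pairs sent by the client
    at connection start. The returned labels are for illustration only.
    """
    if startup_params.get("gpconntype") == "fts":
        return "fts_handler"
    if startup_params.get("replication") == "true":
        return "walsender"
    return "normal"
```

Keeping the check keyed on the parameter value leaves room for other `gpconntype` values later, as the commit message anticipates.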
  7. 27 Oct, 2017 1 commit
    • Coverity fixes for FTS. · c3a96e1a
      Abhijit Subramanya committed
      CID 178114: Performance inefficiencies  (PASS_BY_VALUE)
      The `ProbeConnectionInfo` parameter was being passed to the
      `probeSegmentHelper()` function by value. This is very inefficient. Fixed it by
      passing the parameter as a reference using a pointer.
      
      CID 178115: Calling "pqGetInt" without checking return value (as is done
      elsewhere 52 out of 57 times).
      The return value of `pqGetInt()` function was not being checked in the
      `processProbeResponse()` function. Fixed it by checking for the return value.
      If EOF is returned, a message is logged and false is returned to the caller.
      Signed-off-by: Taylor Vesely <tvesely@pivotal.io>
  8. 19 Oct, 2017 1 commit