"DetectDataCenterQuery":"select substring_index(substring_index(@@hostname, '-',3), '-', -1) as dc",
"PhysicalEnvironmentPattern":"",
"DetectSemiSyncEnforcedQuery":""
}
```
...
...
@@ -60,3 +61,75 @@ You will configure data center awareness in one of two methods:
### Cluster domain
Of lesser importance, and mostly for visibility, `DetectClusterDomainQuery` should return the VIP, CNAME, or other address of the cluster's master.
### Semi-sync topology
In some environments, it is important to control not only the number of semi-sync replicas, but also whether a given replica is a semi-sync or an async replica.
`orchestrator` can detect an undesired semi-sync configuration and toggle the semi-sync flags
`rpl_semi_sync_slave_enabled` and `rpl_semi_sync_master_enabled` to correct the situation.
`orchestrator` can detect if there is an incorrect number of semi-sync replicas in the topology ([`LockedSemiSyncMaster`](failure-detection.md#lockedsemisyncmaster) and
[`MasterWithTooManySemiSyncReplicas`](failure-detection.md#masterwithtoomanysemisyncreplicas)), and can then correct the situation by enabling/disabling
the semi-sync replica flags accordingly.
This behavior can be controlled by the following options:
- `DetectSemiSyncEnforcedQuery`: query that returns the semi-sync priority (zero means async replica; a higher number means higher priority)
- `EnforceExactSemiSyncReplicas`: flag that decides whether to enforce a _strict_ semi-sync replica topology. If enabled, the recovery of `LockedSemiSyncMaster`
  and `MasterWithTooManySemiSyncReplicas` will enable _and disable_ semi-sync on the replicas to match the desired topology exactly based on the priority order.
- `RecoverLockedSemiSyncMaster`: flag that decides whether to recover from a `LockedSemiSyncMaster` scenario. If enabled, the recovery of `LockedSemiSyncMaster`
  will enable _(but never disable)_ semi-sync on the replicas in the priority order to match the master wait count. This option has no effect if
  `EnforceExactSemiSyncReplicas` is set. It is only useful if you'd like to handle a situation in which there are too few semi-sync replicas,
  but not one in which there are too many.
- `ReasonableLockedSemiSyncMasterSeconds`: number of seconds after which the `LockedSemiSyncMaster` condition is triggered; if not set, falls back to `ReasonableReplicationLagSeconds`
The priority order is defined by `DetectSemiSyncEnforcedQuery` (zero means async replica; higher number is higher priority), the promotion rule (`DetectPromotionRuleQuery`)
and the hostname (fallback).
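
The priority order described above can be sketched in Go as a three-level comparison. This is only an illustration, not `orchestrator`'s actual implementation: the `Replica` type, the numeric `PromotionRank` field (lower is assumed to be preferred), the ascending hostname tie-break and the exclusion of zero-priority replicas from the candidate list are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Replica is a hypothetical, simplified stand-in for a replica instance.
type Replica struct {
	Hostname         string
	SemiSyncPriority uint // value returned by DetectSemiSyncEnforcedQuery; 0 means async
	PromotionRank    int  // hypothetical numeric rank of the promotion rule; lower is preferred
}

// prioritize drops async replicas (priority 0) and sorts the remaining
// candidates: higher semi-sync priority first, then better promotion
// rank, then hostname as the final fallback.
func prioritize(replicas []Replica) []Replica {
	candidates := make([]Replica, 0, len(replicas))
	for _, r := range replicas {
		if r.SemiSyncPriority > 0 {
			candidates = append(candidates, r)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		a, b := candidates[i], candidates[j]
		if a.SemiSyncPriority != b.SemiSyncPriority {
			return a.SemiSyncPriority > b.SemiSyncPriority
		}
		if a.PromotionRank != b.PromotionRank {
			return a.PromotionRank < b.PromotionRank
		}
		return a.Hostname < b.Hostname
	})
	return candidates
}

func main() {
	ordered := prioritize([]Replica{
		{Hostname: "db-3", SemiSyncPriority: 1},
		{Hostname: "db-1", SemiSyncPriority: 0}, // async, never a candidate
		{Hostname: "db-2", SemiSyncPriority: 2, PromotionRank: 1},
	})
	for _, r := range ordered {
		fmt.Println(r.Hostname, r.SemiSyncPriority)
	}
}
```

With `rpl_semi_sync_master_wait_for_slave_count=1`, only the first entry of the resulting list would be expected to run as a semi-sync replica.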
**Example 1**: Enforcing a strict semi-sync replica topology with two replicas and `rpl_semi_sync_master_wait_for_slave_count=1`:
```
"DetectSemiSyncEnforcedQuery": "select priority from meta.semi_sync where cluster_member = @@hostname",
@@ -38,6 +38,7 @@ Observe the following list of potential failures:
* UnreachableMasterWithLaggingReplicas
* UnreachableMaster
* LockedSemiSyncMaster
* MasterWithTooManySemiSyncReplicas
* AllMasterReplicasNotReplicating
* AllMasterReplicasNotReplicatingOrDead
* DeadCoMaster
...
...
@@ -96,15 +97,43 @@ This scenario can happen when the master is overloaded. Clients would see a "Too
`orchestrator` responds to this scenario by restarting replication on all of master's immediate replicas. This will close the old client connections on those replicas and attempt to initiate new ones. These may now fail to connect, leading to a complete replication failure on all replicas. This will next lead `orchestrator` to analyze a `DeadMaster`.
### LockedSemiSyncMaster
#### `LockedSemiSyncMaster`
1. Master is running with semi-sync enabled
1. Master is running with semi-sync enabled (`rpl_semi_sync_master_enabled=1`)
2. Number of connected semi-sync replicas falls short of expected `rpl_semi_sync_master_wait_for_slave_count`
3. `rpl_semi_sync_master_timeout` is high enough such that master locks writes and does not fall back to asynchronous replication
Remediation can be to disable semi-sync on the master, or to bring up (or enable) sufficient semi-sync replicas.
This condition only triggers after `ReasonableLockedSemiSyncMasterSeconds` has passed. If `ReasonableLockedSemiSyncMasterSeconds` is not set,
it triggers after `ReasonableReplicationLagSeconds`.
At this time `orchestrator` does not invoke processes for this type of analysis.
Remediation of this condition can be to disable semi-sync on the master, or to bring up (or enable) sufficient semi-sync replicas.
If `EnforceExactSemiSyncReplicas` is enabled, `orchestrator` will determine the desired semi-sync topology and enable/disable semi-sync on the replicas to match it.
The desired topology is defined by the priority order (see below) and the master wait count.
If `RecoverLockedSemiSyncMaster` is enabled, `orchestrator` will enable (but never disable) semi-sync on the replicas in priority order until
the number of semi-sync replicas matches the master wait count. Please note that `RecoverLockedSemiSyncMaster` has no effect if `EnforceExactSemiSyncReplicas` is set.
The priority order is defined by `DetectSemiSyncEnforcedQuery` (higher number is higher priority), the promotion rule (`DetectPromotionRuleQuery`) and the hostname (fallback).
If `EnforceExactSemiSyncReplicas` and `RecoverLockedSemiSyncMaster` are both disabled (default), `orchestrator` does not invoke any recovery processes for this type of analysis.
Please also consult the [semi-sync topology](configuration-discovery-classifying.md#semi-sync-topology) documentation for more details.
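
How the two flags and the timing option interact can be summarized in a small Go sketch. The `SemiSyncConfig` struct, the `RecoveryMode` type and both helper functions are hypothetical names introduced for illustration only; the sketch also assumes that "not set" means a zero value.

```go
package main

import "fmt"

// SemiSyncConfig holds only the options discussed above. It is a
// hypothetical stand-in, not orchestrator's Configuration struct.
type SemiSyncConfig struct {
	EnforceExactSemiSyncReplicas          bool
	RecoverLockedSemiSyncMaster           bool
	ReasonableLockedSemiSyncMasterSeconds uint
	ReasonableReplicationLagSeconds       uint
}

// RecoveryMode names the three possible behaviors (hypothetical type).
type RecoveryMode int

const (
	NoRecovery    RecoveryMode = iota // default: analysis only
	EnableOnly                        // enable, but never disable, semi-sync
	ExactTopology                     // enable and disable to match exactly
)

// lockedSemiSyncMasterMode applies the precedence described above:
// EnforceExactSemiSyncReplicas wins over RecoverLockedSemiSyncMaster.
func lockedSemiSyncMasterMode(c SemiSyncConfig) RecoveryMode {
	switch {
	case c.EnforceExactSemiSyncReplicas:
		return ExactTopology
	case c.RecoverLockedSemiSyncMaster:
		return EnableOnly
	default:
		return NoRecovery
	}
}

// lockedSemiSyncMasterThreshold returns the number of seconds to wait
// before the condition triggers, assuming zero means "not set".
func lockedSemiSyncMasterThreshold(c SemiSyncConfig) uint {
	if c.ReasonableLockedSemiSyncMasterSeconds > 0 {
		return c.ReasonableLockedSemiSyncMasterSeconds
	}
	return c.ReasonableReplicationLagSeconds
}

func main() {
	c := SemiSyncConfig{
		EnforceExactSemiSyncReplicas:    true,
		RecoverLockedSemiSyncMaster:     true,
		ReasonableReplicationLagSeconds: 10,
	}
	fmt.Println(lockedSemiSyncMasterMode(c) == ExactTopology)
	fmt.Println(lockedSemiSyncMasterThreshold(c))
}
```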
#### `MasterWithTooManySemiSyncReplicas`
1. Master is running with semi-sync enabled (`rpl_semi_sync_master_enabled=1`)
2. Number of connected semi-sync replicas is higher than the expected `rpl_semi_sync_master_wait_for_slave_count`
3. `EnforceExactSemiSyncReplicas` is enabled (this analysis is not triggered if this flag is not enabled)
If `EnforceExactSemiSyncReplicas` is enabled, `orchestrator` will determine the desired semi-sync topology and enable/disable semi-sync on the replicas to match it.
The desired topology is defined by the priority order and the master wait count.
The priority order is defined by `DetectSemiSyncEnforcedQuery` (higher number is higher priority), the promotion rule (`DetectPromotionRuleQuery`) and the hostname (fallback).
If `EnforceExactSemiSyncReplicas` is disabled (default), `orchestrator` does not invoke any recovery processes for this type of analysis.
Please also consult the [semi-sync topology](configuration-discovery-classifying.md#semi-sync-topology) documentation for more details.
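
As a rough illustration of how a desired topology could be derived from the priority order and the master wait count, the following Go sketch computes which replicas to enable and which to disable. It is a simplified model under stated assumptions, not the code `orchestrator` runs: the `Replica` type and `planActions` function are hypothetical, the input is assumed to be already sorted by priority (highest first), and async (zero-priority) replicas are assumed to have been filtered out beforehand.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified view of a candidate replica.
type Replica struct {
	Hostname        string
	SemiSyncEnabled bool // current state of the replica's semi-sync flag
}

// planActions returns the hostnames on which semi-sync should be enabled
// and disabled. ordered must be sorted by priority, highest first.
// With exact=true the first waitCount replicas end up enabled and all
// others disabled; with exact=false replicas are only ever enabled.
func planActions(ordered []Replica, waitCount int, exact bool) (enable, disable []string) {
	if exact {
		for i, r := range ordered {
			wanted := i < waitCount
			switch {
			case wanted && !r.SemiSyncEnabled:
				enable = append(enable, r.Hostname)
			case !wanted && r.SemiSyncEnabled:
				disable = append(disable, r.Hostname)
			}
		}
		return enable, disable
	}

	// Enable-only mode: count what is already enabled, then top up in
	// priority order until the wait count is reached.
	enabled := 0
	for _, r := range ordered {
		if r.SemiSyncEnabled {
			enabled++
		}
	}
	for _, r := range ordered {
		if enabled >= waitCount {
			break
		}
		if !r.SemiSyncEnabled {
			enable = append(enable, r.Hostname)
			enabled++
		}
	}
	return enable, nil
}

func main() {
	ordered := []Replica{
		{Hostname: "db-1", SemiSyncEnabled: false},
		{Hostname: "db-2", SemiSyncEnabled: true},
		{Hostname: "db-3", SemiSyncEnabled: true},
	}
	enable, disable := planActions(ordered, 1, true)
	fmt.Println("enable:", enable, "disable:", disable)
}
```

Applying the `enable` list before the `disable` list keeps the number of semi-sync replicas from dropping below the wait count while the changes are in flight.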
	DefaultInstancePort int // In case port was not specified on command line
	SlaveLagQuery string // Synonym to ReplicationLagQuery
	ReplicationLagQuery string // custom query to check on replica lag (e.g. heartbeat table). Must return a single row with a single numeric column, which is the lag.
	ReplicationCredentialsQuery string // custom query to get replication credentials. Must return a single row, with two text columns: 1st is username, 2nd is password. This is optional, and can be used by orchestrator to configure replication after master takeover or setup of co-masters. You need to ensure the orchestrator user has the privileges to run this query
	ReplicationCredentialsQuery string // custom query to get replication credentials. Must return a single row, with five text columns: 1st is username, 2nd is password, 3rd is SSLCaCert, 4th is SSLCert, 5th is SSLKey. This is optional, and can be used by orchestrator to configure replication after master takeover or setup of co-masters. You need to ensure the orchestrator user has the privileges to run this query
	DiscoverByShowSlaveHosts bool // Attempt SHOW SLAVE HOSTS before PROCESSLIST
	UseSuperReadOnly bool // Should orchestrator super_read_only any time it sets read_only
	InstancePollSeconds uint // Number of seconds between instance reads
	URLPrefix string // URL prefix to run orchestrator on non-root web path, e.g. /orchestrator to put it behind nginx.
	DiscoveryIgnoreReplicaHostnameFilters []string // Regexp filters to apply to prevent auto-discovering new replicas. Usage: unreachable servers due to firewalls, applications which trigger binlog dumps
	DiscoveryIgnoreMasterHostnameFilters []string // Regexp filters to apply to prevent auto-discovering a master. Usage: pointing your master temporarily to replicate seom data from external host
	DiscoveryIgnoreMasterHostnameFilters []string // Regexp filters to apply to prevent auto-discovering a master. Usage: pointing your master temporarily to replicate some data from external host
	DiscoveryIgnoreHostnameFilters []string // Regexp filters to apply to prevent discovering instances of any kind
	ConsulAddress string // Address where Consul HTTP api is found. Example: 127.0.0.1:8500
	ConsulScheme string // Scheme (http or https) for Consul
...
...
@@ -275,6 +275,9 @@ type Configuration struct {
	KVClusterMasterPrefix string // Prefix to use for clusters' masters entries in KV stores (internal, consul, ZK), default: "mysql/master"
	WebMessage string // If provided, will be shown on all web pages below the title bar
	MaxConcurrentReplicaOperations int // Maximum number of concurrent operations on replicas
	EnforceExactSemiSyncReplicas bool // If true, semi-sync replicas will be enabled/disabled to match the wait count in the desired priority order; this applies to LockedSemiSyncMaster and MasterWithTooManySemiSyncReplicas
	RecoverLockedSemiSyncMaster bool // If true, orchestrator will recover from a LockedSemiSync state by enabling semi-sync on replicas to match the wait count; this behavior can be overridden by EnforceExactSemiSyncReplicas
	ReasonableLockedSemiSyncMasterSeconds uint // Time to evaluate the LockedSemiSyncHypothesis before triggering the LockedSemiSync analysis; falls back to ReasonableReplicationLagSeconds if not set
}
// ToJSONString will marshal this configuration as JSON
// classifyAndPrioritizeReplicas takes a list of replica instances and classifies them based on their semi-sync priority, excluding replicas
// that are down. The function furthermore prioritizes the possible semi-sync replicas based on SemiSyncPriority, PromotionRule and hostname (fallback).
	return nil, fmt.Errorf("RecoverDeadMaster: failed promotion. FailMasterPromotionOnLagMinutes is set to %d (minutes) and promoted replica %+v 's lag is %d (seconds)", config.Config.FailMasterPromotionOnLagMinutes, promotedReplica.Key, promotedReplica.ReplicationLagSeconds.Int64)
	return nil, fmt.Errorf("RecoverDeadMaster: failed promotion. FailMasterPromotionIfSQLThreadNotUpToDate is set and promoted replica %+v 's sql thread is not up to date (relay logs still unapplied). Aborting promotion", promotedReplica.Key)
	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another RecoverLockedSemiSyncMaster.", analysisEntry.AnalyzedInstanceKey))
	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("no action taken to recover locked semi sync master on %+v. Enable RecoverLockedSemiSyncMaster or EnforceExactSemiSyncReplicas to change this behavior.", analysisEntry.AnalyzedInstanceKey))
		return false, nil, err
	}
	return false, nil, nil
// checkAndRecoverMasterWithTooManySemiSyncReplicas registers and performs a recovery for MasterWithTooManySemiSyncReplicas
	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another RecoverMasterWithTooManySemiSyncReplicas.", analysisEntry.AnalyzedInstanceKey))
// recoverSemiSyncReplicas analyzes the replica topology for the given master and applies changes to repair it. If exactReplicaTopology is set, it will enable/disable the semi-sync enabled
// variable (rpl_semi_sync_replica_enabled) of the replicas depending on their semi-sync priority and promotion rule. If exactReplicaTopology is not set, the function will only ever
// enable semi-sync on replicas and never disable it. The exactReplicaTopology argument typically corresponds to the EnforceExactSemiSyncReplicas config variable.
		AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("semi-sync: cannot determine actions based on possible semi-sync replicas; cannot recover on %+v", &analysisEntry.AnalyzedInstanceKey))
		return true, topologyRecovery, log.Errorf("cannot determine actions based on possible semi-sync replicas; cannot recover on %+v", &analysisEntry.AnalyzedInstanceKey)
	}
	// Disable semi-sync master on all replicas; this is to avoid semi-sync failures on the replicas (rpl_semi_sync_master_no_tx)
	// and to make it consistent with the logic in SetReadOnly
	for _, replica := range replicas {
		inst.MaybeDisableSemiSyncMaster(replica) // it's okay if this fails
	}
	// Take action: we first enable and then disable (two loops) in order to avoid "locked master" scenarios