# Alarm
Alarm core is driven by a collection of rules, which are defined in `config/alarm-settings.yml`.
The alarm definition has three parts.

1. [Alarm rules](#rules). They define how a metrics alarm is triggered and which conditions are considered.
1. [Webhooks](#webhook). The list of web service endpoints that are called after an alarm is triggered.
1. [gRPCHook](#gRPCHook). The host and port of the remote gRPC method that is called after an alarm is triggered.

## Entity name
The entity name depends on the scope, as follows.
- **Service**: Service name
- **Instance**: {Instance name} of {Service name}
- **Endpoint**: {Endpoint name} in {Service name}
- **Database**: Database service name
- **Service Relation**: {Source service name} to {Dest service name}
- **Instance Relation**: {Source instance name} of {Source service name} to {Dest instance name} of {Dest service name}
- **Endpoint Relation**: {Source endpoint name} in {Source Service name} to {Dest endpoint name} in {Dest service name}
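
The templates above can be sketched as a small formatting helper. This is only an illustration: the function `entity_name` and its keyword names (`service`, `source_service`, and so on) are hypothetical and not part of SkyWalking.

```python
# Hypothetical helper: builds an entity name from the scope templates listed above.
TEMPLATES = {
    "Service": "{service}",
    "Instance": "{instance} of {service}",
    "Endpoint": "{endpoint} in {service}",
    "Database": "{service}",
    "Service Relation": "{source_service} to {dest_service}",
    "Instance Relation": (
        "{source_instance} of {source_service} to "
        "{dest_instance} of {dest_service}"
    ),
    "Endpoint Relation": (
        "{source_endpoint} in {source_service} to "
        "{dest_endpoint} in {dest_service}"
    ),
}


def entity_name(scope: str, **parts: str) -> str:
    """Return the entity name for the given scope, e.g. for include-names."""
    return TEMPLATES[scope].format(**parts)
```

For example, `entity_name("Endpoint", endpoint="/users", service="gateway")` yields the string to put into `include-names` for that endpoint.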

## Rules
An alarm rule consists of the following keys:
- **Rule name**. A unique name, shown in the alarm message. It must end with `_rule`.
- **Metrics name**. The metrics name as defined in the OAL script. Only long, double, and int types are supported. See
[List of all potential metrics name](#list-of-all-potential-metrics-name).
- **Include names**. The entity names included in this rule. Please follow the [entity name definition](#entity-name).
- **Exclude names**. The entity names excluded from this rule. Please follow the [entity name definition](#entity-name).
- **Include names regex**. A regex that selects the entity names to include. If both the include name list and the include name regex are set, both take effect.
- **Exclude names regex**. A regex that selects the entity names to exclude. If both the exclude name list and the exclude name regex are set, both take effect.
- **Threshold**. The target value.
For multiple-value metrics, such as **percentile**, the threshold is an array, written as `value1, value2, value3, value4, value5`.
Each value is the threshold for the corresponding value of the metrics. Set a value to `-` if that value should not trigger the alarm.
For example, in **percentile**, `value1` is the threshold of P50, and `-, -, value3, value4, value5` means there is no threshold for P50 and P75 in the percentile alarm rule.
- **OP**. The operator. `>`, `>=`, `<`, `<=`, and `=` are supported. Contributions of further operators are welcome.
- **Period**. How long the alarm rule is checked. This is a time window, which follows the
backend deployment environment time.
- **Count**. Within the period window, if the number of **value**s that match the threshold (by OP) reaches count, the alarm
is sent.
- **Silence period**. After an alarm is triggered at time TN, it stays silent during **TN -> TN + period**.
By default, it is the same as **Period**, which means that within a period, the same alarm (same ID in the same
metrics name) is triggered only once.
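
How **Period**, **Count**, and **Silence period** interact can be sketched as below. This is a minimal model, not SkyWalking's implementation: it assumes one metrics value arrives per check and that the period and the silence period are both counted in checks. The class name `RuleWindow` and its method `check` are hypothetical.

```python
import operator
from collections import deque

# The operators listed under OP.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


class RuleWindow:
    """Hypothetical model of a single-value alarm rule."""

    def __init__(self, threshold, op, period, count, silence_period=None):
        self.threshold = threshold
        self.matches = OPS[op]
        self.count = count
        # By default the silence period equals the period.
        self.silence_period = period if silence_period is None else silence_period
        # Keeps only the values of the last `period` checks.
        self.values = deque(maxlen=period)
        self.silence_left = 0

    def check(self, value):
        """Record the newest value; return True if an alarm should be sent."""
        self.values.append(value)
        if self.silence_left > 0:
            # Still silent after a previous alarm.
            self.silence_left -= 1
            return False
        matched = sum(1 for v in self.values if self.matches(v, self.threshold))
        if matched >= self.count:
            self.silence_left = self.silence_period
            return True
        return False
```

With `RuleWindow(threshold=1000, op=">", period=5, count=3, silence_period=2)`, the alarm fires on the check where the third matching value enters the window, and the next two checks stay silent even if they also match.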


```yaml
rules:
  # Unique rule name; it must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value must be a long, double, or int.
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before the alarm is triggered
    count: 3
    # How many checks the alarm stays silent after it is triggered; defaults to the same value as period.
    silence-period: 10
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] By default, all services in this metrics are matched.
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single value metrics threshold.
    threshold: 85
    op: <
    period: 10
    count: 4
  service_resp_time_percentile_rule:
    # The metrics value must be a long, double, or int.
    metrics-name: service_percentile
    op: ">"
    # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
```
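
The multiple-value threshold syntax, including the `-` placeholder, can be illustrated with a short sketch. The helpers `parse_thresholds` and `matched_indexes` are hypothetical names. The sketch only reports which positions match and does not assume how SkyWalking combines them into one alarm decision.

```python
import operator

# The operators listed under OP.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def parse_thresholds(raw):
    """Parse 'v1, v2, -' into a list; '-' becomes None (no threshold)."""
    return [
        None if part.strip() == "-" else float(part)
        for part in str(raw).split(",")
    ]


def matched_indexes(values, thresholds, op):
    """Return the positions whose value matches its threshold by `op`."""
    compare = OPS[op]
    return [
        i
        for i, (value, limit) in enumerate(zip(values, thresholds))
        if limit is not None and compare(value, limit)
    ]
```

For a percentile metrics with values for P50, P75, P90, P95, and P99, a threshold of `-, -, 1000, 1000, 1000` means positions 0 and 1 can never match, whatever their values are.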

### Default alarm rules
We provide a default `alarm-settings.yml` in our distribution only for convenience, which includes the following rules:
1. Service average response time over 1s in the last 3 minutes.
1. Service success rate lower than 80% in the last 2 minutes.
1. Percentile of service response time over 1s in the last 3 minutes.
1. Service instance average response time over 1s in the last 2 minutes, and the instance name matches the regex.
1. Endpoint average response time over 1s in the last 2 minutes.
1. Database access average response time over 1s in the last 2 minutes.
1. Endpoint relation average response time over 1s in the last 2 minutes.

### List of all potential metrics name
The metrics names are defined in the official [OAL scripts](../../guides/backend-oal-scripts.md). Currently,
metrics from the **Service**, **Service Instance**, **Endpoint**, **Service Relation**, **Service Instance Relation**, and **Endpoint Relation** scopes can be used in alarms, and **Database access** is treated the same as the **Service** scope.

Submit an issue or a pull request if you want alarms to support any other scope.

## Webhook
The webhook requires the peer to be a web container. The alarm message is sent through an HTTP POST with the `application/json` content type. The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with the following key information.
- **scopeId**, **scope**. All scopes are defined in `org.apache.skywalking.oap.server.core.source.DefaultScopeDefine`.
- **name**. Target scope entity name. Please follow the [entity name definition](#entity-name).
- **id0**. The ID of the scope entity that matches the name. When using a relation scope, it is the source entity ID.
- **id1**. When using a relation scope, it is the destination entity ID. Otherwise, it is empty.
- **ruleName**. The rule name you configured in `alarm-settings.yml`.
- **alarmMessage**. Alarm text message.
- **startTime**. Alarm time in milliseconds since midnight, January 1, 1970 UTC.

An example follows:
```json
[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": "12",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage xxxx",
    "startTime": 1560524171000
}, {
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceB",
    "id0": "23",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage yyy",
    "startTime": 1560524171000
}]
```
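
A receiving endpoint for this payload could look like the sketch below, which uses only the Python standard library. It is an illustration, not a reference receiver: the names `parse_alarms`, `AlarmHandler`, and `serve`, as well as port 8080, are assumptions made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_alarms(body: bytes) -> list:
    """Decode the JSON array posted by the alarm webhook into a list of dicts."""
    alarms = json.loads(body.decode("utf-8"))
    if not isinstance(alarms, list):
        raise ValueError("expected a JSON array of alarm messages")
    return alarms


class AlarmHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: prints each alarm and answers 200, or 400 on bad input."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alarms = parse_alarms(self.rfile.read(length))
        except ValueError:
            # Covers invalid JSON, invalid UTF-8, and a non-array payload.
            self.send_response(400)
            self.end_headers()
            return
        for alarm in alarms:
            print(alarm.get("ruleName"), alarm.get("name"), alarm.get("alarmMessage"))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), AlarmHandler).serve_forever()
```

Calling `serve()` starts the receiver. Its URL would then be listed as a webhook in `alarm-settings.yml`.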

## gRPCHook
The alarm message is sent through a remote gRPC method with the `Protobuf` content type.
The message format, with the following key information, is defined in `oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto`.

Part of the protocol looks as follows:
```protobuf
message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
}
```

## Update the settings dynamically
Since 6.5.0, the alarm settings can be updated dynamically at runtime by [Dynamic Configuration](dynamic-config.md),
which overrides the settings in `alarm-settings.yml`.

To determine whether an alarm rule is triggered, SkyWalking needs to cache the metrics of a time window for
each alarm rule. If any attribute (`metrics-name`, `op`, `threshold`, `period`, `count`, etc.) of a rule is changed,
the sliding window is destroyed and re-created, causing the alarm evaluation of this specific rule to start over.
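
The effect of changing a rule attribute can be pictured with the following sketch. It is a simplified model, not SkyWalking's code: `WindowRegistry` and `window_for` are hypothetical names, and the window is represented by a plain list of cached values.

```python
class WindowRegistry:
    """Hypothetical registry: one cached window per rule, reset when settings change."""

    def __init__(self):
        # rule name -> (settings snapshot, cached window values)
        self._entries = {}

    def window_for(self, rule_name, settings):
        """Return the rule's window, re-creating it if any attribute differs."""
        entry = self._entries.get(rule_name)
        if entry is None or entry[0] != settings:
            # New rule or changed attributes: drop the cached values and start over.
            entry = (dict(settings), [])
            self._entries[rule_name] = entry
        return entry[1]
```

Reloading identical settings keeps the cached values, while changing even one attribute, such as `threshold`, returns a fresh, empty window for that rule only.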