# Alarm
Alarm core is driven by a collection of rules, which are defined in `config/alarm-settings.yml`.
The alarm definition consists of three parts.

1. [Alarm rules](#rules). They define how a metrics alarm is triggered and which conditions are considered.
1. [Webhooks](#webhook). The list of web service endpoints to call after an alarm is triggered.
1. [gRPCHook](#gRPCHook). The host and port of the remote gRPC method to call after an alarm is triggered.

## Rules
An alarm rule consists of the following keys:
- **Rule name**. A unique name, shown in the alarm message. Must end with `_rule`.
- **Metrics name**. The metrics name as defined in the OAL script. Only long, double, and int types are supported. See
[List of all potential metrics name](#list-of-all-potential-metrics-name).
- **Include names**. The entity names included in this rule, such as service names or endpoint names.
- **Exclude names**. The entity names excluded from this rule, such as service names or endpoint names.
- **Threshold**. The target value.
For metrics with multiple values, such as **percentile**, the threshold is an array, written as `value1, value2, value3, value4, value5`.
Each entry is the threshold for the corresponding value of the metrics. Set an entry to `-` if you don't want that value to trigger the alarm.  
For example, in **percentile**, `value1` is the threshold of P50, and `-, -, value3, value4, value5` means that there is no threshold for P50 and P75 in the percentile alarm rule.
- **OP**. The operator. `>`, `>=`, `<`, `<=`, and `=` are supported. Contributions of additional operators are welcome.
- **Period**. The length of the time window over which the alarm rule is checked. The window follows the clock of the
backend deployment environment.
- **Count**. Within the period window, if the number of **value**s that match the threshold (by OP) reaches count, an alarm
is sent.
- **Silence period**. After an alarm is triggered at Time-N (TN), it stays silent during **TN -> TN + period**.
By default, it is the same as **Period**, which means that within a period, the same alarm (same ID in the same
metrics name) is triggered only once.
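
The interaction of **Period**, **Count**, and **Silence period** can be illustrated with a small Python model. This is only a sketch of the semantics described above, not SkyWalking's implementation: the `RuleWindow` class is hypothetical, and it assumes exactly one metrics value arrives per check.

```python
import operator
from collections import deque

# The operators listed under **OP**.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


class RuleWindow:
    """Hypothetical helper: evaluates one rule for one entity, one value per check."""

    def __init__(self, threshold, op, period, count, silence_period=None):
        self.threshold = threshold
        self.matches = OPS[op]
        self.count = count
        # By default the silence period is the same as the period.
        self.silence_period = period if silence_period is None else silence_period
        self.values = deque(maxlen=period)  # keeps only the last `period` values
        self.silence_left = 0

    def check(self, value):
        """Add the newest value; return True if an alarm would be sent now."""
        self.values.append(value)
        matched = sum(1 for v in self.values if self.matches(v, self.threshold))
        if self.silence_left > 0:
            self.silence_left -= 1  # still silent after an earlier alarm
            return False
        if matched >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


window = RuleWindow(threshold=75, op="<", period=10, count=3, silence_period=10)
fired = [window.check(v) for v in [90, 60, 70, 50, 40, 30]]
print(fired)
```

In this model the alarm fires on the fourth check, when the third value below 75 enters the window, and the following checks stay silent even though the condition still holds.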


```yaml
rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value needs to be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before the alarm is triggered
    count: 3
    # How many checks the alarm stays silent after being triggered; defaults to the same value as period.
    silence-period: 10
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] By default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single value metrics threshold.
    threshold: 85
    op: <
    period: 10
    count: 4
  service_resp_time_percentile_rule:
    # The metrics value needs to be long, double or int
    metrics-name: service_percentile
    op: ">"
    # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
```
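
To make the multi-value threshold format concrete, the following Python sketch splits a percentile threshold string and reports which percentiles are over their limit. The helper names `parse_thresholds` and `exceeded` are hypothetical, and the comparison is hard-coded to `>`; it only illustrates the `-` placeholder convention and is not how SkyWalking parses the setting.

```python
# Order of the percentile values, as stated in the rule comment above.
LABELS = ["P50", "P75", "P90", "P95", "P99"]


def parse_thresholds(raw):
    """Split a multi-value threshold string; `-` means no threshold for that slot."""
    return [None if part.strip() == "-" else int(part)
            for part in str(raw).split(",")]


def exceeded(values, thresholds):
    """Return the labels of percentiles whose value is over its threshold (op `>`)."""
    return [label for label, value, limit in zip(LABELS, values, thresholds)
            if limit is not None and value > limit]


thresholds = parse_thresholds("-, -, 1000, 1000, 1000")
print(thresholds)
print(exceeded([1200, 1100, 900, 1050, 2000], thresholds))
```

Here P50 and P75 are ignored because their slots are `-`, so only the percentiles with a configured limit can contribute to the alarm.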

### Default alarm rules
For convenience, the distribution provides a default `alarm-settings.yml`, which includes the following rules:
1. Service average response time over 1s in the last 3 minutes.
1. Service success rate lower than 80% in the last 2 minutes.
1. Percentile of service response time over 1s in the last 3 minutes.
1. Service Instance average response time over 1s in the last 2 minutes.
1. Endpoint average response time over 1s in the last 2 minutes.

### List of all potential metrics name
The metrics names are defined in the official [OAL scripts](../../guides/backend-oal-scripts.md). Currently,
metrics from the **Service**, **Service Instance**, and **Endpoint** scopes can be used in alarms. More scopes will be supported in future versions.

Submit an issue or a pull request if you want alarm support for any other scope.

## Webhook
Webhook requires the peer to be a web container. The alarm message is sent through HTTP POST with the `application/json` content type. The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with the following key information.
- **scopeId**, **scope**. All scopes are defined in `org.apache.skywalking.oap.server.core.source.DefaultScopeDefine`.
- **name**. Target scope entity name.
- **id0**. The ID of the scope entity, matching the name.
- **id1**. Not used today.
- **ruleName**. The rule name you configured in `alarm-settings.yml`.
- **alarmMessage**. Alarm text message.
- **startTime**. Alarm time in milliseconds since midnight, January 1, 1970 UTC.

An example follows:
```json
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceA",
  "id0": "12",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000
}, {
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceB",
  "id0": "23",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage yyy",
  "startTime": 1560524171000
}]
```
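
On the receiving side, a webhook endpoint only needs to accept the POST and decode the JSON array. The sketch below uses Python's standard `http.server` module. The names `parse_alarms`, `AlarmHandler`, and `serve`, the port `8080`, and the plain `200` response are all assumptions made for illustration; this document does not specify what response the backend expects.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_alarms(body):
    """Decode the JSON array and add startTime (epoch millis) as a UTC datetime."""
    alarms = json.loads(body)
    for alarm in alarms:
        alarm["startTimeUtc"] = datetime.fromtimestamp(
            alarm["startTime"] / 1000, tz=timezone.utc)
    return alarms


class AlarmHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver that logs each alarm in the posted list."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for alarm in parse_alarms(self.rfile.read(length)):
            print(alarm["ruleName"], alarm["name"], alarm["alarmMessage"])
        self.send_response(200)  # assumed acknowledgement
        self.end_headers()


def serve(port=8080):
    """Hypothetical entry point; the port is an arbitrary choice."""
    HTTPServer(("", port), AlarmHandler).serve_forever()


# Parse a payload shaped like the example above without starting a server.
sample = (b'[{"scopeId": 1, "scope": "SERVICE", "name": "serviceA", '
          b'"id0": "12", "id1": "", "ruleName": "service_resp_time_rule", '
          b'"alarmMessage": "alarmMessage xxxx", "startTime": 1560524171000}]')
alarms = parse_alarms(sample)
print(alarms[0]["name"], alarms[0]["startTimeUtc"].isoformat())
```

Calling `serve()` starts the listener; its URL would then be the one listed in the webhook section of `alarm-settings.yml`.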

## gRPCHook
The alarm message is sent through a remote gRPC method with the `Protobuf` content type.
The message format, with the following key information, is defined in `oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto`.

Part of the protocol looks like the following:
```protobuf
message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
}
```

## Update the settings dynamically
Since 6.5.0, the alarm settings can be updated dynamically at runtime by [Dynamic Configuration](dynamic-config.md),
which will override the settings in `alarm-settings.yml`.

To determine whether an alarm rule is triggered, SkyWalking needs to cache the metrics of a time window for
each alarm rule. If any attribute (`metrics-name`, `op`, `threshold`, `period`, `count`, etc.) of a rule is changed,
the sliding window is destroyed and re-created, causing the alarm of this specific rule to start over.
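
The effect of a dynamic update on the cached window can be pictured with a short Python sketch. The `RuleState` class is hypothetical and only models the behavior described above (an unchanged rule keeps its window, a changed rule starts with an empty one); it is not taken from SkyWalking's code.

```python
from collections import deque


class RuleState:
    """Hypothetical holder for one rule's settings and its cached window."""

    def __init__(self, settings):
        self.settings = dict(settings)
        self.window = deque(maxlen=settings["period"])

    def apply(self, new_settings):
        """Re-create the window if any attribute changed; return True if reset."""
        if new_settings == self.settings:
            return False  # an unchanged rule keeps its cached metrics
        self.settings = dict(new_settings)
        self.window = deque(maxlen=new_settings["period"])
        return True


rule = {"metrics-name": "service_percent", "op": "<",
        "threshold": 85, "period": 10, "count": 4}
state = RuleState(rule)
state.window.extend([80, 82, 90])

# Re-applying identical settings keeps the three cached values.
print(state.apply(dict(rule)), len(state.window))
# Changing the threshold discards them, so counting starts over.
print(state.apply({**rule, "threshold": 80}), len(state.window))
```

A practical consequence is that a rule edited shortly before it would have fired needs to accumulate matching values again before it can trigger.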