# Alarm
Alarm core is driven by a collection of rules, which are defined in `config/alarm-settings.yml`.
There are two parts in the alarm rule definition.

1. [Alarm rules](#rules). They define how a metrics alarm should be triggered and which conditions should be considered.
1. [Webhooks](#webhook). The list of web service endpoints that should be called after an alarm is triggered.

## Rules
An alarm rule consists of the following keys:
- **Rule name**. A unique name, shown in the alarm message. It must end with `_rule`.
- **Metrics name**. Also known as the metrics name in the OAL script. Only long, double, and int types are supported. See
[List of all potential metrics name](#list-of-all-potential-metrics-name).
- **Include names**. The entity names that are included in this rule, such as service names or endpoint names.
- **Exclude names**. The entity names that are excluded from this rule, such as service names or endpoint names.
- **Threshold**. The target value.
- **OP**. The operator. `>`, `<`, and `=` are supported. Contributions of further operators are welcome.
- **Period**. How long the alarm rule should be checked. This is a time window, which follows the clock of the backend deployment environment.
- **Count**. Within the period window, if the number of **value**s that cross the threshold (as judged by OP) reaches count, an alarm is sent.
- **Silence period**. After an alarm is triggered at Time-N, no further alarm is sent during **TN -> TN + period**.
By default, it is the same as **Period**, which means the same alarm (same ID in the same
metrics name) is triggered only once within a period.


```yaml
rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value needs to be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before the alarm is triggered
    count: 3
    # How many checks the alarm stays silent after it is triggered. Defaults to the same value as period.
    silence-period: 10

  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    threshold: 85
    op: <
    period: 10
    count: 4
```
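
How `threshold`, `op`, `period`, `count`, and `silence-period` interact can be illustrated with a minimal Python sketch. It is not SkyWalking's implementation: the class `RuleWindow`, its `check` method, and the assumption that exactly one metrics value arrives per check are hypothetical and chosen only for illustration.

```python
from collections import deque
import operator

# The three operators listed in the rule keys above.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


class RuleWindow:
    """Hypothetical model of one alarm rule (assumption: one value per check)."""

    def __init__(self, threshold, op, period, count, silence_period=None):
        self.threshold = threshold
        self.matches = OPS[op]
        self.count = count
        # silence-period defaults to the same value as period.
        self.silence_period = period if silence_period is None else silence_period
        # Keep only the values that fall inside the period window.
        self.values = deque(maxlen=period)
        self.silence_left = 0

    def check(self, value):
        """Record one value and return True if an alarm should be sent."""
        self.values.append(value)
        if self.silence_left > 0:
            # Still inside the silence period: suppress the alarm.
            self.silence_left -= 1
            return False
        hits = sum(1 for v in self.values if self.matches(v, self.threshold))
        if hits >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# Same settings as endpoint_percent_rule above.
rule = RuleWindow(threshold=75, op="<", period=10, count=3)
results = [rule.check(v) for v in [80, 70, 60, 90, 50]]
print(results)  # [False, False, False, False, True]
```

In this sketch the alarm fires on the third value below 75 inside the window and is then suppressed for the next `silence-period` checks.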

### Default alarm rules
We provide a default `alarm-settings.yml` in our distribution only for convenience. It includes the following rules:
1. Service average response time over 1s in the last 3 minutes.
1. Service success rate lower than 80% in the last 2 minutes.
1. Service 90% response time over 1s in the last 3 minutes.
1. Service Instance average response time over 1s in the last 2 minutes.
1. Endpoint average response time over 1s in the last 2 minutes.

### List of all potential metrics name
The metrics names are defined in the official [OAL scripts](../../guides/backend-oal-scripts.md). Right now,
metrics from the **Service**, **Service Instance**, and **Endpoint** scopes can be used in alarms. We will extend this in future versions.

Submit an issue or a pull request if you want any other scope to be supported in alarms.

## Webhook
Webhook requires the peer to be a web container. The alarm message is sent through an HTTP POST with the `application/json` content type. The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with the following key information.
- **scopeId**, **scope**. All scopes are defined in `org.apache.skywalking.oap.server.core.source.DefaultScopeDefine`.
- **name**. Target scope entity name.
- **id0**. The ID of the scope entity, matching the name.
- **id1**. Not used today.
- **ruleName**. The rule name you configured in `alarm-settings.yml`.
- **alarmMessage**. Alarm text message.
- **startTime**. Alarm time, measured in milliseconds between the current time and midnight, January 1, 1970 UTC.

An example follows:
```json
[{
	"scopeId": 1,
	"scope": "SERVICE",
	"name": "serviceA",
	"id0": 12,
	"id1": 0,
	"ruleName": "service_resp_time_rule",
	"alarmMessage": "alarmMessage xxxx",
	"startTime": 1560524171000
}, {
	"scopeId": 1,
	"scope": "SERVICE",
	"name": "serviceB",
	"id0": 23,
	"id1": 0,
	"ruleName": "service_resp_time_rule",
	"alarmMessage": "alarmMessage yyy",
	"startTime": 1560524171000
}]
```
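
A receiving endpoint for this payload could look roughly like the following Python sketch, which uses only the standard library. The names `summarize`, `AlarmHandler`, and `serve`, as well as port `8080`, are assumptions made for illustration; SkyWalking only requires that the peer accepts an HTTP POST with a JSON body.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(body):
    """Turn the posted JSON list of alarm messages into one text line each."""
    lines = []
    for msg in json.loads(body):
        # startTime is in milliseconds since midnight, January 1, 1970 UTC.
        started = datetime.fromtimestamp(msg["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f'{started.isoformat()} [{msg["scope"]}] {msg["name"]} '
            f'{msg["ruleName"]}: {msg["alarmMessage"]}'
        )
    return lines


class AlarmHandler(BaseHTTPRequestHandler):
    """Hypothetical webhook handler: prints every received alarm message."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Block and serve the webhook; the port is an arbitrary choice."""
    HTTPServer(("", port), AlarmHandler).serve_forever()
```

Calling `serve()` starts the listener; its URL would then be the address configured as a webhook in `alarm-settings.yml`.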

## Update the settings dynamically
Since 6.5.0, the alarm settings can be updated dynamically at runtime through [Dynamic Configuration](dynamic-config.md),
which overrides the settings in `alarm-settings.yml`.

To determine whether an alarm rule is triggered, SkyWalking needs to cache the metrics of a time window for
each alarm rule. If any attribute (`metrics-name`, `op`, `threshold`, `period`, `count`, etc.) of a rule is changed,
the sliding window is destroyed and re-created, causing the alarm of this specific rule to restart.
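
The effect of a rule change on the cached window can be pictured with a small Python sketch. This is an illustration under assumptions, not SkyWalking's code: `WindowRegistry` and `apply` are hypothetical names, and the window is modeled as a plain bounded queue of metrics values.

```python
from collections import deque


class WindowRegistry:
    """Hypothetical cache of one sliding window per alarm rule."""

    def __init__(self):
        self._rules = {}    # rule name -> last applied rule attributes
        self._windows = {}  # rule name -> cached metrics values

    def apply(self, name, rule):
        """Return the window for a rule, re-creating it if any attribute changed."""
        if self._rules.get(name) != rule:
            # New or changed rule: drop the old window and start empty.
            self._rules[name] = dict(rule)
            self._windows[name] = deque(maxlen=rule["period"])
        return self._windows[name]


registry = WindowRegistry()
rule = {"metrics-name": "service_percent", "op": "<",
        "threshold": 85, "period": 10, "count": 4}

window = registry.apply("service_percent_rule", rule)
window.append(80)  # one cached metrics value

# Re-applying identical settings keeps the cached values.
same = registry.apply("service_percent_rule", rule)

# Changing any attribute discards them, so evaluation starts over.
fresh = registry.apply("service_percent_rule", dict(rule, threshold=90))
```

Here `same` still holds the cached value, while `fresh` is empty, which mirrors why an edited rule needs a full new window of data before it can trigger.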