Commit 576b8c14, author: chris-sun-star

update docs

Parent: de6adad6
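# What is OBAgent
OBAgent is a monitoring and data collection framework. It supports both push and pull collection modes, so it can serve different scenarios. The plugins that ship by default cover host data collection, OceanBase database metric collection, monitoring label processing, and an HTTP service that speaks the Prometheus protocol. To collect from other data sources or to customize the data processing flow, you only need to develop the corresponding plugin.
## Features
OBAgent has the following features:
- Written in Go, with no external dependencies.
- Plugin-driven and easy to extend.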
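# KV configuration file reference
This topic describes the configuration items in the KV configuration files and provides file templates for reference.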
```yaml
# Items marked encrypted: true must be stored encrypted. Only aes encryption is currently supported.
# Replace the variables in {} with your actual values. If the encryption method in the monagent startup configuration is aes, set these items to their encrypted values.
## Basic authentication
# monagent_basic_auth.yaml
configVersion: "2021-08-20T07:52:28.5443+08:00"
configs:
  - key: http.server.basic.auth.username
    value: {http_basic_auth_user}
    valueType: string
  - key: http.server.basic.auth.password
    value: {http_basic_auth_password}
    valueType: string
    encrypted: true
  - key: http.admin.basic.auth.username
    value: {pprof_basic_auth_user}
    valueType: string
  - key: http.admin.basic.auth.password
    value: {pprof_basic_auth_password}
    valueType: string
    encrypted: true
## Pipelines
# monagent_pipeline.yaml
configVersion: "2021-08-20T07:52:28.5443+08:00"
configs:
  - key: monagent.ob.monitor.user
    value: {monitor_user}
    valueType: string
  - key: monagent.ob.monitor.password
    value: {monitor_password}
    valueType: string
    encrypted: true
  - key: monagent.ob.sql.port
    value: {sql_port}
    valueType: int64
  - key: monagent.ob.rpc.port
    value: {rpc_port}
    valueType: int64
  - key: monagent.host.ip
    value: {host_ip}
    valueType: string
  - key: monagent.ob.cluster.name
    value: {cluster_name}
    valueType: string
  - key: monagent.ob.cluster.id
    value: {cluster_id}
    valueType: int64
  - key: monagent.ob.zone.name
    value: {zone_name}
    valueType: string
  - key: monagent.pipeline.ob.status
    value: {ob_monitor_status}
    valueType: string
  - key: monagent.pipeline.node.status
    value: {host_monitor_status}
    valueType: string
```
## Configuration templates
The KV configuration file templates are:
- monagent_basic_auth.yaml: KV items for basic authentication
- monagent_pipeline.yaml: KV items for the pipelines
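# monagent configuration file reference
This topic describes the configuration items in the monagent configuration file and provides file templates for reference.
A sample `monagent.yaml` configuration file: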
```yaml
## Logging
log:
  level: debug
  filename: log/monagent.log
  maxsize: 30
  maxage: 7
  maxbackups: 10
  localtime: true
  compress: true
## Process settings. address serves metric pulls and the management APIs by default; adminAddress is the pprof debugging port.
server:
  address: "0.0.0.0:8088"
  adminAddress: "0.0.0.0:8089"
  runDir: run
## Configuration settings. cryptoMethod supports aes and plain. With aes, the key in the file at cryptoPath is used to encrypt the items that require encryption.
## modulePath stores the configuration templates; propertiesPath stores the KV variable configuration.
cryptoMethod: plain
cryptoPath: conf/.config_secret.key
modulePath: conf/module_config
propertiesPath: conf/config_properties
```
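## Configuration templates
The monagent configuration file templates are listed below:
File name | Description
--- | ---
monagent_basic_auth.yaml | Basic authentication settings. Controls whether authentication is enabled or disabled for the two ports; the variables that disable it are {disable_http_basic_auth} and {disable_pprof_basic_auth}.
monagent_config.yaml | Settings for the configuration module. No changes needed.
monitor_node_host.yaml | Host monitoring pipeline template. No changes needed.
monitor_ob.yaml | OceanBase database monitoring pipeline template. No changes needed.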
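# Prometheus configuration file reference
This topic describes the configuration items in the Prometheus configuration file and provides a file template for reference.
## Prometheus configuration template
A sample `prometheus.yaml` configuration file: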
```yaml
# The OBAgent RPM package includes this Prometheus configuration template. Adjust it to your environment.
# To enable basic authentication, set {http_basic_auth_user} and {http_basic_auth_password}.
# Replace {target} with the host IP address and port.
# The rules directory contains two alert templates: the default host alerts and the default OceanBase database alerts. Use them as a reference when defining custom alerts.
# Global settings
global:
  # Scrape interval
  scrape_interval: 1s
  # Rule evaluation interval
  evaluation_interval: 10s
# Alerting rules
# Prometheus evaluates these rules and pushes alerts to Alertmanager.
rule_files:
  - "rules/*rules.yaml"
# Scrape settings
# Configures how Prometheus collects data.
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - "localhost:9090"
  - job_name: node
    basic_auth:
      username: { http_basic_auth_user }
      password: { http_basic_auth_password }
    metrics_path: /metrics/node/host
    scheme: http
    static_configs:
      - targets:
          - { target }
  - job_name: ob_basic
    basic_auth:
      username: { http_basic_auth_user }
      password: { http_basic_auth_password }
    metrics_path: /metrics/ob/basic
    scheme: http
    static_configs:
      - targets:
          - { target }
  - job_name: ob_extra
    basic_auth:
      username: { http_basic_auth_user }
      password: { http_basic_auth_password }
    metrics_path: /metrics/ob/extra
    scheme: http
    static_configs:
      - targets:
          - { target }
  - job_name: agent
    basic_auth:
      username: { http_basic_auth_user }
      password: { http_basic_auth_password }
    metrics_path: /metrics/stat
    scheme: http
    static_configs:
      - targets:
          - { target }
```
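# OBAgent development guide
OBAgent is a plugin-driven monitoring and collection framework. To extend OBAgent or customize the data processing flow, you can develop the corresponding plugin. A plugin only needs to implement the basic plugin interfaces and the interface for its plugin type.
## OBAgent data processing flow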
![Screenshot 2021-09-15 at 11.36.11.png](https://intranetproxy.alipay.com/skylark/lark/0/2021/png/28412/1631676986085-49f40134-9502-438b-bb32-5a3ee6591fbc.png#clientId=u89a16740-3189-4&from=ui&id=ucf4512dc&margin=%5Bobject%20Object%5D&name=Screenshot%202021-09-15%20at%2011.36.11.png&originHeight=868&originWidth=1638&originalType=binary&ratio=1&size=108483&status=done&style=none&taskId=uc6b3bf07-12ab-4e56-9568-bdab0001cc4)
The OBAgent data processing flow consists of data collection, processing, and reporting. It uses input plugins (Input), processor plugins (Processor), and output plugins (Output and Exporter). For details about the plugins, see [External plugins](#external-plugins).
## External plugins
OBAgent supports the following plugin types:
| Plugin type | Description |
| --- | --- |
| Input plugin (Input) | Collects time-series metrics, including various kinds of system and application information. |
| Processor plugin (Processor) | Processes data serially. |
| Output plugin (Output) | Push mode only. Pushes metrics data to a remote endpoint. |
| Output plugin (Exporter) | Pull mode only. Converts metrics to the target format and exposes the data through an HTTP service. |
### Input plugin interface
```go
type Input interface {
	Collect() ([]metric.Metric, error)
}
```
An input plugin implements the Collect method to gather data. It returns a slice of metrics and an error.
### Processor plugin interface
```go
type Processor interface {
	Process(metrics ...metric.Metric) ([]metric.Metric, error)
}
```
A processor plugin implements the Process method to transform data. It takes a set of metrics and returns a slice of metrics and an error.
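The following sketch shows one way a processor might attach extra labels to every metric, similar in spirit to the label processing mentioned earlier. The local `Metric` struct and the `tagProcessor` type are hypothetical stand-ins, not OBAgent's actual implementation.

```go
package main

import "fmt"

// Metric is a simplified local stand-in for OBAgent's metric.Metric (assumption).
type Metric struct {
	Name string
	Tags map[string]string
}

// Processor mirrors the interface above, using the local Metric type.
type Processor interface {
	Process(metrics ...Metric) ([]Metric, error)
}

// tagProcessor is a hypothetical processor that adds fixed tags to every metric.
type tagProcessor struct {
	extra map[string]string
}

// Process returns new metrics whose tags are the input tags plus the extra tags.
// The input metrics are left unmodified.
func (p *tagProcessor) Process(metrics ...Metric) ([]Metric, error) {
	out := make([]Metric, 0, len(metrics))
	for _, m := range metrics {
		tags := make(map[string]string, len(m.Tags)+len(p.extra))
		for k, v := range m.Tags {
			tags[k] = v
		}
		for k, v := range p.extra {
			tags[k] = v
		}
		out = append(out, Metric{Name: m.Name, Tags: tags})
	}
	return out, nil
}

func main() {
	var p Processor = &tagProcessor{extra: map[string]string{"ob_cluster_name": "obcluster"}}
	out, err := p.Process(Metric{Name: "node_load1", Tags: map[string]string{"svr_ip": "127.0.0.1"}})
	if err != nil {
		fmt.Println("process failed:", err)
		return
	}
	fmt.Println(out[0].Name, out[0].Tags)
}
```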
### Output plugin (Exporter) interface
```go
type Exporter interface {
	Export(metrics []metric.Metric) (*bytes.Buffer, error)
}
```
An output plugin of the Exporter type implements the Export method. It takes a slice of metrics and returns a byte buffer and an error.
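To illustrate the format conversion an exporter performs, here is a sketch that renders metrics as lines in the Prometheus text exposition style (`name{label="value"} value`). The single-valued local `Metric` struct and the `textExporter` type are assumptions for this example; it omits `# HELP` and `# TYPE` lines and is not OBAgent's own exporter.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Metric is a simplified, single-valued stand-in for OBAgent's metric.Metric (assumption).
type Metric struct {
	Name  string
	Value float64
	Tags  map[string]string
}

// Exporter mirrors the interface above, using the local Metric type.
type Exporter interface {
	Export(metrics []Metric) (*bytes.Buffer, error)
}

// labelEscaper escapes backslash, double quote, and newline in label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// textExporter is a hypothetical exporter that writes one sample per line.
type textExporter struct{}

func (textExporter) Export(metrics []Metric) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	for _, m := range metrics {
		// Sort label names so the output is deterministic.
		keys := make([]string, 0, len(m.Tags))
		for k := range m.Tags {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, k+`="`+labelEscaper.Replace(m.Tags[k])+`"`)
		}
		buf.WriteString(m.Name)
		if len(pairs) > 0 {
			buf.WriteString("{" + strings.Join(pairs, ",") + "}")
		}
		buf.WriteString(" " + strconv.FormatFloat(m.Value, 'g', -1, 64) + "\n")
	}
	return buf, nil
}

func main() {
	var e Exporter = textExporter{}
	buf, err := e.Export([]Metric{{
		Name:  "ob_active_session_num",
		Value: 3,
		Tags:  map[string]string{"svr_ip": "127.0.0.1", "obzone": "zone1"},
	}})
	if err != nil {
		fmt.Println("export failed:", err)
		return
	}
	fmt.Print(buf.String())
}
```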
### Output plugin (Output) interface
```go
type Output interface {
	Write(metrics []metric.Metric) error
}
```
An output plugin of the Output type implements the Write method. It takes a slice of metrics and returns an error.
## Metric interface
Data that flows through the OBAgent processing pipeline is represented by a single, unified Metric interface.
```go
type Metric interface {
	Clone() Metric
	SetName(name string)
	GetName() string
	SetTime(time time.Time)
	GetTime() time.Time
	SetMetricType(metricType Type)
	GetMetricType() Type
	Fields() map[string]interface{}
	Tags() map[string]string
}
```
## Basic plugin interfaces
Every OBAgent plugin must implement the following basic interfaces:
```go
// Initializer contains the Init function.
type Initializer interface {
	Init(config map[string]interface{}) error
}

// Closer contains the Close function.
type Closer interface {
	Close() error
}

// Describer contains SampleConfig and Description.
type Describer interface {
	SampleConfig() string
	Description() string
}
```
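The functions are described below:
Function | Description
--- | ---
Init | Initializes the plugin.
Close | Called when the plugin exits, to release resources.
SampleConfig | Returns a sample configuration for the plugin.
Description | Returns a description of the plugin.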
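The sketch below shows how a plugin might satisfy the three basic interfaces. The `Plugin` grouping interface, the `demoPlugin` type, and the `target` configuration key are hypothetical and only serve the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

type Initializer interface {
	Init(config map[string]interface{}) error
}

type Closer interface {
	Close() error
}

type Describer interface {
	SampleConfig() string
	Description() string
}

// Plugin groups the three basic interfaces for convenience (an assumption of this sketch).
type Plugin interface {
	Initializer
	Closer
	Describer
}

// demoPlugin is a hypothetical plugin that only remembers its target address.
type demoPlugin struct {
	target string
	closed bool
}

// Init validates the configuration and stores the "target" value.
func (p *demoPlugin) Init(config map[string]interface{}) error {
	target, ok := config["target"].(string)
	if !ok || target == "" {
		return errors.New(`demo plugin: config key "target" must be a non-empty string`)
	}
	p.target = target
	return nil
}

// Close marks the plugin as closed; a real plugin would release connections here.
func (p *demoPlugin) Close() error {
	p.closed = true
	return nil
}

func (p *demoPlugin) SampleConfig() string { return "target: 127.0.0.1:2881\n" }

func (p *demoPlugin) Description() string { return "demo plugin that records its target" }

func main() {
	var p Plugin = &demoPlugin{}
	if err := p.Init(map[string]interface{}{"target": "127.0.0.1:2881"}); err != nil {
		fmt.Println("init failed:", err)
		return
	}
	defer p.Close()
	fmt.Println(p.Description())
	fmt.Print(p.SampleConfig())
}
```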
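# Deploy OBAgent manually
OBAgent can be deployed with OBD or manually. To deploy it manually, you configure OBAgent, Prometheus, and, optionally, Prometheus Alertmanager. We recommend deploying OBAgent with OBD.
## Prerequisites
Before you deploy OBAgent, confirm the following:
- The OceanBase database service is deployed and running.
- OBAgent is installed. For details, see [Install OBAgent](install-obagent.md).
- The default OBAgent ports, 8088 and 8089, are not in use. You can also use custom ports.
## Procedure
Follow these steps to deploy OBAgent:
### Step 1: Deploy monagent
1. Edit the configuration files. For details, see [monagent configuration](../config-reference/monagent-config.md) and [KV configuration](../config-reference/kv-config.md).
2. Start the monagent process. We recommend starting it with Supervisor.
```bash
# Change to the OBAgent installation directory
cd /home/admin/obagent
# Start the monagent process
nohup ./bin/monagent -c conf/monagent.yaml >> ./log/monagent_stdout.log 2>&1 &
```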
```bash
# Sample Supervisor configuration
[program:monagent]
command=./bin/monagent -c conf/monagent.yaml
directory=/home/admin/obagent
autostart=true
autorestart=true
redirect_stderr=true
priority=10
stdout_logfile=log/monagent_stdout.log
```
### (Optional) Step 2: Deploy Prometheus
> Note: Prometheus must be installed.
1. Configure Prometheus. For details, see [Prometheus configuration file reference](../config-reference/prometheus-config.md).
2. Start Prometheus.
```bash
./prometheus --config.file=./prometheus.yaml
```
### (Optional) Step 3: Deploy Prometheus Alertmanager
- Download and extract Prometheus Alertmanager.
- Start Prometheus Alertmanager.
- Configure Prometheus Alertmanager. For more information, see the [Prometheus documentation](https://www.prometheus.io/docs/alerting/latest/configuration/).
OBAgent ships default alert rules in `conf/prometheus_config/rules`: `host_rules.yaml` holds the host alerts and `ob_rules.yaml` holds the OceanBase database alerts. If the defaults do not meet your needs, define custom alerts as follows:
```yaml
# Add the alert rules to the Prometheus configuration. Rule files must be placed in the rules directory and named to match *rules.yaml.
groups:
  - name: node-alert
    rules:
      - alert: disk-full
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"}) > 80
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} disk full "
          description: "{{ $labels.instance }} disk > {{ $value }} "
```
## (Optional) Update the KV configuration
OBAgent provides an API for updating its configuration. You can update KV items through the HTTP service:
```bash
# To update several KV items at once, list multiple key/value pairs.
curl --user user:pass -H "Content-Type:application/json" -d '{"configs":[{"key":"monagent.pipeline.ob.status", "value":"active"}]}' -L 'http://ip:port/api/v1/module/config/update'
```
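The same request can be built from Go. The sketch below mirrors the curl call above: a POST with basic authentication and a JSON body. The helper names (`kv`, `buildPayload`, `newUpdateRequest`) are hypothetical, and the address `127.0.0.1:8088` is only the default from the sample `monagent.yaml`; the example constructs the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// kv is one key/value pair in the update payload.
type kv struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// updateRequest matches the JSON body used by the curl example.
type updateRequest struct {
	Configs []kv `json:"configs"`
}

// buildPayload serializes the KV items into the request body.
func buildPayload(items []kv) ([]byte, error) {
	return json.Marshal(updateRequest{Configs: items})
}

// newUpdateRequest builds the POST request with basic auth and a JSON content type.
func newUpdateRequest(baseURL, user, pass string, items []kv) (*http.Request, error) {
	body, err := buildPayload(items)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/module/config/update", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	items := []kv{{Key: "monagent.pipeline.ob.status", Value: "active"}}
	req, err := newUpdateRequest("http://127.0.0.1:8088", "user", "pass", items)
	if err != nil {
		fmt.Println("build request failed:", err)
		return
	}
	// To actually send it, pass req to http.DefaultClient.Do and check the response status.
	fmt.Println(req.Method, req.URL.String())
}
```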
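# Deploy OBAgent with OBD
OBAgent can be deployed with OBD or manually. To deploy it manually, you configure OBAgent, Prometheus, and, optionally, Prometheus Alertmanager. We recommend deploying OBAgent with OBD.
## Prerequisites
Before you deploy OBAgent, confirm that the default OBAgent ports, 8088 and 8089, are not in use. You can also use custom ports.
> **Note**: If your machine can reach the public network, OBD checks whether the target machine has the OBAgent installation package after you run `obd cluster deploy`. If the package is missing, OBD fetches it from the yum repository automatically.
## Deploy an OceanBase cluster and OBAgent together
To deploy an OceanBase cluster and OBAgent at the same time, add the following OBAgent settings to the OceanBase database configuration file:
```yaml
obagent:
  servers:
    - 127.0.0.1
  depends:
    - oceanbase-ce
  global:
    home_path: /root/observer
```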
> **Notice**: The `servers` field must be the same as the `servers` field of oceanbase-ce.
For details, see the [configuration file example](https://github.com/oceanbase/obdeploy/blob/master/example/obagent/distributed-with-obproxy-and-obagent-example.yaml).
## Deploy OBAgent separately
OBD does not support adding new components to a cluster that is already deployed. To set up OBAgent for an existing cluster, deploy OBAgent separately. To do so, prepare an OBAgent configuration file and make sure the following fields match the OceanBase database:
```yaml
obagent:
  global:
    # Username for HTTP authentication. The default value is admin.
    http_basic_auth_user: admin
    # Password for HTTP authentication. The default value is root.
    http_basic_auth_password: root
    # Username for debug service. The default value is admin.
    pprof_basic_auth_user: admin
    # Password for debug service. The default value is root.
    pprof_basic_auth_password: root
    # The following settings must match those of the OceanBase database.
    # Monitor username for OceanBase Database. The user must have read access to OceanBase Database as a system tenant. The default value is root.
    monitor_user: root
    # Monitor password for OceanBase Database. The default value is empty. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the root_password in oceanbase-ce.
    monitor_password:
    # The SQL port for observer. The default value is 2881. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the mysql_port in oceanbase-ce.
    sql_port: 2881
    # The RPC port for observer. The default value is 2882. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the rpc_port in oceanbase-ce.
    rpc_port: 2882
    # Cluster name for OceanBase Database. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the appname in oceanbase-ce.
    cluster_name: obcluster
    # Cluster ID for OceanBase Database. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the cluster_id in oceanbase-ce.
    cluster_id: 1
    # Zone name for your observer. The default value is zone1. When a depends exists, OBD gets this value from the oceanbase-ce of the depends. The value is the same as the zone name in oceanbase-ce.
    zone_name: zone1
    # Monitor status for OceanBase Database. Active is to enable. Inactive is to disable. The default value is active. When you deploy a cluster automatically, OBD decides whether to enable this parameter based on depends.
```
For more information, see the [OBAgent configuration file](https://github.com/oceanbase/obdeploy/blob/master/example/obagent/obagent-only-example.yaml).
## Start OBAgent
Use the following commands to start OBAgent:
```bash
# Pass in the configuration
obd cluster deploy <deploy name> [-c <yaml path>] [-f] [-U] [-A]
# Start OBAgent
obd cluster start <deploy name> [flags]
```
For more information, see the [OBD documentation](https://github.com/oceanbase/obdeploy/blob/master/README-CN.md#obd-cluster-deploy).
## (Optional) Start Prometheus
> Note: Prometheus must be installed.
Run the following command to start Prometheus:
```bash
./prometheus --config.file=./prometheus.yaml
```
## (Optional) Deploy Prometheus Alertmanager
- Download and extract Prometheus Alertmanager.
- Start Prometheus Alertmanager.
- Configure Prometheus Alertmanager. For more information, see the [Prometheus documentation](https://www.prometheus.io/docs/alerting/latest/configuration/).
OBAgent ships default alert rules in `conf/prometheus_config/rules`: `host_rules.yaml` holds the host alerts and `ob_rules.yaml` holds the OceanBase database alerts. If the defaults do not meet your needs, define custom alerts as follows:
```yaml
# Add the alert rules to the Prometheus configuration. Rule files must be placed in the rules directory and named to match *rules.yaml.
groups:
  - name: node-alert
    rules:
      - alert: disk-full
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"}) > 80
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} disk full "
          description: "{{ $labels.instance }} disk > {{ $value }} "
```
## (Optional) Update the KV configuration
To update the KV configuration, use `obd cluster edit-config`. For details, see the [OBD documentation](https://github.com/oceanbase/obdeploy/blob/master/README-CN.md#obd-cluster-edit-config).
# Install OBAgent
You can install OBAgent from the RPM package or build it from source.
## Requirements
Building OBAgent requires Go 1.14 or later.
## RPM package
OBAgent is available as an RPM package. Download it from the [Release page](https://mirrors.aliyun.com/oceanbase/community/stable/el/7/x86_64/obagent-1.0.0-1.el7.x86_64.rpm), then install it with the following command:
```bash
rpm -ivh obagent-1.0.0-1.el7.x86_64.rpm
```
## Build from source
### Debug mode
```bash
make build  # builds in debug mode by default
make build-debug
```
### Release mode
```bash
make build-release
```
## OBAgent installation directory layout
The OBAgent installation directory contains three subdirectories: `bin`, `conf`, and `run`. Its layout is as follows:
```bash
# Sample directory layout
.
├── bin
│   └── monagent
├── conf
│   ├── config_properties
│   │   ├── monagent_basic_auth.yaml
│   │   └── monagent_pipeline.yaml
│   ├── module_config
│   │   ├── monagent_basic_auth.yaml
│   │   ├── monagent_config.yaml
│   │   ├── monitor_node_host.yaml
│   │   └── monitor_ob.yaml
│   ├── monagent.yaml
│   └── prometheus_config
│       ├── prometheus.yaml
│       └── rules
│           ├── host_rules.yaml
│           └── ob_rules.yaml
└── run
```
Here, `bin` holds the binaries. `conf` holds the startup configuration, the module configuration templates, the KV variable configuration, and the Prometheus configuration template. `run` holds runtime files. For more about the configuration files, see the configuration file reference (LINK TODO).
# OBAgent alert items
| Alert item | Metric expression | Threshold | Description |
| --- | --- | --- | --- |
| ob_host_connection_percent_over_threshold | 100 * max(ob_active_session_num{@LABELS} / 262144) by (@GBLABELS) | 80 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_connection_percent_over_thre](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_connection_percent_over_thre) |
| ob_host_cpu_percent | 100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | 100 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_cpu_percent_over_threshold](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_cpu_percent_over_threshold) |
| ob_cpu_percent_over_threshold | 100 * sum(ob_sysstat{stat_id="140006",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="140005",@LABELS}) by (@GBLABELS) | 90 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cpu_percent_over_threshold](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cpu_percent_over_threshold) |
| ob_host_disk_percent_over_threshold | 100 * (1 - avg(node_filesystem_avail_bytes{@LABELS}) by (@GBLABELS) / avg(node_filesystem_size_bytes{@LABELS}) by (@GBLABELS)) | 97 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_disk_percent_over_threshold](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_disk_percent_over_threshold) |
| ob_cluster_frozen_version_delta_over_threshold | max(ob_zone_stat{name="frozen_version",@LABELS}) by (@GBLABELS) - min(ob_zone_stat{name="last_merged_version",@LABELS}) by (@GBLABELS) | 1 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_frozen_version_delta_over](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_frozen_version_delta_over) |
| ob_host_net_recv_percent_over_threshold | 100 * max(sum(rate(node_network_receive_bytes_total{@LABELS}[@INTERVAL])) by (device,@GBLABELS) / sum(bandwidth{@LABELS})) by (@GBLABELS) | 80 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_net_recv_percent_over_thresh](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_net_recv_percent_over_thresh) |
| ob_host_net_send_percent_over_threshold | 100 * max(sum(rate(node_network_transmit_bytes_total{@LABELS}[@INTERVAL])) by (device,@GBLABELS) / sum(bandwidth{@LABELS})) by (@GBLABELS) | 80 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_net_send_percent_over_thresh](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_net_send_percent_over_thresh) |
| ob_cluster_exists_inactive_server | max(ob_server_num{status="inactive",@LABELS}) by (@GBLABELS) | 0 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_exists_inactive_server](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_exists_inactive_server) |
| ob_cluster_exists_index_fail_table | sum(ob_index_error_num{@LABELS}) by (@GBLABELS) | 0 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_exists_index_fail_table](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_exists_index_fail_table) |
| ob_host_load1_per_cpu_over_threshold | sum(node_load1{@LABELS}) by (@GBLABELS) / sum(cpu_count{@LABELS}) by (@GBLABELS) | 2 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_load1_per_cpu_over_threshold](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_load1_per_cpu_over_threshold) |
| ob_host_mem_percent_over_threshold | (1 - (avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Cached_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS)) / avg(node_memory_MemTotal_bytes{@LABELS}) by (@GBLABELS)) * 100 | 90 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_mem_percent_over_threshold](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_mem_percent_over_threshold) |
| ob_cluster_merge_timeout | max(ob_zone_stat{name="is_merge_timeout",@LABELS}) by (@GBLABELS) | ==1 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_merge_timeout](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_merge_timeout) |
| ob_cluster_merge_error | max(ob_zone_stat{name="is_merge_error",@LABELS}) by (@GBLABELS) | ==1 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_merge_error](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_cluster_merge_error) |
| ob_host_partition_count_over_threshold | sum(ob_partition_num{@LABELS}) by (@GBLABELS) | 30000 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_partition_count_over_thresho](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_partition_count_over_thresho) |
| ob_host_disk_readonly | max(node_filesystem_readonly{@LABELS}) by (@GBLABELS) | 1 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_disk_readonly](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_host_disk_readonly) |
| ob_server_sstable_percent_over_threshold | 100 * (sum(ob_disk_total_bytes{@LABELS}) by (@GBLABELS) - sum(ob_disk_free_bytes{@LABELS}) by (@GBLABELS)) / sum(ob_disk_total_bytes{@LABELS}) by (@GBLABELS) | 85 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_server_sstable_percent_over_thres](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/ob_server_sstable_percent_over_thres) |
| tenant_active_memstore_percent_over_threshold | 100 * sum(ob_sysstat{stat_id="130000",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="130002",@LABELS}) by (@GBLABELS) | 110 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/tenant_active_memstore_percent_over_](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/tenant_active_memstore_percent_over_) |
| tenant_memstore_percent_over_threshold | 100 * sum(ob_sysstat{stat_id="130001",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="130004",@LABELS}) by (@GBLABELS) | 85 | [https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/tenant_memstore_percent_over_thresho](https://www.oceanbase.com/docs/oceanbase-cloud-platform/oceanbase-cloud-platform/V3.1.1/tenant_memstore_percent_over_thresho) |
# Common monitoring metrics
**Metrics exposed by the exporter**
| **Category** | **Metric name** | **Labels** | **Meaning** | **Type** |
| --- | --- | --- | --- | --- |
| **Host** | node_xxx | node_exporter labels plus ob_cluster_id,ob_cluster_name,obzone,svr_ip | Host monitoring metrics | See the corresponding node_exporter metric type |
| **OB** | ob_active_session_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of active connections | gauge |
| | ob_cache_size_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name,cache_name | kvcache size | gauge |
| | ob_partition_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of partitions | gauge |
| | ob_plan_cache_access_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of plan cache accesses | counter |
| | ob_plan_cache_hit_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of plan cache hits | counter |
| | ob_plan_cache_memory_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Plan cache size | gauge |
| | ob_table_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of tables | gauge |
| | ob_waitevent_wait_seconds_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total wait time of wait events | counter |
| | ob_waitevent_wait_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total number of waits of wait events | counter |
| | ob_disk_free_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Free OB disk space | gauge |
| | ob_disk_total_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Total OB disk size | gauge |
| | ob_memstore_active_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Active memstore size | gauge |
| | ob_memstore_freeze_times | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Number of memstore freezes | counter |
| | ob_memstore_freeze_trigger_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | memstore freeze threshold | gauge |
| | ob_memstore_total_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total memstore size | gauge |
| | ob_server_resource_cpu | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Number of CPUs available to observer | gauge |
| | ob_server_resource_cpu_assigned | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Number of CPUs already assigned on observer | gauge |
| | ob_server_resource_memory_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Memory available to observer | gauge |
| | ob_server_resource_memory_assigned_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Memory already assigned on observer | gauge |
| | ob_unit_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Number of observer units | gauge |
| | ob_sysstat | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name,stat_id | OB internal statistics | Depends on the stat_id; see the sysstat statistics |
**sysstat statistics**
| **stat_id** | **Meaning** | **Type** |
| --- | --- | --- |
| 10000 | Number of RPC packets received | counter |
| 10001 | Size of RPC packets received | counter |
| 10002 | Number of RPC packets sent | counter |
| 10003 | Size of RPC packets sent | counter |
| 10005 | RPC network latency | counter |
| 10006 | RPC framework latency | counter |
| 20001 | Number of requests dequeued | counter |
| 20002 | Time requests spent in the queue | counter |
| 30000 | clog sync time | counter |
| 30001 | clog sync count | counter |
| 30002 | clog commit count | counter |
| 30005 | Number of transactions | counter |
| 30006 | Total transaction time | counter |
| 40000 | Number of select statements | counter |
| 40001 | select execution time | counter |
| 40002 | Number of insert statements | counter |
| 40003 | insert execution time | counter |
| 40004 | Number of replace statements | counter |
| 40005 | replace execution time | counter |
| 40006 | Number of update statements | counter |
| 40007 | update execution time | counter |
| 40008 | Number of delete statements | counter |
| 40009 | delete execution time | counter |
| 40010 | Number of local SQL executions | counter |
| 40011 | Number of remote SQL executions | counter |
| 40012 | Number of distributed SQL executions | counter |
| 50000 | Row cache hits | counter |
| 50001 | Row cache misses | counter |
| 50008 | Block cache hits | counter |
| 50009 | Block cache misses | counter |
| 60000 | I/O read count | counter |
| 60001 | I/O read latency | counter |
| 60002 | I/O bytes read | counter |
| 60003 | I/O write count | counter |
| 60004 | I/O write latency | counter |
| 60005 | I/O bytes written | counter |
| 60019 | memstore read lock successes | counter |
| 60020 | memstore read lock failures | counter |
| 60021 | memstore write lock successes | counter |
| 60022 | memstore write lock failures | counter |
| 60023 | memstore write lock wait time | counter |
| 60024 | memstore read lock wait time | counter |
| 80040 | clog write count | counter |
| 80041 | clog write time | counter |
| 80057 | clog size | counter |
| 130000 | Active memstore size | gauge |
| 130001 | Total memstore size | gauge |
| 130002 | Major freeze trigger threshold | gauge |
| 130004 | memstore size limit | gauge |
| 140002 | Maximum usable memory | gauge |
| 140003 | Memory in use | gauge |
| 140005 | Maximum usable CPU | gauge |
| 140006 | CPU in use | gauge |
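**Query expressions for common metrics**
Before running a query, replace the variables with the specific values you want to query:
- Replace @LABELS with the label filter conditions.
- Replace @INTERVAL with the calculation interval.
- Replace @GBLABELS with the names of the labels to aggregate by.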
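The substitution described above can be sketched as a small helper. The function name `renderQuery` and the sample values (`svr_ip="127.0.0.1"`, `1m`, `svr_ip`) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// renderQuery fills in the three placeholders used by the query expressions:
// @LABELS (label filters), @INTERVAL (rate window), and @GBLABELS (group-by labels).
func renderQuery(expr, labels, interval, groupBy string) string {
	return strings.NewReplacer(
		"@GBLABELS", groupBy,
		"@LABELS", labels,
		"@INTERVAL", interval,
	).Replace(expr)
}

func main() {
	expr := `sum(rate(ob_sysstat{stat_id="60000",@LABELS}[@INTERVAL])) by (@GBLABELS)`
	fmt.Println(renderQuery(expr, `svr_ip="127.0.0.1"`, "1m", "svr_ip"))
}
```

The rendered string can then be pasted into the Prometheus expression browser or sent to its query API.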
| **指标** | **表达式** | **单位** |
| --- | --- | --- |
| 活跃 MEMStore 大小 | sum(ob_sysstat{stat_id="130000",@LABELS}) by ([@GBLABELS) ](/GBLABELS) )
/ 1048576 | MB |
| 当前活跃会话数 | sum(ob_active_session_num{@LABELS}) by ([@GBLABELS) ](/GBLABELS) )
| |
| 块缓存命中率 | 100 * 1 / (1 + sum(rate(ob_sysstat{stat_id="50009",@LABELS}[@INTERVAL])) by ([@GBLABELS) ](/GBLABELS) )
/ sum(rate(ob_sysstat{stat_id="50008",@LABELS}[@INTERVAL])) by (@GBLABELS)) | % |
| 块缓存大小 | sum(ob_cache_size_bytes{cache_name="user_block_cache",@LABELS}) by ([@GBLABELS) ](/GBLABELS) )
/ 1048576 | MB |
| 每秒提交的事务日志大小 | sum(rate(ob_sysstat{stat_id="80057",@LABELS}[@INTERVAL])) by ([@GBLABELS) ](/GBLABELS) )
| byte |
| 每次事务日志写盘平均耗时 | sum(rate(ob_sysstat{stat_id="80041",@LABELS}[@INTERVAL])) by ([@GBLABELS) ](/GBLABELS) )
/ sum(rate(ob_sysstat{stat_id="80040",@LABELS}[@INTERVAL])) by ([@GBLABELS) ](/GBLABELS) )
| us |
| CPU 使用率 | 100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL])) by ([@GBLABELS) ](/GBLABELS) )
/ sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | % |
| Used disk partition capacity | sum(node_filesystem_size_bytes{@LABELS} - node_filesystem_avail_bytes{@LABELS}) by (@GBLABELS) / 1073741824 | GB |
| Disk reads per second | avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Data volume per read | avg(rate(node_disk_read_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576 | MB |
| SSStore reads per second | sum(rate(ob_sysstat{stat_id="60000",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| SSStore average time per read | sum(rate(ob_sysstat{stat_id="60001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="60000",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| SSStore bytes read per second | sum(rate(ob_sysstat{stat_id="60002",@LABELS}[@INTERVAL])) by (@GBLABELS) | byte |
| Average read time per second | 1000000 * (avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| Average time per read IO | 1000000 * (avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| Disk writes per second | avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Data volume per write | avg(rate(node_disk_written_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576 | MB |
| SSStore writes per second | sum(rate(ob_sysstat{stat_id="60003",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| SSStore average time per write | sum(rate(ob_sysstat{stat_id="60004",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="60003",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| SSStore bytes written per second | sum(rate(ob_sysstat{stat_id="60005",@LABELS}[@INTERVAL])) by (@GBLABELS) | byte |
| Average write time per second | 1000000 * (avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| Average time per write IO | 1000000 * (avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| System load average over the past 1 minute | avg(node_load1{@LABELS}) by (@GBLABELS) | |
| System load average over the past 15 minutes | avg(node_load15{@LABELS}) by (@GBLABELS) | |
| System load average over the past 5 minutes | avg(node_load5{@LABELS}) by (@GBLABELS) | |
| Merge trigger threshold | sum(ob_sysstat{stat_id="130002",@LABELS}) by (@GBLABELS) / 1048576 | MB |
| Kernel buffer cache size | avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS) / 1073741824 | GB |
| Free physical memory | avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) / 1073741824 | GB |
| Used physical memory | (avg(node_memory_MemTotal_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_Cached_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS)) / 1073741824 | GB |
| MEMStore limit | sum(ob_sysstat{stat_id="130004",@LABELS}) by (@GBLABELS) / 1048576 | MB |
| MEMStore usage percentage | 100 * sum(ob_sysstat{stat_id="130001",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="130004",@LABELS}) by (@GBLABELS) | % |
| Failed write-lock waits | sum(rate(ob_sysstat{stat_id="60022",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Successful write-lock waits | sum(rate(ob_sysstat{stat_id="60021",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average write-lock wait time | sum(rate(ob_sysstat{stat_id="60023",@LABELS}[@INTERVAL])) by (@GBLABELS) / (sum(rate(ob_sysstat{stat_id="60021",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="60022",@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| Data received per second | avg(rate(node_network_receive_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576 | MB |
| Data sent per second | avg(rate(node_network_transmit_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576 | MB |
| CPU usage | 100 * sum(ob_sysstat{stat_id="140006",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="140005",@LABELS}) by (@GBLABELS) | % |
| Partition count | sum(ob_partition_num{@LABELS}) by (@GBLABELS) | |
| Plan cache hit rate | 100 * sum(rate(ob_plan_cache_hit_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_plan_cache_access_total{@LABELS}[@INTERVAL])) by (@GBLABELS) | % |
| Plan cache size | sum(ob_plan_cache_memory_bytes{@LABELS}) by (@GBLABELS) / 1048576 | MB |
| SQL requests entering the wait queue per second | sum(rate(ob_sysstat{stat_id="20001",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| SQL wait time in the wait queue | sum(rate(ob_sysstat{stat_id="20002",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="20001",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Row cache hit rate | 100 * 1 / (1 + sum(rate(ob_sysstat{stat_id="50001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="50000",@LABELS}[@INTERVAL])) by (@GBLABELS)) | % |
| Cache size | sum(ob_cache_size_bytes{@LABELS}) by (@GBLABELS) / 1048576 | MB |
| RPC receive throughput | sum(rate(ob_sysstat{stat_id="10001",@LABELS}[@INTERVAL])) by (@GBLABELS) | byte |
| Average RPC receive time | (sum(rate(ob_sysstat{stat_id="10005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="10006",@LABELS}[@INTERVAL])) by (@GBLABELS)) / sum(rate(ob_sysstat{stat_id="10000",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| RPC send throughput | sum(rate(ob_sysstat{stat_id="10003",@LABELS}[@INTERVAL])) by (@GBLABELS) | byte |
| Average RPC send time | (sum(rate(ob_sysstat{stat_id="10005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="10006",@LABELS}[@INTERVAL])) by (@GBLABELS)) / sum(rate(ob_sysstat{stat_id="10002",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| SQL statements processed per second | sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per SQL statement | (sum(rate(ob_sysstat{stat_id="40003",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40007",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40009",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40001",@LABELS}[@INTERVAL])) by (@GBLABELS)) / (sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS)) | us |
| Delete statements processed per second | sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per Delete statement | sum(rate(ob_sysstat{stat_id="40009",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Distributed execution plans processed per second | sum(rate(ob_sysstat{stat_id="40012",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Insert statements processed per second | sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per Insert statement | sum(rate(ob_sysstat{stat_id="40003",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Local executions processed per second | sum(rate(ob_sysstat{stat_id="40010",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Remote execution plans processed per second | sum(rate(ob_sysstat{stat_id="40011",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Replace statements processed per second | sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per Replace statement | sum(rate(ob_sysstat{stat_id="40005",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Select statements processed per second | sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per Select statement | sum(rate(ob_sysstat{stat_id="40001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Update statements processed per second | sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per Update statement | sum(rate(ob_sysstat{stat_id="40007",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Table count | max(ob_table_num{@LABELS}) by (@GBLABELS) | |
| Total MEMStore size | sum(ob_sysstat{stat_id="130001",@LABELS}) by (@GBLABELS) / 1048576 | MB |
| Transactions processed per second | sum(rate(ob_sysstat{stat_id="30005",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average server-side processing time per transaction | sum(rate(ob_sysstat{stat_id="30006",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="30005",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Transaction logs committed per second | sum(rate(ob_sysstat{stat_id="30002",@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average network sync time per transaction log | sum(rate(ob_sysstat{stat_id="30000",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="30001",@LABELS}[@INTERVAL])) by (@GBLABELS) | us |
| Wait events per second | sum(rate(ob_waitevent_wait_total{@LABELS}[@INTERVAL])) by (@GBLABELS) | times/s |
| Average wait event time | sum(rate(ob_waitevent_wait_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_waitevent_wait_total{@LABELS}[@INTERVAL])) by (@GBLABELS) | s |
# OBAgent data processing flow
![Screenshot 2021-09-15 at 11.36.11.png](https://intranetproxy.alipay.com/skylark/lark/0/2021/png/28412/1631676986085-49f40134-9502-438b-bb32-5a3ee6591fbc.png#clientId=u89a16740-3189-4&from=ui&id=ucf4512dc&margin=%5Bobject%20Object%5D&name=Screenshot%202021-09-15%20at%2011.36.11.png&originHeight=868&originWidth=1638&originalType=binary&ratio=1&size=108483&status=done&style=none&taskId=uc6b3bf07-12ab-4e56-9568-bdab0001cc4)
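OBAgent combines plugins into pipelines. A pipeline is the basic scheduling unit and implements a complete flow of data collection, processing, and reporting. Pipelines run in one of two modes, push or pull. Each pipeline contains a group of input plugins that collect data in parallel and a group of processor plugins that process data serially. In pull mode, the pipeline includes one exporter plugin, which exposes the data through an HTTP service. In push mode, it includes one output plugin, which pushes the data to a remote end. Data flowing through a pipeline is represented by the unified Metric interface. To extend OBAgent, you can develop custom plugins: a plugin only needs to implement the basic plugin interfaces and the interface for its plugin type.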
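The pull-mode flow described above can be sketched roughly as follows. This is an illustration only, not OBAgent's scheduler: the `Metric` struct is a simplified stand-in for the real Metric interface, and `RunPull`, `staticInput`, `tagProcessor`, and `lineExporter` are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Metric is a simplified stand-in for OBAgent's Metric interface (assumption).
type Metric struct {
	Name  string
	Value float64
	Tags  map[string]string
}

// Reduced versions of the plugin interfaces, adapted to the stand-in Metric.
type Input interface{ Collect() ([]Metric, error) }
type Processor interface {
	Process(metrics ...Metric) ([]Metric, error)
}
type Exporter interface {
	Export(metrics []Metric) (string, error)
}

// RunPull runs one pull-mode cycle: inputs in parallel,
// processors one after another, then a single exporter.
func RunPull(inputs []Input, processors []Processor, exporter Exporter) (string, error) {
	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		collected []Metric
		firstErr  error
	)
	for _, in := range inputs {
		wg.Add(1)
		go func(in Input) {
			defer wg.Done()
			ms, err := in.Collect()
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			collected = append(collected, ms...)
		}(in)
	}
	wg.Wait()
	if firstErr != nil {
		return "", firstErr
	}
	var err error
	for _, p := range processors {
		if collected, err = p.Process(collected...); err != nil {
			return "", err
		}
	}
	return exporter.Export(collected)
}

// staticInput returns one fixed metric (hypothetical demo plugin).
type staticInput struct {
	name  string
	value float64
}

func (s staticInput) Collect() ([]Metric, error) {
	return []Metric{{Name: s.name, Value: s.value, Tags: map[string]string{}}}, nil
}

// tagProcessor attaches one tag to every metric (hypothetical demo plugin).
type tagProcessor struct{ key, value string }

func (t tagProcessor) Process(metrics ...Metric) ([]Metric, error) {
	for _, m := range metrics {
		m.Tags[t.key] = t.value
	}
	return metrics, nil
}

// lineExporter renders one sorted line per metric (hypothetical demo plugin).
type lineExporter struct{}

func (lineExporter) Export(metrics []Metric) (string, error) {
	lines := make([]string, 0, len(metrics))
	for _, m := range metrics {
		lines = append(lines, fmt.Sprintf("%s{svr_ip=%q} %g", m.Name, m.Tags["svr_ip"], m.Value))
	}
	sort.Strings(lines) // parallel collection has no fixed order
	return strings.Join(lines, "\n"), nil
}

func main() {
	out, err := RunPull(
		[]Input{staticInput{"node_load1", 0.5}, staticInput{"node_load5", 0.7}},
		[]Processor{tagProcessor{"svr_ip", "127.0.0.1"}},
		lineExporter{},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A push-mode pipeline would follow the same shape, with an output plugin's `Write` call taking the place of the exporter.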
# Metric interface definition
```go
// Metric is the unified data type that flows through a pipeline.
type Metric interface {
	Clone() Metric
	SetName(name string)
	GetName() string
	SetTime(time time.Time)
	GetTime() time.Time
	SetMetricType(metricType Type)
	GetMetricType() Type
	Fields() map[string]interface{}
	Tags() map[string]string
}
```
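As a minimal sketch of what an implementation of this interface could look like, the example below backs it with a plain struct. The `Type` definition, the `Gauge` and `Counter` constants, `simpleMetric`, and `NewMetric` are assumptions made for illustration; the document does not show how OBAgent defines them.

```go
package main

import (
	"fmt"
	"time"
)

// Type is assumed to be a small enum; the real definition is not shown here.
type Type string

const (
	Gauge   Type = "gauge"
	Counter Type = "counter"
)

// Metric repeats the interface above so that the sketch compiles on its own.
type Metric interface {
	Clone() Metric
	SetName(name string)
	GetName() string
	SetTime(time time.Time)
	GetTime() time.Time
	SetMetricType(metricType Type)
	GetMetricType() Type
	Fields() map[string]interface{}
	Tags() map[string]string
}

// simpleMetric is a hypothetical struct-backed implementation.
type simpleMetric struct {
	name       string
	ts         time.Time
	metricType Type
	fields     map[string]interface{}
	tags       map[string]string
}

// NewMetric builds a metric stamped with the current time.
func NewMetric(name string, t Type, fields map[string]interface{}, tags map[string]string) Metric {
	return &simpleMetric{name: name, ts: time.Now(), metricType: t, fields: fields, tags: tags}
}

// Clone copies both maps so that edits to the copy do not leak into the original.
func (m *simpleMetric) Clone() Metric {
	fields := make(map[string]interface{}, len(m.fields))
	for k, v := range m.fields {
		fields[k] = v
	}
	tags := make(map[string]string, len(m.tags))
	for k, v := range m.tags {
		tags[k] = v
	}
	return &simpleMetric{name: m.name, ts: m.ts, metricType: m.metricType, fields: fields, tags: tags}
}

func (m *simpleMetric) SetName(name string)            { m.name = name }
func (m *simpleMetric) GetName() string                { return m.name }
func (m *simpleMetric) SetTime(t time.Time)            { m.ts = t }
func (m *simpleMetric) GetTime() time.Time             { return m.ts }
func (m *simpleMetric) SetMetricType(t Type)           { m.metricType = t }
func (m *simpleMetric) GetMetricType() Type            { return m.metricType }
func (m *simpleMetric) Fields() map[string]interface{} { return m.fields }
func (m *simpleMetric) Tags() map[string]string        { return m.tags }

func main() {
	orig := NewMetric("node_load1", Gauge,
		map[string]interface{}{"value": 0.5},
		map[string]string{"svr_ip": "127.0.0.1"})
	cp := orig.Clone()
	cp.Tags()["svr_ip"] = "10.0.0.1"
	// The original keeps its own tag map after the clone is modified.
	fmt.Println(orig.Tags()["svr_ip"], cp.Tags()["svr_ip"])
}
```

Copying the maps in `Clone` is a design choice of this sketch: it lets a processor modify its copy without affecting metrics held elsewhere.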
# Basic plugin interface definitions
```go
// Initializer contains the Init function.
type Initializer interface {
	Init(config map[string]interface{}) error
}

// Closer contains the Close function.
type Closer interface {
	Close() error
}

// Describer shows SampleConfig and Description.
type Describer interface {
	SampleConfig() string
	Description() string
}
```
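All OBAgent plugins must implement the basic interfaces above:
- The Init method initializes the plugin.
- The Close method is called when the plugin exits and releases resources.
- The SampleConfig method returns a sample configuration for the plugin.
- The Description method returns a description of the plugin.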
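A plugin skeleton that satisfies these three interfaces might look like the sketch below. `demoPlugin` and its `target` configuration key are hypothetical and exist only to show where each method fits.

```go
package main

import (
	"errors"
	"fmt"
)

// The basic interfaces, repeated so that the sketch compiles on its own.
type Initializer interface {
	Init(config map[string]interface{}) error
}
type Closer interface {
	Close() error
}
type Describer interface {
	SampleConfig() string
	Description() string
}

// demoPlugin is a hypothetical plugin that only stores a target address.
type demoPlugin struct {
	target string
	closed bool
}

// Init validates the configuration and stores what the plugin needs.
func (p *demoPlugin) Init(config map[string]interface{}) error {
	target, ok := config["target"].(string)
	if !ok || target == "" {
		return errors.New("demoPlugin: config key \"target\" must be a non-empty string")
	}
	p.target = target
	return nil
}

// Close releases resources; here it only records that it was called.
func (p *demoPlugin) Close() error {
	p.closed = true
	return nil
}

func (p *demoPlugin) SampleConfig() string { return "target: 127.0.0.1:2881" }
func (p *demoPlugin) Description() string  { return "demo plugin that stores a target address" }

// Compile-time checks that demoPlugin implements all basic interfaces.
var (
	_ Initializer = (*demoPlugin)(nil)
	_ Closer      = (*demoPlugin)(nil)
	_ Describer   = (*demoPlugin)(nil)
)

func main() {
	p := &demoPlugin{}
	if err := p.Init(map[string]interface{}{"target": "127.0.0.1:2881"}); err != nil {
		panic(err)
	}
	defer p.Close()
	fmt.Println(p.Description())
}
```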
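# Interface definitions for each plugin type
## Input plugin
```go
type Input interface {
	Collect() ([]metric.Metric, error)
}
```
An input plugin must implement the Collect method, which collects data and returns a group of metrics and an error.
## Processor plugin
```go
type Processor interface {
	Process(metrics ...metric.Metric) ([]metric.Metric, error)
}
```
A processor plugin must implement the Process method, which processes data. It takes a group of metrics as input and returns a group of metrics and an error.
## Exporter plugin
```go
type Exporter interface {
	Export(metrics []metric.Metric) (*bytes.Buffer, error)
}
```
An exporter plugin must implement the Export method. It takes a group of metrics as input and returns a byte buffer and an error. Its role is to convert metrics into another format, and it is used only in pull mode.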
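To make the format conversion concrete, here is a rough sketch of an exporter that renders metrics as Prometheus-style text lines. It rests on several assumptions: the `Metric` interface is reduced to the three methods the example needs, each metric is assumed to carry a single `float64` field named `value`, and `textExporter` and `sample` are hypothetical. Go's `%q` quoting is used as an approximation of Prometheus label escaping.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strings"
)

// Metric is a reduced view of the Metric interface (assumption for this sketch).
type Metric interface {
	GetName() string
	Fields() map[string]interface{}
	Tags() map[string]string
}

// sample is a hypothetical in-memory metric used for the demo.
type sample struct {
	name   string
	fields map[string]interface{}
	tags   map[string]string
}

func (s sample) GetName() string                { return s.name }
func (s sample) Fields() map[string]interface{} { return s.fields }
func (s sample) Tags() map[string]string        { return s.tags }

// textExporter writes one "name{labels} value" line per metric.
type textExporter struct{}

func (textExporter) Export(metrics []Metric) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	for _, m := range metrics {
		value, ok := m.Fields()["value"].(float64)
		if !ok {
			return nil, fmt.Errorf("metric %s: field \"value\" is missing or not a float64", m.GetName())
		}
		// Sort label names so the output is stable across runs.
		keys := make([]string, 0, len(m.Tags()))
		for k := range m.Tags() {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		labels := make([]string, 0, len(keys))
		for _, k := range keys {
			labels = append(labels, fmt.Sprintf("%s=%q", k, m.Tags()[k]))
		}
		fmt.Fprintf(buf, "%s{%s} %g\n", m.GetName(), strings.Join(labels, ","), value)
	}
	return buf, nil
}

func main() {
	buf, err := textExporter{}.Export([]Metric{
		sample{"node_load1", map[string]interface{}{"value": 0.5}, map[string]string{"svr_ip": "127.0.0.1"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(buf.String())
}
```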
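## Output plugin
```go
type Output interface {
	Write(metrics []metric.Metric) error
}
```
An output plugin must implement the Write method. It takes a group of metrics as input and returns an error. Its role is to push metric data to a remote end, and it is used only in push mode.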
# Exporter plugin metrics
The metrics exposed by the exporter plugin are listed in the tables below.
## Host metrics
| **Metric name** | **Label** | **Description** | **Type** |
| --- | --- | --- | --- |
| node_cpu_seconds_total | cpu,mode,svr_ip | CPU time | counter |
| node_disk_read_bytes_total | device,svr_ip | Bytes read from disk | counter |
| node_disk_read_time_seconds_total | device,svr_ip | Total time spent on disk reads | counter |
| node_disk_reads_completed_total | device,svr_ip | Completed disk reads | counter |
| node_disk_written_bytes_total | device,svr_ip | Bytes written to disk | counter |
| node_disk_write_time_seconds_total | device,svr_ip | Total time spent on disk writes | counter |
| node_disk_writes_completed_total | device,svr_ip | Completed disk writes | counter |
| node_filesystem_avail_bytes | device,fstype,mountpoint,svr_ip | Available file system space | gauge |
| node_filesystem_readonly | device,fstype,mountpoint,svr_ip | Whether the file system is read-only | gauge |
| node_filesystem_size_bytes | device,fstype,mountpoint,svr_ip | File system size | gauge |
| node_load1 | svr_ip | 1-minute load average | gauge |
| node_load5 | svr_ip | 5-minute load average | gauge |
| node_load15 | svr_ip | 15-minute load average | gauge |
| node_memory_Buffers_bytes | svr_ip | Memory buffer size | gauge |
| node_memory_Cached_bytes | svr_ip | Memory cache size | gauge |
| node_memory_MemFree_bytes | svr_ip | Free memory | gauge |
| node_memory_MemTotal_bytes | svr_ip | Total memory | gauge |
| node_network_receive_bytes_total | device,svr_ip | Total bytes received over the network | counter |
| node_network_transmit_bytes_total | device,svr_ip | Total bytes sent over the network | counter |
| node_ntp_offset_seconds | svr_ip | NTP clock offset | gauge |
## OceanBase database metrics
| **Metric name** | **Label** | **Description** | **Type** |
| --- | --- | --- | --- |
| ob_active_session_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Active sessions | gauge |
| ob_cache_size_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name,cache_name | kvcache size | gauge |
| ob_partition_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Partition count | gauge |
| ob_plan_cache_access_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Plan cache accesses | counter |
| ob_plan_cache_hit_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Plan cache hits | counter |
| ob_plan_cache_memory_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Plan cache size | gauge |
| ob_table_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Table count | gauge |
| ob_waitevent_wait_seconds_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total wait time of wait events | counter |
| ob_waitevent_wait_total | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total number of waits of wait events | counter |
| ob_disk_free_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Free OceanBase disk space | gauge |
| ob_disk_total_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Total OceanBase disk space | gauge |
| ob_memstore_active_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Active memstore size | gauge |
| ob_memstore_freeze_times | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Memstore freeze count | counter |
| ob_memstore_freeze_trigger_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Memstore freeze threshold | gauge |
| ob_memstore_total_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name | Total memstore size | gauge |
| ob_server_resource_cpu | ob_cluster_id,ob_cluster_name,obzone,svr_ip | CPUs available to observer | gauge |
| ob_server_resource_cpu_assigned | ob_cluster_id,ob_cluster_name,obzone,svr_ip | CPUs assigned by observer | gauge |
| ob_server_resource_memory_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Memory available to observer | gauge |
| ob_server_resource_memory_assigned_bytes | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Memory assigned by observer | gauge |
| ob_unit_num | ob_cluster_id,ob_cluster_name,obzone,svr_ip | Number of observer units | gauge |
| ob_sysstat | ob_cluster_id,ob_cluster_name,obzone,svr_ip,tenant_name,stat_id | OceanBase internal statistics | Varies by stat_id; see the system statistics section |
# System statistics
The system statistics (sysstat) are listed in the table below.
| **stat_id** | **Description** | **Type** |
| --- | --- | --- |
| 10000 | RPC packets received | counter |
| 10001 | Size of RPC packets received | counter |
| 10002 | RPC packets sent | counter |
| 10003 | Size of RPC packets sent | counter |
| 10005 | RPC network latency | counter |
| 10006 | RPC framework latency | counter |
| 20001 | Requests dequeued | counter |
| 20002 | Time requests spend in the queue | counter |
| 30000 | clog sync time | counter |
| 30001 | clog sync count | counter |
| 30002 | clog commit count | counter |
| 30005 | Transaction count | counter |
| 30006 | Total transaction time | counter |
| 40000 | select SQL count | counter |
| 40001 | select SQL execution time | counter |
| 40002 | insert SQL count | counter |
| 40003 | insert SQL execution time | counter |
| 40004 | replace SQL count | counter |
| 40005 | replace SQL execution time | counter |
| 40006 | update SQL count | counter |
| 40007 | update SQL execution time | counter |
| 40008 | delete SQL count | counter |
| 40009 | delete SQL execution time | counter |
| 40010 | Local SQL executions | counter |
| 40011 | Remote SQL executions | counter |
| 40012 | Distributed SQL executions | counter |
| 50000 | row cache hits | counter |
| 50001 | row cache misses | counter |
| 50008 | block cache hits | counter |
| 50009 | block cache misses | counter |
| 60000 | IO reads | counter |
| 60001 | IO read latency | counter |
| 60002 | IO bytes read | counter |
| 60003 | IO writes | counter |
| 60004 | IO write latency | counter |
| 60005 | IO bytes written | counter |
| 60019 | memstore read lock successes | counter |
| 60020 | memstore read lock failures | counter |
| 60021 | memstore write lock successes | counter |
| 60022 | memstore write lock failures | counter |
| 60023 | memstore write lock wait time | counter |
| 60024 | memstore read lock wait time | counter |
| 80040 | clog writes | counter |
| 80041 | clog write time | counter |
| 80057 | clog size | counter |
| 130000 | Active memstore size | gauge |
| 130001 | Total memstore size | gauge |
| 130002 | Major freeze trigger threshold | gauge |
| 130004 | memstore size limit | gauge |
| 140002 | Maximum usable memory | gauge |
| 140003 | Used memory | gauge |
| 140005 | Maximum usable CPU | gauge |
| 140006 | Used CPU | gauge |
# Installation and configuration
OBAgent is distributed as an RPM package, which you can install with the rpm command.
// TODO: replace with real url
[obagent-0.1-1.alios7.x86_64.rpm]()
```bash
rpm -ivh obagent-0.1-1.alios7.x86_64.rpm
```
**Directory structure**
After installation, there is one binary file and a set of configuration files. The configuration files are divided into the program startup configuration, the module configuration templates, and the KV variable configuration. A Prometheus configuration template is also included for convenience.
```bash
# Sample directory structure
.
├── bin
│   └── monagent
├── conf
│   ├── config_properties
│   │   ├── monagent_basic_auth.yaml
│   │   └── monagent_pipeline.yaml
│   ├── module_config
│   │   ├── monagent_basic_auth.yaml
│   │   ├── monagent_config.yaml
│   │   ├── monitor_node_host.yaml
│   │   └── monitor_ob.yaml
│   ├── monagent.yaml
│   └── prometheus_config
│       ├── prometheus.yaml
│       └── rules
│           ├── host_rules.yaml
│           └── ob_rules.yaml
└── run
```
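**The monagent.yaml configuration file:**
```yaml
## Log configuration
log:
  level: debug
  filename: log/monagent.log
  maxsize: 30
  maxage: 7
  maxbackups: 10
  localtime: true
  compress: true
## Process configuration. address serves metric pulls and the management APIs by default; adminAddress is the pprof debugging port.
server:
  address: "0.0.0.0:8088"
  adminAddress: "0.0.0.0:8089"
  runDir: run
## Configuration settings. The supported encryption methods are aes and plain. With aes, the key in the key file below is used to encrypt the configuration items that require encryption.
## modulePath stores the configuration templates; propertiesPath stores the KV variable configuration.
cryptoMethod: plain
cryptoPath: conf/.config_secret.key
modulePath: conf/module_config
propertiesPath: conf/config_properties
```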
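**Configuration templates**
```bash
## Basic auth configuration. Authentication can be turned on or off for each of the two ports through the variables {disable_http_basic_auth} and {disable_pprof_basic_auth}.
monagent_basic_auth.yaml
## Configuration of the config module; usually no changes are needed
monagent_config.yaml
## Host monitoring pipeline configuration template; usually no changes are needed
monitor_node_host.yaml
## OceanBase monitoring pipeline configuration template; usually no changes are needed
monitor_ob.yaml
```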
**KV configuration items**
```bash
## KV configuration items for basic auth
monagent_basic_auth.yaml
## KV configuration items for pipelines
monagent_pipeline.yaml
```
**KV configuration item reference:**
```yaml
# Configuration items with encrypted=true must be stored encrypted. Currently only aes encryption is supported.
# Replace the variables in {} with your real values. If the encryption method in the monagent startup configuration is set to aes, configure the encrypted values.
## basic auth (monagent_basic_auth.yaml)
configVersion: "2021-08-20T07:52:28.5443+08:00"
configs:
  - key: http.server.basic.auth.username
    value: {http_basic_auth_user}
    valueType: string
  - key: http.server.basic.auth.password
    value: {http_basic_auth_password}
    valueType: string
    encrypted: true
  - key: http.admin.basic.auth.username
    value: {pprof_basic_auth_user}
    valueType: string
  - key: http.admin.basic.auth.password
    value: {pprof_basic_auth_password}
    valueType: string
    encrypted: true
## pipeline (monagent_pipeline.yaml)
configVersion: "2021-08-20T07:52:28.5443+08:00"
configs:
  - key: monagent.ob.monitor.user
    value: {monitor_user}
    valueType: string
  - key: monagent.ob.monitor.password
    value: {monitor_password}
    valueType: string
    encrypted: true
  - key: monagent.ob.sql.port
    value: {sql_port}
    valueType: int64
  - key: monagent.ob.rpc.port
    value: {rpc_port}
    valueType: int64
  - key: monagent.host.ip
    value: {host_ip}
    valueType: string
  - key: monagent.ob.cluster.name
    value: {cluster_name}
    valueType: string
  - key: monagent.ob.cluster.id
    value: {cluster_id}
    valueType: int64
  - key: monagent.ob.zone.name
    value: {zone_name}
    valueType: string
  - key: monagent.pipeline.ob.status
    value: {ob_monitor_status}
    valueType: string
  - key: monagent.pipeline.node.status
    value: {host_monitor_status}
    valueType: string
```
**Start monagent:**
```bash
# Using supervisor to launch the process is recommended
# Start command
cd /home/admin/obagent
nohup ./bin/monagent -c conf/monagent.yaml >> ./log/monagent_stdout.log 2>&1 &
# Sample supervisor configuration
[program:monagent]
command=./bin/monagent -c conf/monagent.yaml
directory=/home/admin/obagent
autostart=true
autorestart=true
redirect_stderr=true
priority=10
stdout_logfile=log/monagent_stdout.log
```
**Configuration updates:**
OBAgent provides a configuration update interface. You can update KV configuration items through the HTTP service, as follows:
```bash
# To update several KV configuration items at once, list multiple key/value pairs
curl --user user:pass -H "Content-Type:application/json" -d '{"configs":[{"key":"monagent.pipeline.ob.status", "value":"active"}]}' -L 'http://ip:port/api/v1/module/config/update'
```
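The same update call can be issued from Go. The sketch below builds the JSON payload and the authenticated request; `ConfigItem`, `BuildPayload`, and `NewUpdateRequest` are hypothetical helper names, and the base URL and credentials are placeholders. Only the endpoint path, the payload shape, and the use of basic auth are taken from the curl command above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ConfigItem is one key/value pair in the update payload.
type ConfigItem struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// updateBody mirrors the {"configs": [...]} shape used by the curl example.
type updateBody struct {
	Configs []ConfigItem `json:"configs"`
}

// BuildPayload serializes the items into the JSON request body.
func BuildPayload(items []ConfigItem) ([]byte, error) {
	return json.Marshal(updateBody{Configs: items})
}

// NewUpdateRequest prepares a POST request with basic auth and a JSON body.
func NewUpdateRequest(baseURL, user, pass string, items []ConfigItem) (*http.Request, error) {
	payload, err := BuildPayload(items)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/module/config/update", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder address and credentials; replace with your own.
	req, err := NewUpdateRequest("http://127.0.0.1:8088", "user", "pass",
		[]ConfigItem{{Key: "monagent.pipeline.ob.status", Value: "active"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send the request: resp, err := http.DefaultClient.Do(req)
}
```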
# Prometheus scrape configuration
**Sample Prometheus configuration**
```yaml
# The OBAgent RPM package ships a Prometheus configuration template that you can adjust to your environment.
# If basic auth is enabled, configure {http_basic_auth_user} and {http_basic_auth_password}.
# Replace {target} with the host IP address and port.
# The rules directory contains two alerting templates, the default host and OceanBase alert configurations. Use them as a reference for custom alerts.
global:
  scrape_interval: 1s
  evaluation_interval: 10s
rule_files:
  - "rules/*rules.yaml"
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - 'localhost:9090'
  - job_name: node
    basic_auth:
      username: {http_basic_auth_user}
      password: {http_basic_auth_password}
    metrics_path: /metrics/node/host
    scheme: http
    static_configs:
      - targets:
          - {target}
  - job_name: ob_basic
    basic_auth:
      username: {http_basic_auth_user}
      password: {http_basic_auth_password}
    metrics_path: /metrics/ob/basic
    scheme: http
    static_configs:
      - targets:
          - {target}
  - job_name: ob_extra
    basic_auth:
      username: {http_basic_auth_user}
      password: {http_basic_auth_password}
    metrics_path: /metrics/ob/extra
    scheme: http
    static_configs:
      - targets:
          - {target}
  - job_name: agent
    basic_auth:
      username: {http_basic_auth_user}
      password: {http_basic_auth_password}
    metrics_path: /metrics/stat
    scheme: http
    static_configs:
      - targets:
          - {target}
```
**Start Prometheus**
```bash
# Download Prometheus first
./prometheus --config.file=./prometheus.yaml
```
**View the exporter status in Prometheus**
// TODO: replace with real url
![Screenshot 2021-08-12 at 20.29.35.png](https://intranetproxy.alipay.com/skylark/lark/0/2021/png/28412/1628771389150-58415f6b-f455-416a-8329-46822eb292dc.png#clientId=u9961547e-7621-4&from=ui&id=u8d70909f&margin=%5Bobject%20Object%5D&name=Screenshot%202021-08-12%20at%2020.29.35.png&originHeight=964&originWidth=1980&originalType=binary&ratio=1&size=198565&status=done&style=none&taskId=u8a79e4af-ee50-4d8b-bd92-3ae6f780cdd)
**Compute monitoring metrics in Prometheus**
![Screenshot 2021-08-12 at 20.32.53.png](https://intranetproxy.alipay.com/skylark/lark/0/2021/png/28412/1628771585474-6b14d363-22d5-4748-a032-299c5a08484c.png#clientId=u9961547e-7621-4&from=ui&id=u8297d38a&margin=%5Bobject%20Object%5D&name=Screenshot%202021-08-12%20at%2020.32.53.png&originHeight=842&originWidth=2558&originalType=binary&ratio=1&size=177625&status=done&style=none&taskId=uf184d0a6-af77-4418-84d2-2b3fb736d00)
**Configure alerting:**
1. **Deploy alertmanager**
```bash
# 1. Download alertmanager
# 2. Extract the archive
# 3. Start it
./alertmanager --config.file=alertmanager.yaml
# For configuration details, see https://www.prometheus.io/docs/alerting/latest/configuration/
```
2. **Configure Prometheus**
```yaml
# Add the alerting configuration to Prometheus. Per the configuration file above, alert rule files live in the rules directory and are named to match *rules.yaml.
# Disk monitoring is used as an example.
groups:
  - name: node-alert
    rules:
      - alert: disk-full
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"}) > 80
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} disk full "
          description: "{{ $labels.instance }} disk > {{ $value }} "
```
3. **View alerts**
![Screenshot 2021-09-05 at 22.16.57.png](https://intranetproxy.alipay.com/skylark/lark/0/2021/png/28412/1630851428176-e5dd994c-b6ad-4e56-8cf5-a93dda31d639.png#clientId=udbbe738a-dc94-4&from=ui&id=uca310bcd&margin=%5Bobject%20Object%5D&name=Screenshot%202021-09-05%20at%2022.16.57.png&originHeight=1326&originWidth=2880&originalType=binary&ratio=1&size=337214&status=done&style=none&taskId=u9ae5bd9d-3c54-4581-9ea8-dae5f49e898)
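# Query expressions for common metrics
This topic describes the query expressions for common OBAgent metrics.
> **Note**: When you run a query, you must replace the variables with the specific values to query.
The fields to replace are as follows:
- @LABELS: replace with the filter conditions on specific labels.
- @INTERVAL: replace with the calculation interval.
- @GBLABELS: replace with the names of the labels to aggregate by.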
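The substitution of the three placeholders can be scripted. The following sketch uses a hypothetical `Render` helper; the label filter, interval, and grouping label in the example are sample values, not recommendations.

```go
package main

import (
	"fmt"
	"strings"
)

// Render fills @LABELS, @INTERVAL, and @GBLABELS in an expression template.
func Render(expr, labels, interval, groupBy string) string {
	return strings.NewReplacer(
		"@GBLABELS", groupBy,
		"@LABELS", labels,
		"@INTERVAL", interval,
	).Replace(expr)
}

func main() {
	tmpl := `sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS)`
	// Sample values: filter on one cluster, 1-minute window, group by server IP.
	fmt.Println(Render(tmpl, `ob_cluster_name="demo"`, "1m", "svr_ip"))
}
```

With these sample values, the template for Select statements per second becomes `sum(rate(ob_sysstat{stat_id="40000",ob_cluster_name="demo"}[1m])) by (svr_ip)`.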
| **Metric** | **Expression** | **Unit** |
| --- | --- | --- |
| Active MEMStore size | `sum(ob_sysstat{stat_id="130000",@LABELS}) by (@GBLABELS) / 1048576` | MB |
| Active sessions | `sum(ob_active_session_num{@LABELS}) by (@GBLABELS)` | - |
| Block cache hit rate | `100 * 1 / (1 + sum(rate(ob_sysstat{stat_id="50009",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="50008",@LABELS}[@INTERVAL])) by (@GBLABELS))` | % |
| Block cache size | `sum(ob_cache_size_bytes{cache_name="user_block_cache",@LABELS}) by (@GBLABELS) / 1048576` | MB |
| Transaction log bytes committed per second | `sum(rate(ob_sysstat{stat_id="80057",@LABELS}[@INTERVAL])) by (@GBLABELS)` | byte |
| Average time per transaction log disk write | `sum(rate(ob_sysstat{stat_id="80041",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="80040",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| CPU usage | `100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS))` | % |
| Used disk partition capacity | `sum(node_filesystem_size_bytes{@LABELS} - node_filesystem_avail_bytes{@LABELS}) by (@GBLABELS) / 1073741824` | GB |
| Disk reads per second | `avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Data volume per read | `avg(rate(node_disk_read_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576` | MB |
| SSStore reads per second | `sum(rate(ob_sysstat{stat_id="60000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| SSStore average time per read | `sum(rate(ob_sysstat{stat_id="60001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="60000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| SSStore bytes read per second | `sum(rate(ob_sysstat{stat_id="60002",@LABELS}[@INTERVAL])) by (@GBLABELS)` | byte |
| Average read time per second | `1000000 * (avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| Average time per read IO | `1000000 * (avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_reads_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| Disk writes per second | `avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Data volume per write | `avg(rate(node_disk_written_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576` | MB |
| SSStore writes per second | `sum(rate(ob_sysstat{stat_id="60003",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| SSStore average time per write | `sum(rate(ob_sysstat{stat_id="60004",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="60003",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| SSStore bytes written per second | `sum(rate(ob_sysstat{stat_id="60005",@LABELS}[@INTERVAL])) by (@GBLABELS)` | byte |
| Average write time per second | `1000000 * (avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| Average time per write IO | `1000000 * (avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)) / (avg(rate(node_disk_writes_completed_total{@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| System load average over the past 1 minute | `avg(node_load1{@LABELS}) by (@GBLABELS)` | - |
| System load average over the past 15 minutes | `avg(node_load15{@LABELS}) by (@GBLABELS)` | - |
| System load average over the past 5 minutes | `avg(node_load5{@LABELS}) by (@GBLABELS)` | - |
| Merge trigger threshold | `sum(ob_sysstat{stat_id="130002",@LABELS}) by (@GBLABELS) / 1048576` | MB |
| Kernel buffer cache size | `avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS) / 1073741824` | GB |
| Free physical memory | `avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) / 1073741824` | GB |
| Used physical memory | `(avg(node_memory_MemTotal_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_Cached_bytes{@LABELS}) by (@GBLABELS) - avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS)) / 1073741824` | GB |
| MEMStore limit | `sum(ob_sysstat{stat_id="130004",@LABELS}) by (@GBLABELS) / 1048576` | MB |
| MEMStore usage percentage | `100 * sum(ob_sysstat{stat_id="130001",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="130004",@LABELS}) by (@GBLABELS)` | % |
| Failed write-lock waits | `sum(rate(ob_sysstat{stat_id="60022",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Successful write-lock waits | `sum(rate(ob_sysstat{stat_id="60021",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average write-lock wait time | `sum(rate(ob_sysstat{stat_id="60023",@LABELS}[@INTERVAL])) by (@GBLABELS) / (sum(rate(ob_sysstat{stat_id="60021",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="60022",@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| Data received per second | `avg(rate(node_network_receive_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576` | MB |
| Data sent per second | `avg(rate(node_network_transmit_bytes_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / 1048576` | MB |
| CPU usage | `100 * sum(ob_sysstat{stat_id="140006",@LABELS}) by (@GBLABELS) / sum(ob_sysstat{stat_id="140005",@LABELS}) by (@GBLABELS)` | % |
| Partition count | `sum(ob_partition_num{@LABELS}) by (@GBLABELS)` | - |
| Plan cache hit rate | `100 * sum(rate(ob_plan_cache_hit_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_plan_cache_access_total{@LABELS}[@INTERVAL])) by (@GBLABELS)` | % |
| Plan cache size | `sum(ob_plan_cache_memory_bytes{@LABELS}) by (@GBLABELS) / 1048576` | MB |
| SQL requests entering the wait queue per second | `sum(rate(ob_sysstat{stat_id="20001",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| SQL wait time in the wait queue | `sum(rate(ob_sysstat{stat_id="20002",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="20001",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Row cache hit rate | `100 * 1 / (1 + sum(rate(ob_sysstat{stat_id="50001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="50000",@LABELS}[@INTERVAL])) by (@GBLABELS))` | % |
| Cache size | `sum(ob_cache_size_bytes{@LABELS}) by (@GBLABELS) / 1048576` | MB |
| RPC receive throughput | `sum(rate(ob_sysstat{stat_id="10001",@LABELS}[@INTERVAL])) by (@GBLABELS)` | byte |
| Average RPC receive time | `(sum(rate(ob_sysstat{stat_id="10005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="10006",@LABELS}[@INTERVAL])) by (@GBLABELS)) / sum(rate(ob_sysstat{stat_id="10000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| RPC send throughput | `sum(rate(ob_sysstat{stat_id="10003",@LABELS}[@INTERVAL])) by (@GBLABELS)` | byte |
| Average RPC send time | `(sum(rate(ob_sysstat{stat_id="10005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="10006",@LABELS}[@INTERVAL])) by (@GBLABELS)) / sum(rate(ob_sysstat{stat_id="10002",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| SQL statements processed per second | `sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per SQL statement | `(sum(rate(ob_sysstat{stat_id="40003",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40005",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40007",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40009",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40001",@LABELS}[@INTERVAL])) by (@GBLABELS)) / (sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS))` | us |
| Delete statements processed per second | `sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per Delete statement | `sum(rate(ob_sysstat{stat_id="40009",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40008",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Distributed execution plans processed per second | `sum(rate(ob_sysstat{stat_id="40012",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Insert statements processed per second | `sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per Insert statement | `sum(rate(ob_sysstat{stat_id="40003",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40002",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Local executions processed per second | `sum(rate(ob_sysstat{stat_id="40010",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Remote execution plans processed per second | `sum(rate(ob_sysstat{stat_id="40011",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Replace statements processed per second | `sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per Replace statement | `sum(rate(ob_sysstat{stat_id="40005",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40004",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Select statements processed per second | `sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per Select statement | `sum(rate(ob_sysstat{stat_id="40001",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40000",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Update statements processed per second | `sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per Update statement | `sum(rate(ob_sysstat{stat_id="40007",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="40006",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Table count | `max(ob_table_num{@LABELS}) by (@GBLABELS)` | - |
| Total MEMStore size | `sum(ob_sysstat{stat_id="130001",@LABELS}) by (@GBLABELS) / 1048576` | MB |
| Transactions processed per second | `sum(rate(ob_sysstat{stat_id="30005",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average server-side processing time per transaction | `sum(rate(ob_sysstat{stat_id="30006",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="30005",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Transaction logs committed per second | `sum(rate(ob_sysstat{stat_id="30002",@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average network sync time per transaction log | `sum(rate(ob_sysstat{stat_id="30000",@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_sysstat{stat_id="30001",@LABELS}[@INTERVAL])) by (@GBLABELS)` | us |
| Wait events per second | `sum(rate(ob_waitevent_wait_total{@LABELS}[@INTERVAL])) by (@GBLABELS)` | times/s |
| Average wait event time | `sum(rate(ob_waitevent_wait_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(ob_waitevent_wait_total{@LABELS}[@INTERVAL])) by (@GBLABELS)` | s |
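The cache hit rate rows use the form `100 * 1 / (1 + miss / hit)`, which is algebraically the same as `100 * hit / (hit + miss)`. The short sketch below works one example through that form; `HitRatePercent` is a hypothetical helper and the rates are made-up sample numbers.

```go
package main

import "fmt"

// HitRatePercent mirrors the table's 100 * 1 / (1 + miss/hit) form.
// Both arguments are per-second rates of hits and misses.
func HitRatePercent(hitRate, missRate float64) float64 {
	return 100 * 1 / (1 + missRate/hitRate)
}

func main() {
	// Sample rates: 900 hits/s and 100 misses/s.
	fmt.Printf("%.1f%%\n", HitRatePercent(900, 100))
}
```

With floating-point arithmetic, a hit rate of zero combined with a non-zero miss rate makes the inner ratio infinite, so this form evaluates to 0 instead of failing on a division by zero.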