diff --git a/CHANGES.md b/CHANGES.md
index 989dc381221e3142a118898f2be7d0ac835a390a..e0574e6b7e37661722b2a3c2b575c5ecfcd0fa89 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -101,6 +101,7 @@ Release Notes.
 #### Documentation
 * Polish documentation due to we have covered all tracing, logging, and metrics fields.
 * Adjust documentation about Zipkin receiver.
+* Add backend-infrastructure-monitoring doc.
 
 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/76?closed=1)
diff --git a/docs/en/setup/backend/backend-infrastructure-monitoring.md b/docs/en/setup/backend/backend-infrastructure-monitoring.md
new file mode 100644
index 0000000000000000000000000000000000000000..ddbf9b7eefd971cc371cc021b5b46c8617c5eeb2
--- /dev/null
+++ b/docs/en/setup/backend/backend-infrastructure-monitoring.md
@@ -0,0 +1,119 @@
+# VMs monitoring
+SkyWalking leverages Prometheus node-exporter to collect metrics data from the VMs, and leverages OpenTelemetry Collector to transfer the metrics to the
+[OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver) and into the [Meter System](./../../concepts-and-designs/meter.md).
+We define the VM entity as a `Service` in OAP, and use the `vm::` prefix to identify it.
+
+## Data flow
+1. Prometheus node-exporter collects metrics data from the VMs.
+2. OpenTelemetry Collector fetches metrics from node-exporter via Prometheus Receiver and pushes the metrics to the SkyWalking OAP Server via the OpenCensus gRPC Exporter.
+3. The SkyWalking OAP Server parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter/calculate/aggregate, and stores the results.
+
+## Setup
+1. Set up [Prometheus node-exporter](https://prometheus.io/docs/guides/node-exporter/).
+2. Set up [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/). [otel-collector-config.yaml](../../../../test/e2e/e2e-test/docker/promOtelVM/otel-collector-config.yaml) is an example OpenTelemetry Collector configuration.
+3. 
Configure the SkyWalking [OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver).
+
+## Supported Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-----|-----|-----|-----|-----|
+| CPU Usage | % | cpu_total_percentage | The total percentage usage of all CPU cores; if there are 2 cores, the maximum usage is 200% | Prometheus node-exporter |
+| Memory RAM Usage | MB | meter_vm_memory_used | The total RAM usage | Prometheus node-exporter |
+| Memory Swap Usage | % | meter_vm_memory_swap_percentage | The percentage usage of swap memory | Prometheus node-exporter |
+| CPU Average Used | % | meter_vm_cpu_average_used | The percentage usage of CPU cores in each mode | Prometheus node-exporter |
+| CPU Load | | meter_vm_cpu_load1<br/>meter_vm_cpu_load5<br/>meter_vm_cpu_load15 | The CPU 1m / 5m / 15m average load | Prometheus node-exporter |
+| Memory RAM | MB | meter_vm_memory_total<br/>meter_vm_memory_available<br/>meter_vm_memory_used | The RAM statistics, including Total / Available / Used | Prometheus node-exporter |
+| Memory Swap | MB | meter_vm_memory_swap_free<br/>meter_vm_memory_swap_total | The swap memory statistics, including Free / Total | Prometheus node-exporter |
+| File System Mountpoint Usage | % | meter_vm_filesystem_percentage | The percentage usage of the file system at each mount point | Prometheus node-exporter |
+| Disk R/W | KB/s | meter_vm_disk_read,meter_vm_disk_written | The disk read and write rates | Prometheus node-exporter |
+| Network Bandwidth Usage | KB/s | meter_vm_network_receive<br/>meter_vm_network_transmit | The network receive and transmit rates | Prometheus node-exporter |
+| Network Status | | meter_vm_tcp_curr_estab<br/>meter_vm_tcp_tw<br/>meter_vm_tcp_alloc<br/>meter_vm_sockets_used<br/>meter_vm_udp_inuse | The number of TCP established / TCP time wait / TCP allocated / sockets in use / UDP in use | Prometheus node-exporter |
+| Filefd Allocated | | meter_vm_filefd_allocated | The number of file descriptors allocated | Prometheus node-exporter |
+
+## Customizing
+You can customize your own metrics/expression/dashboard panel.
+The metrics definitions and expression rules are in `/config/otel-oc-rules/vm.yaml`.
+The dashboard panel configurations are in `/config/ui-initialized-templates/vm.yml`.
+
+## Blog
+For more details, see the blog post [SkyWalking 8.4 provides infrastructure monitoring](https://skywalking.apache.org/blog/2021-02-07-infrastructure-monitoring/).
+
+# K8s monitoring
+SkyWalking leverages K8s kube-state-metrics and cAdvisor to collect metrics data from K8s, and leverages OpenTelemetry Collector to transfer the metrics to the
+[OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver) and into the [Meter System](./../../concepts-and-designs/meter.md). This feature requires authorizing the OAP Server to access K8s's `API Server`.
+We define the k8s-cluster as a `Service` in OAP, and use the `k8s-cluster::` prefix to identify it.
+The k8s-node is defined as an `Instance` in OAP, named after the k8s `node name`.
+The k8s-service is defined as an `Endpoint` in OAP, named `$serviceName.$namespace`.
+
+## Data flow
+1. K8s kube-state-metrics and cAdvisor collect metrics data from K8s.
+2. OpenTelemetry Collector fetches metrics from kube-state-metrics and cAdvisor via Prometheus Receiver and pushes the metrics to the SkyWalking OAP Server via the OpenCensus gRPC Exporter.
+3. The SkyWalking OAP Server accesses K8s's `API Server` to get metadata, and parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter/calculate/aggregate, then stores the results.
+
+## Setup
+1. Set up [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics#kubernetes-deployment).
+2. cAdvisor is integrated into `kubelet` by default.
+3. 
Set up [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/getting-started/#kubernetes). For the Prometheus Receiver configuration in K8s, you can refer to [this example](https://github.com/prometheus/prometheus/blob/main/documentation/examples/prometheus-kubernetes.yml). For a quick start, we provide a full example OpenTelemetry Collector configuration: [otel-collector-config.yaml](otel-collector-config.yaml).
+4. Configure the SkyWalking [OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver).
+
+## Supported Metrics
+K8s can be monitored from different points of view, so there are 3 kinds of metrics: [Cluster](#cluster) / [Node](#node) / [Service](#service)
+
+### Cluster
+These metrics are related to the selected cluster (`Current Service` in the dashboard).
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-----|-----|-----|-----|-----|
+| Node Total | | k8s_cluster_node_total | The number of nodes | K8s kube-state-metrics |
+| Namespace Total | | k8s_cluster_namespace_total | The number of namespaces | K8s kube-state-metrics |
+| Deployment Total | | k8s_cluster_deployment_total | The number of deployments | K8s kube-state-metrics |
+| Service Total | | k8s_cluster_service_total | The number of services | K8s kube-state-metrics |
+| Pod Total | | k8s_cluster_pod_total | The number of pods | K8s kube-state-metrics |
+| Container Total | | k8s_cluster_container_total | The number of containers | K8s kube-state-metrics |
+| CPU Resources | m | k8s_cluster_cpu_cores<br/>k8s_cluster_cpu_cores_requests<br/>k8s_cluster_cpu_cores_limits<br/>k8s_cluster_cpu_cores_allocatable | The capacity and the Requests / Limits / Allocatable of the CPU | K8s kube-state-metrics |
+| Memory Resources | GB | k8s_cluster_memory_total<br/>k8s_cluster_memory_requests<br/>k8s_cluster_memory_limits<br/>k8s_cluster_memory_allocatable | The capacity and the Requests / Limits / Allocatable of the memory | K8s kube-state-metrics |
+| Storage Resources | GB | k8s_cluster_storage_total<br/>k8s_cluster_storage_allocatable | The capacity and the Allocatable of the storage | K8s kube-state-metrics |
+| Node Status | | k8s_cluster_node_status | The current status of the nodes | K8s kube-state-metrics |
+| Deployment Status | | k8s_cluster_deployment_status | The current status of the deployments | K8s kube-state-metrics |
+| Deployment Spec Replicas | | k8s_cluster_deployment_spec_replicas | The number of desired pods for a deployment | K8s kube-state-metrics |
+| Service Status | | k8s_cluster_service_pod_status | The current status of the services, depending on the status of their related pods | K8s kube-state-metrics |
+| Pod Status Not Running | | k8s_cluster_pod_status_not_running | The pods whose current phase is not running | K8s kube-state-metrics |
+| Pod Status Waiting | | k8s_cluster_pod_status_waiting | The pods and containers which are currently in the waiting status, with the reason shown | K8s kube-state-metrics |
+| Pod Status Terminated | | k8s_cluster_container_status_terminated | The pods and containers which are currently in the terminated status, with the reason shown | K8s kube-state-metrics |
+
+### Node
+These metrics are related to the selected node (`Current Instance` in the dashboard).
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-----|-----|-----|-----|-----|
+| Pod Total | | k8s_node_pod_total | The number of pods on this node | K8s kube-state-metrics |
+| Node Status | | k8s_node_node_status | The current status of this node | K8s kube-state-metrics |
+| CPU Resources | m | k8s_node_cpu_cores<br/>k8s_node_cpu_cores_allocatable<br/>k8s_node_cpu_cores_requests<br/>k8s_node_cpu_cores_limits | The capacity and the Requests / Limits / Allocatable of the CPU | K8s kube-state-metrics |
+| Memory Resources | GB | k8s_node_memory_total<br/>k8s_node_memory_allocatable<br/>k8s_node_memory_requests<br/>k8s_node_memory_limits | The capacity and the Requests / Limits / Allocatable of the memory | K8s kube-state-metrics |
+| Storage Resources | GB | k8s_node_storage_total<br/>k8s_node_storage_allocatable | The capacity and the Allocatable of the storage | K8s kube-state-metrics |
+| CPU Usage | m | k8s_node_cpu_usage | The total usage of all CPU cores; if there are 2 cores, the maximum usage is 2000m | cAdvisor |
+| Memory Usage | GB | k8s_node_memory_usage | The total memory usage | cAdvisor |
+| Network I/O | KB/s | k8s_node_network_receive<br/>k8s_node_network_transmit | The network receive and transmit rates | cAdvisor |
+
+### Service
+In these metrics, the pods are related to the selected service (`Current Endpoint` in the dashboard).
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-----|-----|-----|-----|-----|
+| Service Pod Total | | k8s_service_pod_total | The number of pods | K8s kube-state-metrics |
+| Service Pod Status | | k8s_service_pod_status | The current status of the pods | K8s kube-state-metrics |
+| Service CPU Resources | m | k8s_service_cpu_cores_requests<br/>k8s_service_cpu_cores_limits | The CPU resources Requests / Limits of this service | K8s kube-state-metrics |
+| Service Memory Resources | MB | k8s_service_memory_requests<br/>k8s_service_memory_limits | The memory resources Requests / Limits of this service | K8s kube-state-metrics |
+| Pod CPU Usage | m | k8s_service_pod_cpu_usage | The total CPU usage of the pods | cAdvisor |
+| Pod Memory Usage | MB | k8s_service_pod_memory_usage | The total memory usage of the pods | cAdvisor |
+| Pod Waiting | | k8s_service_pod_status_waiting | The pods and containers which are currently in the waiting status, with the reason shown | K8s kube-state-metrics |
+| Pod Terminated | | k8s_service_pod_status_terminated | The pods and containers which are currently in the terminated status, with the reason shown | K8s kube-state-metrics |
+| Pod Restarts | | k8s_service_pod_status_restarts_total | The number of per-container restarts related to the pod | K8s kube-state-metrics |
+| Pod Network Receive | KB/s | k8s_service_pod_network_receive | The network receive rate of the pods | cAdvisor |
+| Pod Network Transmit | KB/s | k8s_service_pod_network_transmit | The network transmit rate of the pods | cAdvisor |
+| Pod Storage Usage | MB | k8s_service_pod_fs_usage | The total storage usage of the pods related to this service | cAdvisor |
+
+## Customizing
+You can customize your own metrics/expression/dashboard panel.
+The metrics definitions and expression rules are in `/config/otel-oc-rules/k8s-cluster.yaml`, `/config/otel-oc-rules/k8s-node.yaml`, and `/config/otel-oc-rules/k8s-service.yaml`.
+The dashboard panel configurations are in `/config/ui-initialized-templates/k8s.yml`.
\ No newline at end of file
diff --git a/docs/en/setup/backend/otel-collector-config.yaml b/docs/en/setup/backend/otel-collector-config.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..b27281f938e56c8b5bb2685b564b767a5ce9a74d
--- /dev/null
+++ b/docs/en/setup/backend/otel-collector-config.yaml
@@ -0,0 +1,199 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. 
See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: otel-collector-conf
+  labels:
+    app: opentelemetry
+    component: otel-collector-conf
+  namespace: monitoring
+data:
+  otel-collector-config: |
+    receivers:
+      prometheus:
+        config:
+          global:
+            scrape_interval: 15s
+            evaluation_interval: 15s
+          scrape_configs:
+          - job_name: 'kubernetes-cadvisor'
+            scheme: https
+            metrics_path: /metrics/cadvisor
+            tls_config:
+              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+            kubernetes_sd_configs:
+            - role: node
+            relabel_configs:
+            - action: labelmap
+              regex: __meta_kubernetes_node_label_(.+)
+            - source_labels: []
+              target_label: cluster # relabel the cluster name
+              replacement: gke-cluster-1
+            - source_labels: [instance] # relabel the node name
+              separator: ;
+              regex: (.+)
+              target_label: node
+              replacement: $$1
+              action: replace
+          - job_name: kube-state-metrics
+            kubernetes_sd_configs:
+            - role: endpoints
+            relabel_configs:
+            - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
+              regex: kube-state-metrics
+              replacement: $$1
+              action: keep
+            - action: labelmap
+              regex: __meta_kubernetes_service_label_(.+)
+            - source_labels: [] # relabel the cluster name
+              target_label: cluster
+              replacement: gke-cluster-1
+    processors:
+      batch:
+    extensions:
+      health_check: {}
+      zpages: {}
+    exporters:
+      opencensus:
+        endpoint: "OAP:11800" # The OAP Server address
+        insecure: true
+      logging:
+        logLevel: debug
+    service:
+      extensions: [health_check, zpages]
+      pipelines:
+        metrics:
+          receivers: [prometheus]
+          processors: [batch]
+          exporters: [opencensus,logging]
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: otel-collector
+  labels:
+    app: opentelemetry
+    component: otel-collector
+  namespace: monitoring
+spec:
+  ports:
+  - name: otlp # Default endpoint for OpenTelemetry receiver.
+    port: 55680
+    protocol: TCP
+    targetPort: 55680
+  - name: metrics # Default endpoint for querying metrics.
+    port: 8888
+  selector:
+    component: otel-collector
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: otel-collector
+  labels:
+    app: opentelemetry
+    component: otel-collector
+  namespace: monitoring
+spec:
+  selector:
+    matchLabels:
+      app: opentelemetry
+      component: otel-collector
+  minReadySeconds: 5
+  progressDeadlineSeconds: 120
+  replicas: 1 #TODO - adjust this to your own requirements
+  template:
+    metadata:
+      labels:
+        app: opentelemetry
+        component: otel-collector
+    spec:
+      containers:
+      - command:
+          - "/otelcol"
+          - "--config=/conf/otel-collector-config.yaml"
+          - "--log-level=DEBUG"
+# Memory Ballast size should be max 1/3 to 1/2 of memory.
+          - "--mem-ballast-size-mib=683"
+        image: otel/opentelemetry-collector-dev:latest
+        name: otel-collector
+        resources:
+          limits:
+            cpu: 1
+            memory: 2Gi
+          requests:
+            cpu: 200m
+            memory: 400Mi
+        ports:
+        - containerPort: 55679 # Default endpoint for ZPages.
+        - containerPort: 55680 # Default endpoint for OpenTelemetry receiver.
+        - containerPort: 8888 # Default endpoint for querying metrics.
+        volumeMounts:
+        - name: otel-collector-config-vol
+          mountPath: /conf
+        livenessProbe:
+          httpGet:
+            path: /
+            port: 13133 # Health Check extension default port.
+        readinessProbe:
+          httpGet:
+            path: /
+            port: 13133 # Health Check extension default port.
+      volumes:
+        - configMap:
+            name: otel-collector-conf
+            items:
+              - key: otel-collector-config
+                path: otel-collector-config.yaml
+          name: otel-collector-config-vol
+
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: otel-collector
+rules:
+- apiGroups: [""]
+  resources:
+  - nodes
+  - nodes/proxy
+  - nodes/metrics
+  - services
+  - endpoints
+  - pods
+  verbs: ["get", "list", "watch"]
+- apiGroups:
+  - extensions
+  resources:
+  - ingresses
+  verbs: ["get", "list", "watch"]
+- nonResourceURLs: ["/metrics"]
+  verbs: ["get"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: otel-collector
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: otel-collector
+subjects:
+- kind: ServiceAccount
+  name: default
+  namespace: monitoring
\ No newline at end of file
diff --git a/docs/menu.yml b/docs/menu.yml
index 13d1a25d2b144100372921e8951373f9e91a7745..4df194b70cca60281d2b55358c17e27e09e5dfcb 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -141,6 +141,8 @@ catalog:
         path: "/en/setup/backend/spring-sleuth-setup"
       - name: "Log Analyzer"
         path: "/en/setup/backend/log-analyzer"
+      - name: "Infrastructure Monitoring"
+        path: "/en/setup/backend/backend-infrastructure-monitoring"
       - name: "UI Setup"
         path: "/en/setup/backend/ui-setup"
       - name: "CLI Setup"
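
Note for reviewers: both setup lists in the new doc end with "Configure the SkyWalking OpenTelemetry receiver", and the Collector manifest above exports to `OAP:11800`. As a sketch of the OAP side (not part of this diff — the environment-variable names and default values below are assumptions modeled on `backend-receivers.md` and should be checked against the target SkyWalking version), the receiver and its MAL rule sets would be enabled in the OAP's `application.yml` roughly like this:

```yaml
# Sketch only: enable the OpenTelemetry receiver in the OAP's application.yml.
# Variable names and defaults are assumptions -- verify against backend-receivers.md
# for your SkyWalking version.
receiver-otel:
  selector: ${SW_OTEL_RECEIVER:default}
  default:
    # "oc" enables the OpenCensus gRPC handler that the Collector's
    # opencensus exporter (endpoint OAP:11800 above) pushes to.
    enabledHandlers: ${SW_OTEL_RECEIVER_ENABLED_HANDLERS:"oc"}
    # Activate MAL rule files from /config/otel-oc-rules/ by file name
    # (without the .yaml suffix), matching the Customizing sections.
    enabledOcRules: ${SW_OTEL_RECEIVER_ENABLED_OC_RULES:"vm,k8s-cluster,k8s-node,k8s-service"}
```

With rules activated this way, metrics arriving on the OAP gRPC port are matched against the corresponding rule files in `/config/otel-oc-rules/` and stored as the metrics listed in the tables above.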