Unverified commit d6571606, authored by wankai123, committed by GitHub

Add backend-infrastructure-monitoring doc (#6711)

Parent: 58854af6
@@ -101,6 +101,7 @@ Release Notes.
#### Documentation
* Polish the documentation, since all tracing, logging, and metrics fields are now covered.
* Adjust documentation about Zipkin receiver.
* Add backend-infrastructure-monitoring doc.
All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/76?closed=1)
# VMs monitoring
SkyWalking leverages the Prometheus node-exporter to collect metrics data from VMs, and leverages the OpenTelemetry Collector to transfer the metrics to the
[OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver) and into the [Meter System](./../../concepts-and-designs/meter.md).
We define a VM entity as a `Service` in OAP, using the `vm::` prefix to identify it.
## Data flow
1. The Prometheus node-exporter collects metrics data from the VMs.
2. The OpenTelemetry Collector fetches metrics from the node-exporter via the Prometheus Receiver, and pushes them to the SkyWalking OAP Server via the OpenCensus gRPC Exporter.
3. The SkyWalking OAP Server parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter/calculate/aggregate, and stores the results.
## Setup
1. Set up the [Prometheus node-exporter](https://prometheus.io/docs/guides/node-exporter/).
2. Set up the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/). See [otel-collector-config.yaml](../../../../test/e2e/e2e-test/docker/promOtelVM/otel-collector-config.yaml) for an example OpenTelemetry Collector configuration.
3. Configure the SkyWalking [OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver).
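The steps above boil down to a collector config that pairs a Prometheus receiver (scraping the node-exporter, default port 9100) with an OpenCensus exporter pointing at the OAP gRPC port. The sketch below is illustrative only; the job name and the `node-exporter:9100` / `oap:11800` addresses are assumptions to adapt to your deployment, and the linked e2e example is the authoritative version.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Assumed scrape target; point this at your node-exporter instances.
        - job_name: 'vm-monitoring'
          scrape_interval: 10s
          static_configs:
            - targets: ['node-exporter:9100']

exporters:
  opencensus:
    # Assumed OAP address; replace with your SkyWalking OAP Server gRPC endpoint.
    endpoint: 'oap:11800'
    insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [opencensus]
```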
## Supported Metrics
| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|-----|-----|-----|-----|-----|
| CPU Usage | % | cpu_total_percentage | Total CPU usage across all cores; with 2 cores the maximum is 200% | Prometheus node-exporter |
| Memory RAM Usage | MB | meter_vm_memory_used | Total RAM usage | Prometheus node-exporter |
| Memory Swap Usage | % | meter_vm_memory_swap_percentage | Percentage of swap memory used | Prometheus node-exporter |
| CPU Average Used | % | meter_vm_cpu_average_used | Percentage of CPU used in each mode | Prometheus node-exporter |
| CPU Load | | meter_vm_cpu_load1<br />meter_vm_cpu_load5<br />meter_vm_cpu_load15 | The 1m / 5m / 15m CPU load averages | Prometheus node-exporter |
| Memory RAM | MB | meter_vm_memory_total<br />meter_vm_memory_available<br />meter_vm_memory_used | RAM statistics, including Total / Available / Used | Prometheus node-exporter |
| Memory Swap | MB | meter_vm_memory_swap_free<br />meter_vm_memory_swap_total | Swap memory statistics, including Free / Total | Prometheus node-exporter |
| File System Mountpoint Usage | % | meter_vm_filesystem_percentage | Percentage of the file system used at each mount point | Prometheus node-exporter |
| Disk R/W | KB/s | meter_vm_disk_read<br />meter_vm_disk_written | Disk read and write rates | Prometheus node-exporter |
| Network Bandwidth Usage | KB/s | meter_vm_network_receive<br />meter_vm_network_transmit | Network receive and transmit rates | Prometheus node-exporter |
| Network Status | | meter_vm_tcp_curr_estab<br />meter_vm_tcp_tw<br />meter_vm_tcp_alloc<br />meter_vm_sockets_used<br />meter_vm_udp_inuse | The number of TCP connections established / in TIME_WAIT / allocated, sockets in use, and UDP sockets in use | Prometheus node-exporter |
| Filefd Allocated | | meter_vm_filefd_allocated | The number of file descriptors allocated | Prometheus node-exporter |
## Customizing
You can customize your own metrics/expressions/dashboard panels.
The metric definitions and expression rules are in `/config/otel-oc-rules/vm.yaml`.
The dashboard panel configurations are in `/config/ui-initialized-templates/vm.yml`.
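The metric names in the table above (e.g. `meter_vm_memory_swap_percentage`) come from rules of roughly the following shape. This is an illustrative sketch, not the shipped `vm.yaml`: the expression and field layout are assumptions, so check the [MAL](../../concepts-and-designs/mal.md) documentation and the shipped rule file for the exact syntax.

```yaml
# Sketch of a MAL rule file; the expression below is an assumption
# for illustration, built from node-exporter swap gauges.
metricPrefix: meter_vm
metricsRules:
  # Final metric name: meter_vm_memory_swap_percentage
  - name: memory_swap_percentage
    exp: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
```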
## Blog
See the related blog post: [SkyWalking 8.4 provides infrastructure monitoring](https://skywalking.apache.org/blog/2021-02-07-infrastructure-monitoring/)
# K8s monitoring
SkyWalking leverages K8s kube-state-metrics and cAdvisor to collect metrics data from K8s, and leverages the OpenTelemetry Collector to transfer the metrics to the
[OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver) and into the [Meter System](./../../concepts-and-designs/meter.md). This feature requires authorizing the OAP Server to access K8s's `API Server`.
We define a k8s-cluster as a `Service` in OAP, using the `k8s-cluster::` prefix to identify it.
A k8s-node is defined as an `Instance` in OAP, named after the k8s `node name`.
A k8s-service is defined as an `Endpoint` in OAP, named `$serviceName.$namespace`.
## Data flow
1. K8s kube-state-metrics and cAdvisor collect metrics data from K8s.
2. The OpenTelemetry Collector fetches metrics from kube-state-metrics and cAdvisor via the Prometheus Receiver, and pushes them to the SkyWalking OAP Server via the OpenCensus gRPC Exporter.
3. The SkyWalking OAP Server accesses K8s's `API Server` to get meta info, and parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter/calculate/aggregate and store the results.
## Setup
1. Set up [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics#kubernetes-deployment).
2. cAdvisor is integrated into `kubelet` by default.
3. Set up the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/getting-started/#kubernetes). For the Prometheus Receiver configuration in a K8s cluster, refer to [this example](https://github.com/prometheus/prometheus/blob/main/documentation/examples/prometheus-kubernetes.yml). For a quick start, we provide a full example OpenTelemetry Collector configuration: [otel-collector-config.yaml](otel-collector-config.yaml).
4. Configure the SkyWalking [OpenTelemetry receiver](backend-receivers.md#opentelemetry-receiver).
## Supported Metrics
K8s is monitored from 3 points of view, so there are 3 groups of metrics: [Cluster](#cluster) / [Node](#node) / [Service](#service).
### Cluster
These metrics are related to the selected cluster (`Current Service` in the dashboard).
| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|-----|-----|-----|-----|-----|
| Node Total | | k8s_cluster_node_total | The number of nodes | K8s kube-state-metrics|
| Namespace Total | | k8s_cluster_namespace_total | The number of namespaces | K8s kube-state-metrics|
| Deployment Total | | k8s_cluster_deployment_total | The number of deployments | K8s kube-state-metrics|
| Service Total | | k8s_cluster_service_total | The number of services | K8s kube-state-metrics|
| Pod Total | | k8s_cluster_pod_total | The number of pods | K8s kube-state-metrics|
| Container Total | | k8s_cluster_container_total | The number of containers | K8s kube-state-metrics|
| CPU Resources | m | k8s_cluster_cpu_cores<br />k8s_cluster_cpu_cores_requests<br />k8s_cluster_cpu_cores_limits<br />k8s_cluster_cpu_cores_allocatable | The capacity and the Requests / Limits / Allocatable of the CPU | K8s kube-state-metrics|
| Memory Resources | GB | k8s_cluster_memory_total<br />k8s_cluster_memory_requests<br />k8s_cluster_memory_limits<br />k8s_cluster_memory_allocatable | The capacity and the Requests / Limits / Allocatable of the memory | K8s kube-state-metrics|
| Storage Resources | GB | k8s_cluster_storage_total<br />k8s_cluster_storage_allocatable | The capacity and the Allocatable of the storage | K8s kube-state-metrics|
| Node Status | | k8s_cluster_node_status | The current status of the nodes | K8s kube-state-metrics|
| Deployment Status | | k8s_cluster_deployment_status | The current status of the deployments | K8s kube-state-metrics|
| Deployment Spec Replicas | | k8s_cluster_deployment_spec_replicas | The number of desired pods for a deployment | K8s kube-state-metrics|
| Service Status | | k8s_cluster_service_pod_status | The current status of the services, depending on the related pods' status | K8s kube-state-metrics|
| Pod Status Not Running | | k8s_cluster_pod_status_not_running | The pods whose current phase is not running | K8s kube-state-metrics|
| Pod Status Waiting | | k8s_cluster_pod_status_waiting | The pods and containers currently in the waiting status, with the reason | K8s kube-state-metrics|
| Pod Status Terminated | | k8s_cluster_container_status_terminated | The pods and containers currently in the terminated status, with the reason | K8s kube-state-metrics|
### Node
These metrics are related to the selected node (`Current Instance` in the dashboard).
| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|-----|-----|-----|-----|-----|
| Pod Total | | k8s_node_pod_total | The number of pods on this node | K8s kube-state-metrics |
| Node Status | | k8s_node_node_status | The current status of this node | K8s kube-state-metrics |
| CPU Resources | m | k8s_node_cpu_cores<br />k8s_node_cpu_cores_allocatable<br />k8s_node_cpu_cores_requests<br />k8s_node_cpu_cores_limits | The capacity and the Requests / Limits / Allocatable of the CPU | K8s kube-state-metrics |
| Memory Resources | GB | k8s_node_memory_total<br />k8s_node_memory_allocatable<br />k8s_node_memory_requests<br />k8s_node_memory_limits | The capacity and the Requests / Limits / Allocatable of the memory | K8s kube-state-metrics |
| Storage Resources | GB | k8s_node_storage_total<br />k8s_node_storage_allocatable | The capacity and the Allocatable of the storage | K8s kube-state-metrics |
| CPU Usage | m | k8s_node_cpu_usage | Total CPU usage across all cores; with 2 cores the maximum is 2000m | cAdvisor |
| Memory Usage | GB | k8s_node_memory_usage | Total memory usage | cAdvisor |
| Network I/O | KB/s | k8s_node_network_receive<br />k8s_node_network_transmit | Network receive and transmit rates | cAdvisor |
### Service
In these metrics, the pods are related to the selected service (`Current Endpoint` in the dashboard).
| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|-----|-----|-----|-----|-----|
| Service Pod Total | | k8s_service_pod_total | The number of pods | K8s kube-state-metrics |
| Service Pod Status | | k8s_service_pod_status | The current status of the pods | K8s kube-state-metrics |
| Service CPU Resources | m | k8s_service_cpu_cores_requests<br />k8s_service_cpu_cores_limits | The CPU Requests / Limits of this service | K8s kube-state-metrics |
| Service Memory Resources | MB | k8s_service_memory_requests<br />k8s_service_memory_limits | The memory Requests / Limits of this service | K8s kube-state-metrics |
| Pod CPU Usage | m | k8s_service_pod_cpu_usage | Total CPU usage of the pods | cAdvisor |
| Pod Memory Usage | MB | k8s_service_pod_memory_usage | Total memory usage of the pods | cAdvisor |
| Pod Waiting | | k8s_service_pod_status_waiting | The pods and containers currently in the waiting status, with the reason | K8s kube-state-metrics |
| Pod Terminated | | k8s_service_pod_status_terminated | The pods and containers currently in the terminated status, with the reason | K8s kube-state-metrics |
| Pod Restarts | | k8s_service_pod_status_restarts_total | The number of per-container restarts related to the pods | K8s kube-state-metrics |
| Pod Network Receive | KB/s | k8s_service_pod_network_receive | The network receive rate of the pods | cAdvisor |
| Pod Network Transmit | KB/s | k8s_service_pod_network_transmit | The network transmit rate of the pods | cAdvisor |
| Pod Storage Usage | MB | k8s_service_pod_fs_usage | Total storage usage of the pods related to this service | cAdvisor |
## Customizing
You can customize your own metrics/expressions/dashboard panels.
The metric definitions and expression rules are in `/config/otel-oc-rules/k8s-cluster.yaml`, `/config/otel-oc-rules/k8s-node.yaml`, and `/config/otel-oc-rules/k8s-service.yaml`.
The dashboard panel configurations are in `/config/ui-initialized-templates/k8s.yml`.
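As with the VM rules, the cluster-level metrics come from MAL rule files. The sketch below is illustrative only, not the shipped `k8s-cluster.yaml`: the expression is an assumption built on the kube-state-metrics `kube_pod_info` series, and the exact syntax should be checked against the [MAL](../../concepts-and-designs/mal.md) documentation.

```yaml
# Sketch only; the shipped k8s-cluster.yaml differs in detail.
metricPrefix: k8s_cluster
metricsRules:
  # Final metric name: k8s_cluster_pod_total
  - name: pod_total
    exp: kube_pod_info.sum(['cluster'])
```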
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
  namespace: monitoring
data:
  otel-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 15s
            evaluation_interval: 15s
          scrape_configs:
          - job_name: 'kubernetes-cadvisor'
            scheme: https
            metrics_path: /metrics/cadvisor
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - source_labels: []
              target_label: cluster # relabel the cluster name
              replacement: gke-cluster-1
            - source_labels: [instance] # relabel the node name
              separator: ;
              regex: (.+)
              target_label: node
              replacement: $$1
              action: replace
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
            relabel_configs:
            - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
              regex: kube-state-metrics
              replacement: $$1
              action: keep
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - source_labels: [] # relabel the cluster name
              target_label: cluster
              replacement: gke-cluster-1
    processors:
      batch:
    extensions:
      health_check: {}
      zpages: {}
    exporters:
      opencensus:
        endpoint: "OAP:11800" # The OAP Server address
        insecure: true
      logging:
        logLevel: debug
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [opencensus,logging]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: monitoring
spec:
  ports:
  - name: otlp # Default endpoint for OpenTelemetry receiver.
    port: 55680
    protocol: TCP
    targetPort: 55680
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 #TODO - adjust this to your own requirements
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-collector-config.yaml"
          - "--log-level=DEBUG"
          # Memory Ballast size should be max 1/3 to 1/2 of memory.
          - "--mem-ballast-size-mib=683"
        image: otel/opentelemetry-collector-dev:latest
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 55680 # Default endpoint for OpenTelemetry receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring
@@ -141,6 +141,8 @@ catalog:
path: "/en/setup/backend/spring-sleuth-setup"
- name: "Log Analyzer"
path: "/en/setup/backend/log-analyzer"
- name: "Infrastructure Monitoring"
path: "/en/setup/backend/backend-infrastructure-monitoring"
- name: "UI Setup"
path: "/en/setup/backend/ui-setup"
- name: "CLI Setup"