prometheus.rst 11.1 KB
Newer Older
XuanDai's avatar
XuanDai 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314
.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr.  Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples.  This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

  ceph mgr module enable prometheus

Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration changes to be applied.

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host.  The port and listen address are both
configurable with ``ceph config-key set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.  This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

::

    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
    ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

    The ``scrape_interval`` of this module should always be set to match
    Prometheus' scrape interval to work properly and not cause any issues.
    
The Prometheus manager module is, by default, configured with a scrape interval
of 15 seconds.  The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

It is not recommended to use a scrape interval below 10 seconds.  It is
recommended to use 15 seconds as scrape interval, though, in some cases it
might be useful to increase the scrape interval.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

    ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant.  Without the cache, the Prometheus manager module could,
especially in conjunction with multiple Prometheus instances, overload the
manager and lead to unresponsive or crashing Ceph manager instances.  Hence,
the cache is enabled by default and cannot be disabled.  This means that there
is a possibility that the cache becomes stale.  The cache is considered stale
when the time to fetch the metrics from Ceph exceeds the configured
``scrape_interval``.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable) or,
* it will return the content of the cache, even though it might be stale.

This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config
set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy fail

.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

  ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module makes the list of all available images scanning the specified
pools and namespaces and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in sec) and is 300 sec (5 minutes) by default. The module will
force refresh earlier if it detects statistics from a previously unknown
RBD image.

Example to turn up the sync interval to 10 minutes::

  ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.


All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from.  Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ceph_daemon starting with "osd" to avoid mixing in the monitor
rocksdb stats.


The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on.  For example,
metrics relating to pools have a ``pool_id`` label.


The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0


Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like ``
ceph_disk_occupation`` have the value 1 so they act neutral with ``*``. Using ``*``
allows to use ``group_left`` and ``group_right`` grouping modifiers, so that
the resulting metric has additional labels from one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation``
will be the currently active MGR node.

 The following two section outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (cp.
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disks write rate, the following query can be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field.  This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do it in all cases in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires to add the
appropriate instance label to every ``node_exporter`` target individually.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval:     15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~


::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]


Notes
=====

Counters and gauges are exported; currently histograms and long-running 
averages are not.  It's possible that Ceph's 2-D histograms could be 
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously).  It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this.  This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.