Observability metrics of MongoDB Flex

This article explains the meaning of the observability metrics exported by MongoDB Flex. All metrics are available in OpenMetrics format and can be scraped by time-series databases such as Prometheus, VictoriaMetrics, or Grafana Mimir.

Indicates whether the last scrape was able to reach the database daemon and whether the daemon reported a healthy state. (type: gauge)

This is the most important metric for alerting. When the value is 0, all other metrics are unavailable.
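
A minimal availability alert fires whenever this gauge reports 0. The metric name below is assumed for illustration; use the name of this gauge as it appears in your scrape output:

mongodb_up == 0  # metric name assumed for illustration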

Server uptime in milliseconds (type: counter)

Can be used to calculate uptime percentages. A reset to 0 indicates a server restart.
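
Counter resets can be detected directly in PromQL. A sketch, assuming the uptime counter is exported as mongodb_uptime_milliseconds (name assumed for illustration; use the name from your scrape output):

resets(mongodb_uptime_milliseconds[15m]) > 0  # fires if the server restarted within the last 15 minutes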

Number of connections to the server (type: gauge)

A persistently high connection count can indicate misconfigured connection pools or connection leaks.

RAM used by the MongoDB process in megabytes (type: gauge)

Virtual memory used by the MongoDB process in megabytes (type: gauge)

Virtual memory is typically much larger than resident memory due to memory-mapped files.

Total operations by type since server startup (type: counter)

Use rate() to calculate operations per second. High rates with high latency indicate performance issues.
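
A sketch of the rate calculation, assuming the counter is exported as mongodb_opcounters with a type label (both names assumed for illustration; use the names from your scrape output):

sum by (type) (rate(mongodb_opcounters[1m]))  # operations per second, split by operation type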

Network traffic in bytes (type: counter)

The following metrics represent system-level resources. All include the label node_type.

Time the CPU spent waiting for I/O operations, in milliseconds (type: counter)

Calculate normalized CPU iowait: rate(hardware_system_cpu_io_wait_milliseconds[1m]) / 10 / hardware_platform_num_logical_cpus

Number of logical CPU cores (type: gauge)

hardware_system_memory_mem_available_kilobytes

Available memory in kilobytes (type: gauge)

More useful than mem_free as it includes reclaimable cache memory.
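
For alerting, a low-memory condition can be expressed directly on this gauge. A sketch; the 2 GiB threshold is an arbitrary example and should be tuned to the instance size:

hardware_system_memory_mem_available_kilobytes < 2 * 1024 * 1024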

In-memory cache for disk files in kilobytes (type: gauge)

High values are normal and indicate efficient RAM usage.

hardware_disk_metrics_disk_space_free_bytes / hardware_disk_metrics_disk_space_used_bytes

Free and used disk space in bytes (type: gauge)
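
A disk usage percentage can be derived from these two gauges, assuming both carry the same label set so the vectors match:

hardware_disk_metrics_disk_space_used_bytes / (hardware_disk_metrics_disk_space_used_bytes + hardware_disk_metrics_disk_space_free_bytes) * 100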

hardware_disk_metrics_read_count / hardware_disk_metrics_write_count

I/O operations processed (type: counter)

Calculate IOPS: rate(hardware_disk_metrics_read_count[30s]) + rate(hardware_disk_metrics_write_count[30s])

hardware_disk_metrics_read_time_milliseconds / hardware_disk_metrics_write_time_milliseconds

Wait time for I/O requests in milliseconds (type: counter)

Calculate latency: rate(hardware_disk_metrics_read_time_milliseconds[5m]) / rate(hardware_disk_metrics_read_count[5m])
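
Disk write latency can be calculated analogously: rate(hardware_disk_metrics_write_time_milliseconds[5m]) / rate(hardware_disk_metrics_write_count[5m])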

hardware_disk_metrics_weighted_time_io_milliseconds

Weighted time doing I/Os - indicates disk queue depth (type: counter)

High values suggest storage system struggles to keep up with I/O demand.
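
The average queue depth can be approximated from this counter, assuming it accumulates weighted I/O time in milliseconds as the name suggests (the same approach iostat uses for its average queue size):

rate(hardware_disk_metrics_weighted_time_io_milliseconds[5m]) / 1000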

We highly recommend monitoring the following metrics:

  • Disk IOPS: The Disk IOPS threshold depends on the IOPS allocation provisioned for the cluster’s tier and storage capacity. Disk IOPS is the sum of the rates of hardware_disk_metrics_read_count and hardware_disk_metrics_write_count (see the IOPS expression above). Monitor whether disk IOPS approaches the provisioned maximum to determine whether the cluster can handle future workloads; see the example alert expressions after this list.
  • Normalized System CPU iowait: This metric indicates the percentage of time the CPU is idle while waiting for I/O operations to finish, scaled to a range of 0-100% by dividing by the number of CPU cores. It helps identify potential disk bottlenecks: the system may be reaching its aggregate disk throughput limit for the available capacity, in which case you might notice IOPS staying below their provisioned maximum while Normalized System CPU iowait is elevated, indicating I/O resource exhaustion.
  • Disk Queue Depth: The Disk Queue Depth metric represents the number of pending I/O operations in the disk queue, i.e. how many read and write requests are waiting to be processed by the underlying storage system. A high value can indicate that the storage system is struggling to keep up with the workload, potentially leading to performance issues. What constitutes a “high” value depends on your specific workload, hardware setup, and performance expectations; as a rule of thumb, a queue depth that consistently exceeds 2-4 times the number of CPU cores on the server suggests that pending I/O is accumulating faster than the storage system can process it.
  • Disk Latency: In addition to the two metrics above, we recommend creating an alert on Disk read latency on Data Partition and Disk write latency on Data Partition, with a threshold similar to the one you have defined for your operation execution time; suitable values depend on the cluster configuration and your specific workload. Note that acceptable disk latency varies with your application’s workload, the complexity of your queries, the read and write patterns, and your overall performance expectations.
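
The following PromQL sketches illustrate these recommendations; all thresholds (90% of 3000 provisioned IOPS, 10% iowait, a queue depth of 16, 50 ms latency) are placeholders that must be adapted to your cluster tier, storage capacity, and workload:

  • Disk IOPS near the provisioned maximum: rate(hardware_disk_metrics_read_count[5m]) + rate(hardware_disk_metrics_write_count[5m]) > 0.9 * 3000
  • Normalized System CPU iowait: rate(hardware_system_cpu_io_wait_milliseconds[5m]) / 10 / hardware_platform_num_logical_cpus > 10
  • Disk Queue Depth: rate(hardware_disk_metrics_weighted_time_io_milliseconds[5m]) / 1000 > 16
  • Disk read latency: rate(hardware_disk_metrics_read_time_milliseconds[5m]) / rate(hardware_disk_metrics_read_count[5m]) > 50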

We offer a template covering the most important metrics. You can download it and import it into STACKIT Observability: metric_exporter.json.