Observability metrics of MongoDB Flex
This article explains the meaning of the observability metrics exported by MongoDB Flex. All metrics are available in OpenMetrics format and can be scraped by time-series databases such as Prometheus, VictoriaMetrics, or Grafana Mimir.
Service Availability
mongodb_up
Indicates whether the last scrape was able to reach the database daemon and whether it reported a healthy state. (type: gauge)
| Label | Description | Values |
|---|---|---|
| node_type | Type of database node | 0 = Primary, 1 = Secondary, 2 = Arbiter |
This is the most important metric for alerting. When the value is 0, all other metrics are unavailable.
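For example, an availability alert can be built on the expression mongodb_up == 0, typically required to hold for a few minutes before firing; the exact duration depends on your alerting policy.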
mongodb_uptimeMillis
Server uptime in milliseconds (type: counter)
| Label | Description |
|---|---|
| rs_nm | Replica set name |
| rs_state | Replica set member state |
| node_type | Type of database node |
Can be used to calculate uptime percentages. A reset to 0 indicates a server restart.
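For example, a restart within the last hour can be detected with resets(mongodb_uptimeMillis[1h]) > 0.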
Connections
mongodb_connections
Number of connections to the server (type: gauge)
| Label | Description | Possible Values |
|---|---|---|
| conn_type | Type of connection | current, available |
| node_type | Type of database node | 0, 1, 2 |
A consistently high current connection count can indicate misconfigured connection pools or connection leaks.
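For example, assuming conn_type="available" reports the remaining free connections, connection saturation can be approximated with mongodb_connections{conn_type="current"} / ignoring(conn_type) sum without(conn_type) (mongodb_connections).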
Memory
mongodb_mem_resident
RAM used by the MongoDB process in megabytes (type: gauge)
| Label | Description |
|---|---|
| node_type | Type of database node |
mongodb_mem_virtual
Virtual memory used by the MongoDB process in megabytes (type: gauge)
| Label | Description |
|---|---|
| node_type | Type of database node |
Virtual memory is typically much larger than resident memory due to memory-mapped files.
Operations
mongodb_opcounters_total
Total operations by type since server startup (type: counter)
| Label | Description | Possible Values |
|---|---|---|
| type | Operation type | insert, query, update, delete, getmore, command |
| node_type | Type of database node | 0, 1, 2 |
Use rate() to calculate operations per second. High rates with high latency indicate performance issues.
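For example, operations per second broken down by type: sum by (type) (rate(mongodb_opcounters_total[5m])).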
Network
mongodb_network_bytes_total
Network traffic in bytes (type: counter)
| Label | Description | Possible Values |
|---|---|---|
| type | Traffic direction | in, out |
| node_type | Type of database node | 0, 1, 2 |
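Use rate() to calculate throughput, for example rate(mongodb_network_bytes_total{type="in"}[5m]) for inbound bytes per second.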
Hardware Metrics
The following metrics represent system-level resources. All include the label node_type.
hardware_system_cpu_io_wait_milliseconds
Time waiting for I/O operations (type: counter)
Calculate normalized CPU iowait: rate(hardware_system_cpu_io_wait_milliseconds[1m]) / 10 / hardware_platform_num_logical_cpus
hardware_platform_num_logical_cpus
Number of logical CPU cores (type: gauge)
Memory
hardware_system_memory_mem_available_kilobytes
Available memory in kilobytes (type: gauge)
More useful than mem_free as it includes reclaimable cache memory.
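For example, a low-memory alert can be built on a threshold such as hardware_system_memory_mem_available_kilobytes < 524288 (roughly 512 MiB; the threshold is illustrative and depends on your instance size).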
hardware_system_memory_cached_kilobytes
In-memory cache for disk files in kilobytes (type: gauge)
High values are normal and indicate efficient RAM usage.
hardware_disk_metrics_disk_space_free_bytes / hardware_disk_metrics_disk_space_used_bytes
Disk space in bytes (type: gauge)
| Label | Description |
|---|---|
| disk_name | Block device name |
| node_type | Type of database node |
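For example, the used fraction of a disk can be calculated as hardware_disk_metrics_disk_space_used_bytes / (hardware_disk_metrics_disk_space_used_bytes + hardware_disk_metrics_disk_space_free_bytes).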
hardware_disk_metrics_read_count / hardware_disk_metrics_write_count
I/O operations processed (type: counter)
| Label | Description |
|---|---|
| disk_name | Block device name |
| node_type | Type of database node |
Calculate IOPS: rate(hardware_disk_metrics_read_count[30s]) + rate(hardware_disk_metrics_write_count[30s])
hardware_disk_metrics_read_time_milliseconds / hardware_disk_metrics_write_time_milliseconds
Wait time for I/O requests in milliseconds (type: counter)
| Label | Description |
|---|---|
| disk_name | Block device name |
| node_type | Type of database node |
Calculate average read latency in milliseconds: rate(hardware_disk_metrics_read_time_milliseconds[5m]) / rate(hardware_disk_metrics_read_count[5m]) (use the write metrics analogously for write latency)
hardware_disk_metrics_weighted_time_io_milliseconds
Weighted time doing I/Os - indicates disk queue depth (type: counter)
| Label | Description |
|---|---|
| disk_name | Block device name |
| node_type | Type of database node |
High values suggest the storage system is struggling to keep up with I/O demand.
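For example, the average disk queue depth can be approximated with rate(hardware_disk_metrics_weighted_time_io_milliseconds[5m]) / 1000, analogous to the avgqu-sz column reported by iostat.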
Best Practices
We highly recommend monitoring the following metrics:
- Disk IOPS: The Disk IOPS threshold depends on the IOPS allocation provisioned for the cluster's tier and storage capacity. It is the sum of hardware_disk_metrics_read_count and hardware_disk_metrics_write_count. Monitor whether disk IOPS approaches the maximum provisioned IOPS and determine whether the cluster can handle future workloads.
- Normalized System CPU iowait: This metric indicates the percentage of time the CPU is idle while waiting for I/O (input/output) operations to finish, normalized to a range of 0-100% by dividing by the number of CPU cores. It helps identify potential disk bottlenecks. The system may be reaching its aggregate disk throughput limits based on the available capacity; in this scenario, you might notice IOPS not reaching their full capacity while Normalized System CPU iowait remains elevated, indicating I/O resource exhaustion.
- Disk Queue Depth: The Disk Queue Depth metric represents the count of pending I/O operations in the disk queue. It offers visibility into the volume of pending read and write operations awaiting processing by the underlying storage system. A high Disk Queue Depth value can indicate that the storage system is struggling to keep up with the workload, potentially leading to performance issues. What constitutes a "high" value depends on factors such as your specific workload, hardware setup, and performance expectations. Generally, if the Disk Queue Depth consistently exceeds 2-4 times the number of CPU cores on your server, it might suggest an underlying issue: an excess of pending I/O operations that the storage system cannot process quickly enough.
- Disk Latency: In addition to the metrics mentioned above, we recommend creating alerts on disk read latency and disk write latency on the data partition, with thresholds similar to the one you have defined for your operation execution time; the right value depends on the cluster configuration and your specific workload. Note that acceptable disk latency can differ significantly based on factors such as your application's workload, the complexity of your queries, the read and write patterns, and the overall performance expectations. An example alert expression is sketched after this list.
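As a rough sketch, a read-latency alert on the data partition could be based on an expression such as rate(hardware_disk_metrics_read_time_milliseconds{disk_name="<data-disk>"}[5m]) / rate(hardware_disk_metrics_read_count{disk_name="<data-disk>"}[5m]) > 20, where both the disk_name selector and the 20 ms threshold are placeholders to be adapted to your cluster and workload; a write-latency alert is built analogously from the write metrics.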
Resources
We offer a template of the most important metrics. You can download it and import it into STACKIT Observability: metric_exporter.json.