Skip to content

Observability metrics

This document describes the Prometheus metrics exported by PostgreSQL (via pg_exporter) for the Managed Database Service. All metrics include the label node_type="primary" since the service is configured with Patroni for high availability.

Database file size in bytes (Type: Gauge)

This metric shows the current size of each database on the server and is important for capacity planning and growth trends.

Number of locks by mode and database (Type: Gauge)

High lock counts can indicate contention and performance issues. Under normal conditions, most locks should be at 0.

Maximum number of concurrent connections to the database server (Type: Gauge)

The system uses up to 15 connections for internal essential processes such as backup, monitoring, etc. These connections are counted against the max_connections limit. When approaching this limit, connection pooling should be considered.

Number of connections grouped by various attributes (Type: Gauge)

idle in transaction for extended periods indicates application issues. Many idle connections show inefficient connection management.

Number of disk block accesses found already in the buffer cache (Type: Counter)

The cache hit ratio is calculated as blks_hit / (blks_hit + blks_read). The goal is a cache hit ratio of >99% for good performance.

Number of disk blocks that had to be read from disk (Type: Counter)

Low values relative to blks_hit are optimal. High values indicate insufficient shared buffers or cold cache.

Number of queries canceled due to conflicts with recovery (Type: Counter)

This metric is mainly relevant on standby/replica servers. A value of 0 is normal for primary nodes.

Total amount of data written to temporary files (in bytes) (Type: Counter)

High values indicate queries that require more RAM than work_mem allows. Frequent temp file usage should be addressed by increasing work_mem or optimizing queries.

Number of rows deleted by queries (Type: Counter)

Shows DELETE activity in the database. High delete rates require regular VACUUM operations.

Number of rows fetched by queries (Type: Counter)

Measures the number of actually returned rows. High values show intensive read activity.

Number of rows inserted by queries (Type: Counter)

Shows INSERT activity in the database.

Number of rows updated by queries (Type: Counter)

Updates create new row versions (MVCC), so VACUUM is important. High update rates can lead to table bloat.

Number of successfully committed transactions (Type: Counter)

Shows the number of successful transactions. High values indicate intensive database activity.

Indicates whether the last scrape successfully connected to the server (Type: Gauge)

This is the most important metric for alerting. When the value is 0, all other metrics are unavailable.

  • Monitor cache efficiency: The ratio blks_hit / (blks_hit + blks_read) should consistently be >99%.
  • Connection pooling: When approaching max_connections, a connection pool library/application (e.g., pgBouncer) should be deployed.
  • Temp file usage: Should remain at 0. If not, increase work_mem or optimize queries.
  • Lock monitoring: Regular locks can indicate design issues.
  • Idle in transaction: These connections should not persist longer than a few seconds.
  • All metrics are provided via pg_exporter and collected by Prometheus.
  • The service runs with Patroni for high availability, hence the label node_type="primary".
  • The listed metrics form the basis for monitoring and alerting. Additional metrics for replication, background writer, and WAL archiving may be available depending on the setup.