Observability metrics
This document describes the Prometheus metrics exported by PostgreSQL (via pg_exporter) for the Managed Database Service. All metrics include the label node_type="primary" since the service is configured with Patroni for high availability.
Database Size
Section titled “Database Size”pg_database_size_bytes
Section titled “pg_database_size_bytes”Database file size in bytes (Type: Gauge)
| Label | Description | Example Values |
|---|---|---|
datname | Name of the database | nextcloud, wordpress, crm |
node_type | Type of database node | primary |
This metric shows the current size of each database on the server and is important for capacity planning and growth trends.
pg_locks_count
Section titled “pg_locks_count”Number of locks by mode and database (Type: Gauge)
| Label | Description | Possible Values |
|---|---|---|
datname | Name of the database | nextcloud, wordpress, crm |
mode | Lock mode | See table below |
node_type | Type of database node | primary |
Lock Modes
Section titled “Lock Modes”| Lock Mode | Usage | Conflicts With |
|---|---|---|
AccessShareLock | SELECT queries | AccessExclusiveLock |
RowShareLock | SELECT FOR UPDATE/SHARE | Exclusive, ShareRowExclusive, AccessExclusive |
RowExclusiveLock | INSERT, UPDATE, DELETE | Share, ShareRowExclusive, Exclusive, AccessExclusive |
ShareUpdateExclusiveLock | VACUUM, CREATE INDEX CONCURRENTLY | ShareUpdateExclusive and higher |
ShareLock | CREATE INDEX | RowExclusive and higher |
ShareRowExclusiveLock | Rarely used | RowExclusive and higher |
ExclusiveLock | Blocks all concurrent access | RowExclusive and higher |
AccessExclusiveLock | ALTER TABLE, DROP TABLE, TRUNCATE | All modes |
High lock counts can indicate contention and performance issues. Under normal conditions, most locks should be at 0.
Connections
Section titled “Connections”pg_settings_max_connections
Section titled “pg_settings_max_connections”Maximum number of concurrent connections to the database server (Type: Gauge)
| Label | Description |
|---|---|
node_type | Type of database node |
The system uses up to 15 connections for internal essential processes such as backup, monitoring, etc. These connections are counted against the max_connections limit. When approaching this limit, connection pooling should be considered.
pg_stat_activity_count
Section titled “pg_stat_activity_count”Number of connections grouped by various attributes (Type: Gauge)
| Label | Description | Possible Values |
|---|---|---|
datname | Name of the database | nextcloud, wordpress, crm |
state | Connection state | See table below |
application_name | Name of client application | Patroni heartbeat, Patroni restapi, pg_exporter |
backend_type | Type of backend process | client backend |
usename | Database user | api, 2178329, christian.walther |
wait_event | Event being waited for | ClientRead |
wait_event_type | Type of wait event | Client |
node_type | Type of database node | primary |
Connection States
Section titled “Connection States”| State | Meaning |
|---|---|
active | Query is currently executing |
idle | Connection open but no query active |
idle in transaction | Connection in transaction but no current query |
idle in transaction (aborted) | Transaction failed, waiting for ROLLBACK |
fastpath function call | Client executing fast-path function |
disabled | Tracking disabled for this connection |
idle in transaction for extended periods indicates application issues. Many idle connections show inefficient connection management.
Database Statistics
Section titled “Database Statistics”pg_stat_database_blks_hit
Section titled “pg_stat_database_blks_hit”Number of disk block accesses found already in the buffer cache (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
The cache hit ratio is calculated as blks_hit / (blks_hit + blks_read). The goal is a cache hit ratio of >99% for good performance.
pg_stat_database_blks_read
Section titled “pg_stat_database_blks_read”Number of disk blocks that had to be read from disk (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Low values relative to blks_hit are optimal. High values indicate insufficient shared buffers or cold cache.
pg_stat_database_conflicts
Section titled “pg_stat_database_conflicts”Number of queries canceled due to conflicts with recovery (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
This metric is mainly relevant on standby/replica servers. A value of 0 is normal for primary nodes.
pg_stat_database_temp_bytes
Section titled “pg_stat_database_temp_bytes”Total amount of data written to temporary files (in bytes) (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
High values indicate queries that require more RAM than work_mem allows. Frequent temp file usage should be addressed by increasing work_mem or optimizing queries.
pg_stat_database_tup_deleted
Section titled “pg_stat_database_tup_deleted”Number of rows deleted by queries (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Shows DELETE activity in the database. High delete rates require regular VACUUM operations.
pg_stat_database_tup_fetched
Section titled “pg_stat_database_tup_fetched”Number of rows fetched by queries (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Measures the number of actually returned rows. High values show intensive read activity.
pg_stat_database_tup_inserted
Section titled “pg_stat_database_tup_inserted”Number of rows inserted by queries (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Shows INSERT activity in the database.
pg_stat_database_tup_updated
Section titled “pg_stat_database_tup_updated”Number of rows updated by queries (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Updates create new row versions (MVCC), so VACUUM is important. High update rates can lead to table bloat.
pg_stat_database_xact_commit
Section titled “pg_stat_database_xact_commit”Number of successfully committed transactions (Type: Counter)
| Label | Description |
|---|---|
datname | Name of the database |
node_type | Type of database node |
Shows the number of successful transactions. High values indicate intensive database activity.
Service Availability
Section titled “Service Availability”Indicates whether the last scrape successfully connected to the server (Type: Gauge)
| Label | Description | Values |
|---|---|---|
node_type | Type of database node | 1 = connection successful, 0 = connection failed |
This is the most important metric for alerting. When the value is 0, all other metrics are unavailable.
Recommended Alerts
Section titled “Recommended Alerts”| Alert | Condition | Severity |
|---|---|---|
| PostgreSQL Down | pg_up == 0 | Critical |
| Connection Limit | pg_stat_activity_count / pg_settings_max_connections > 0.8 | Warning |
| Connection Limit Critical | pg_stat_activity_count / pg_settings_max_connections > 0.9 | Critical |
| Low Cache Hit Ratio | rate(pg_stat_database_blks_hit) / (rate(pg_stat_database_blks_hit) + rate(pg_stat_database_blks_read)) < 0.95 | Warning |
| Temp File Usage | rate(pg_stat_database_temp_bytes[5m]) > 0 | Warning |
| Too Many Locks | pg_locks_count > 100 | Warning |
| Long Running Transactions | pg_stat_activity_count{state="idle in transaction"} > 0 for > 5min | Warning |
Monitoring Best Practices
Section titled “Monitoring Best Practices”- Monitor cache efficiency: The ratio
blks_hit / (blks_hit + blks_read)should consistently be >99%. - Connection pooling: When approaching
max_connections, a connection pool library/application (e.g.,pgBouncer) should be deployed. - Temp file usage: Should remain at 0. If not, increase
work_memor optimize queries. - Lock monitoring: Regular locks can indicate design issues.
- Idle in transaction: These connections should not persist longer than a few seconds.
- All metrics are provided via
pg_exporterand collected by Prometheus. - The service runs with Patroni for high availability, hence the label
node_type="primary". - The listed metrics form the basis for monitoring and alerting. Additional metrics for replication, background writer, and WAL archiving may be available depending on the setup.