log_postmaster_stats

Started by Jakub Wartak27 days ago3 messageshackers
Jump to latest
#1Jakub Wartak
jakub.wartak@enterprisedb.com

Hi -hackers,

We seem to have certain observability about postmaster
(pg_stat_database.{sessions,parallel_workers_launched}), but we do not have
pre-exisiting way to asses how much postmaster was really busy back in the
past. Even checkpointer (log_checkpoints) or startup recovery code is reporting
better what they were doing. One can say we have log_connections, yet bigger
shops cannot afford to log_connections all the time to count what happened
some time ago (and that can cumbersome anyway).

The attached patch introduces log_postmaster_stats in the same way we do have
log_startup_progress_interval, e.g. when set to 10 (seconds), it will show this
during artificial connection storm (log produced every 10s):

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s
LOG: postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s,
elapsed: 9.96 s
LOG: postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s,
elapsed: 16.25 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 13.82 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s

The interesting thing above is that the elapsed time is 6s (with the
setting at 10s), then one
can already tell there was a probem.

Known issues include connection storms, spotting low postmaster/fork()
efficency,
PQ workers causing startvation for new connections and so on. It is somehow
complementary to having those pg_stat_database counters mentioned at the
beggining. It is also complementary to the more recent log_connections with
=setup_durations, which logs timings, but not direct rate of forks()/second.

Another interesting thing above is that there can be discrepeancy
between user+system=~5s
against elapsed wall clock time=~10s above (it does not add up) and that's even
getrusage(RUSAGE_SELF and not RUSAGE_CHILDREN), but this comes apparently from
CPU scheduling at those kind of fork() rates. I was thinking about adding some
message like every now and then:
"WARNING: postmaster potentially overloaded, stats not gathered in time"
however lot of folks don't like those self diagnosis messages, so that's not in
v1 patch today.

I've thought it would be good idea to actually to enable it by default (@60s?),
but right now it is off to be aligned with others.

Any hints/reviews are welcome.

-J.

Attachments:

v1-0001-Add-log_postmaster_stats.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-log_postmaster_stats.patchDownload+151-4
#2Quan Zongliang
quanzongliang@yeah.net
In reply to: Jakub Wartak (#1)
Re: log_postmaster_stats

On 5/27/26 7:39 PM, Jakub Wartak wrote:

Hi -hackers,

We seem to have certain observability about postmaster
(pg_stat_database.{sessions,parallel_workers_launched}), but we do not have
pre-exisiting way to asses how much postmaster was really busy back in the
past. Even checkpointer (log_checkpoints) or startup recovery code is reporting
better what they were doing. One can say we have log_connections, yet bigger
shops cannot afford to log_connections all the time to count what happened
some time ago (and that can cumbersome anyway).

The attached patch introduces log_postmaster_stats in the same way we do have
log_startup_progress_interval, e.g. when set to 10 (seconds), it will show this
during artificial connection storm (log produced every 10s):

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s
LOG: postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s,
elapsed: 9.96 s
LOG: postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s,
elapsed: 16.25 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 13.82 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s

The interesting thing above is that the elapsed time is 6s (with the
setting at 10s), then one
can already tell there was a probem.

When the database is idle for a long time. Will keep outputting

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...

Could it be considered to reduce the output frequency when conn_delta,
disc_delta and pqw_delta are all zero?
Or until a new connection is established, then output the log for the
idle period at one time. Just like:

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 1hours 10min 32s
LOG: postmaster stats: avg 0.30 conns/sec; 0.20 disconns/sec; 1.10
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s

--
Quan Zongliang

#3Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Quan Zongliang (#2)
Re: log_postmaster_stats

On Wed, Jun 10, 2026 at 6:10 AM Quan Zongliang <quanzongliang@yeah.net> wrote:
[..]

When the database is idle for a long time. Will keep outputting

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: ...

Could it be considered to reduce the output frequency when conn_delta,
disc_delta and pqw_delta are all zero?
Or until a new connection is established, then output the log for the
idle period at one time. Just like:

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 1hours 10min 32s
LOG: postmaster stats: avg 0.30 conns/sec; 0.20 disconns/sec; 1.10
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s

Hi Quan,

thanks for trying out the patch! To be honest, I don't have opinion on
this, we could do that if there's community agreement to do this (or
some committer wants it this way or that way), but right now e.g.
checkpoint_timeout (which serves similiar purpose) does not behave
like that and always logs stuff and I was been told we should do
things consistently.

As for reducing logs volume, I think one could just use threshold
of 1 or 10 minutes (but while assessing some production issues one
could bump it down to every 1s temporarily just to have data to have
much better understandind what was system/postmaster doing).

-J.