pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Started by Andres Freund · about 6 years ago · 173 messages · pgsql-hackers
#1Andres Freund
andres@anarazel.de

Hi,

Currently pg_stat_bgwriter.buffers_backend is pretty useless for gauging
whether backends are doing writes they shouldn't do. That's because it
counts things that either cannot be, or are unlikely to be, done by other
parts of the system (checkpointer, bgwriter).

In particular, extending a file cannot currently be done by any
other type of process, yet is counted. When using a buffer access
strategy it is also very likely that writes have to be done by the
'dirtying' backend itself, as the buffer will be reused soon after (when
not previously in s_b, that is).

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.

I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat

Internally, in contrast to the SQL level, this could possibly just be
counter arrays indexed by backend type.

It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:

1) The backend flushes a buffer in bufmgr.c, which accounts for backend *write time*
2) mdwrite() writes the block and registers a sync request, which is forwarded to the checkpointer
3) ForwardSyncRequest(), when not called by the bgwriter, increments CheckpointerShmem->num_backend_writes
4) The checkpointer, whenever doing AbsorbSyncRequests(), moves
CheckpointerShmem->num_backend_writes to
BgWriterStats.m_buf_written_backend (local memory!)
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
pgstat (as the bgwriter also does)
6) pgstat then updates the shared memory used by the display functions

Worthwhile to note that backend buffer read/write *time* is accounted
differently. That's done via pgstat_send_tabstat().

I think there's very little excuse for the indirection via checkpointer;
besides being architecturally weird, it actually requires that we
continue to wake up the checkpointer over and over instead of optimizing
how and when we submit fsync requests.

As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/extend() aren't accounted anywhere as far as I can tell.

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

Greetings,

Andres Freund

#2Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#1)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

Currently pg_stat_bgwriter.buffers_backend is pretty useless for gauging
whether backends are doing writes they shouldn't do. That's because it
counts things that either cannot be, or are unlikely to be, done by other
parts of the system (checkpointer, bgwriter).
In particular, extending a file cannot currently be done by any
other type of process, yet is counted. When using a buffer access
strategy it is also very likely that writes have to be done by the
'dirtying' backend itself, as the buffer will be reused soon after (when
not previously in s_b, that is).

Yeah. That's quite annoying.

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.

I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat

Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.

Internally, in contrast to the SQL level, this could possibly just be
counter arrays indexed by backend type.

It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:

1) The backend flushes a buffer in bufmgr.c, which accounts for backend *write time*
2) mdwrite() writes the block and registers a sync request, which is forwarded to the checkpointer
3) ForwardSyncRequest(), when not called by the bgwriter, increments CheckpointerShmem->num_backend_writes
4) The checkpointer, whenever doing AbsorbSyncRequests(), moves
CheckpointerShmem->num_backend_writes to
BgWriterStats.m_buf_written_backend (local memory!)
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
pgstat (as the bgwriter also does)
6) pgstat then updates the shared memory used by the display functions

Worthwhile to note that backend buffer read/write *time* is accounted
differently. That's done via pgstat_send_tabstat().

I think there's very little excuse for the indirection via checkpointer;
besides being architecturally weird, it actually requires that we
continue to wake up the checkpointer over and over instead of optimizing
how and when we submit fsync requests.

As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/extend() aren't accounted anywhere as far as I can tell.

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

It's at least slightly weird :) Might it be worth counting truncate
events separately?

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#3Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#2)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.

I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat

Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.

Are you agreeing to
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
or are you suggesting further ones?

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

It's at least slightly weird :) Might it be worth counting truncate
events separately?

Is that really something interesting? Feels like it'd have to be done at
a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
same xact as creation) and VACUUM are quite different. I think it'd be
better to just not include it.

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Thanks,

Andres Freund

#4Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#3)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.

I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat

Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.

Are you agreeing to
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
or are you suggesting further ones?

The former.

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

It's at least slightly weird :) Might it be worth counting truncate
events separately?

Is that really something interesting? Feels like it'd have to be done at
a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
same xact as creation) and VACUUM are quite different. I think it'd be
better to just not include it.

Yeah, you're probably right. It certainly makes very little sense
where it is now.

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#5Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#4)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:

On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms. So I don't
think there's actually a danger of loss.

Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if it's on uncontended
cachelines. It'd allow us to reset the backend values more easily by
just swapping in a 0, which we can't do if the backend increments
non-atomically. But I think we could instead just have one global "bias"
value to implement resets (by subtracting that from the summarized
value, and storing the current sum when resetting). Or use the new
global barrier to trigger a reset. Or something similar.

Greetings,

Andres Freund

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#5)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hello.

At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in

Hi,

I feel the same about the specific issues brought up upthread.

On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:

On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms. So I don't
think there's actually a danger of loss.

Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if it's on uncontended
cachelines. It'd allow us to reset the backend values more easily by
just swapping in a 0, which we can't do if the backend increments
non-atomically. But I think we could instead just have one global "bias"
value to implement resets (by subtracting that from the summarized
value, and storing the current sum when resetting). Or use the new
global barrier to trigger a reset. Or something similar.

Fixed or global stats are a suitable starter for the shared-memory
stats collector. In the case of buffers_*_write, the global stats
entry for each process needs just 8 bytes, plus maybe an extra 8 bytes
for the bias value. I'm not sure how many counters like this there are,
but is such a footprint acceptable? (Each backend already uses
the same amount of local memory for pgstat use, though.)

Anyway, I will do something like that as a trial, maybe by adding a
member in PgBackendStatus and one global shared variable for the bias value.

int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ PgBackendStatsCounters counters;
} PgBackendStatus;

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Melanie Plageman
melanieplageman@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in

On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:

On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:

Lastly, I don't understand the point of sending fixed-size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database-, table-level etc.
stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms. So I don't
think there's actually a danger of loss.

Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if it's on uncontended
cachelines. It'd allow us to reset the backend values more easily by
just swapping in a 0, which we can't do if the backend increments
non-atomically. But I think we could instead just have one global "bias"
value to implement resets (by subtracting that from the summarized
value, and storing the current sum when resetting). Or use the new
global barrier to trigger a reset. Or something similar.

Fixed or global stats are a suitable starter for the shared-memory
stats collector. In the case of buffers_*_write, the global stats
entry for each process needs just 8 bytes, plus maybe an extra 8 bytes
for the bias value. I'm not sure how many counters like this there are,
but is such a footprint acceptable? (Each backend already uses
the same amount of local memory for pgstat use, though.)

Anyway, I will do something like that as a trial, maybe by adding a
member in PgBackendStatus and one global shared variable for the bias value.

int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ PgBackendStatsCounters counters;
} PgBackendStatus;

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try and
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4, however, I wanted to share this WIP so
that I could pose a few questions that I imagine will still be relevant
after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full

This view currently only tracks writes and extends that go through
shared buffers, and fsyncs of shared buffers (which, AFAIK, are the only
things fsync'd through the SyncRequest machinery currently).

BufferAlloc() and SyncOneBuffer() are the main points at which the
tracking is done. I can definitely expand this, but, I want to make sure
that we are tracking the right kind of information.

num_backend_writes and num_backend_fsync were intended (though they were
not accurate) to count buffers that backends had to end up writing
themselves and fsyncs that backends had to end up doing themselves which
could have been avoided with a different configuration (or, I suppose, a
different workload/different data, etc). That is, they were meant to
tell you if checkpointer and bgwriter were keeping up and/or if the
size of shared buffers was adequate.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is, does it make sense to track all extends -- those done
to extend the fsm and visimap, and those done when making a new relation
or index? Is that information useful? If so, is it different from the
extends done through shared buffers? Should it be tracked separately?

Also, if we care about all of the extends, then it seems a bit annoying
to pepper the counting all over the place when it really just needs to
be done where smgrextend() is called (even though maybe a stats function
doesn't belong in that API).

Another question I have is, should the number of extends be for every
single block extended or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?

When it comes to fsync counting, I only count the fsyncs counted by the
previous code, that is, fsyncs done by backends themselves when the
checkpointer sync request queue was full.
I did the counting in the same place in the checkpointer code -- in
ForwardSyncRequest() -- partially because there did not seem to be
another good place to do it. register_dirty_segment() returns void
(I thought about having it return a bool to indicate whether it fsync'd
or registered the fsync, which seemed alright, but mdextend(),
mdwrite(), etc. also return void), so there is no way to propagate the
information back up to the bufmgr that the process had to do its own
fsync without mucking with the md.c API. And,
since the checkpointer is the one processing these sync requests anyway,
it actually seems okay to do it in the checkpointer code.

I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc -- like those done
when building an index, creating a table/rewriting a table/copying a
table -- is it useful to count these? It seems like it makes the number
of "avoidable fsyncs by backends" less useful if we count the others.
Also, should we count how many fsyncs checkpointer has done (have to
check if there is already a stat for that)? Is that useful in this
context?

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- either avoidable entirely or avoidable for that
particular type of process -- to be useful.

Or maybe, it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- like maybe shared_blks_read counted across many statements for
a period of time/context in which we expected the relation in shared
buffers becomes potentially interesting.

As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).

As for the implementation of the counters themselves, I appreciate that
it isn't very nice to have a bunch of random members in PgBackendStatus
to count all of these write, extends, fsyncs. I considered if I could
add params that were used for all command types to st_progress_param but
I haven't looked into it yet. Alternatively, I could create an array
just for these kind of stats in PgBackendStatus. Though, I imagine that
I should take a look at the changes that have been made recently to this
area and at the shared memory stats patch.

Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index
creation. I, at least, would find it useful to see these numbers once
the database is at some kind of steady state.

Oh and src/test/regress/sql/stats.sql will fail and, of course, I don't
intend to add that SELECT from the view to regress, it was just for
testing purposes to make sure the view was working.

-- Melanie

Attachments:

v1-0001-Add-system-view-tracking-shared-buffers-written.patch (application/octet-stream, +453/-53)
#8Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#7)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:

So, I took a stab at implementing this in PgBackendStatus.

Cool!

The attached patch is not quite on top of current master, so, alas,
don't try and apply it. I went to rebase today and realized I needed
to make some changes in light of e1025044cd4, however, I wanted to
share this WIP so that I could pose a few questions that I imagine
will still be relevant after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full

I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...

I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is, does it make sense to track all extends -- those to
extend the fsm and visimap and when making a new relation or index? Is
that information useful? If so, is it different than the extends done
through shared buffers? Should it be tracked separately?

I don't fully understand what you mean with "extends done through shared
buffers"?

Another question I have is, should the number of extends be for every
single block extended or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?

I think it should be 8k blocks, i.e. RelationAddExtraBlocks() should be
tracked as many individual extends. It's implemented that way, but more
importantly, it should be in BLCKSZ units. If we later add some actually
batched operations, we can have separate stats for that.

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O", whether avoidable entirely or avoidable for that
particular type of process, to be useful.

I think it's fine to overlap with pg_statio_* - those are for individual
objects, so it seems to be expected to overlap with coarser stats.

Or maybe, it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- like maybe shared_blks_read counted across many statements for
a period of time/context in which we expected the relation in shared
buffers becomes potentially interesting.

Let's do something more basic first...

Greetings,

Andres Freund

#9Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#8)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Thu, Apr 15, 2021 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:

So, I took a stab at implementing this in PgBackendStatus.

Cool!

Just a note on v2 of the patch -- the diff for the changes I made to
pgstatfuncs.c is pretty atrocious and hard to read. I tried using a
different diff algorithm, to no avail.

The attached patch is not quite on top of current master, so, alas,
don't try and apply it. I went to rebase today and realized I needed
to make some changes in light of e1025044cd4, however, I wanted to
share this WIP so that I could pose a few questions that I imagine
will still be relevant after I rewrite the patch.

Regarding the refactor done in e1025044cd4:
Most of the functions I've added access variables in PgBackendStatus, so
I put most of them in backend_status.h/c. However, technically, these
are stats which are aggregated over time, which e1025044cd4 says should
go in pgstat.c/h. I could move some of it, but I hadn't tried to do so,
as it made a few things inconvenient, and, I wasn't sure if it was the
right thing to do anyway.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by the
checkpointer if the sync queue had not been full

I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...

I've gone ahead and moved buffers_alloc out of pg_stat_bgwriter and into
pg_stat_buffer_actions (I've renamed it from pg_stat_buffers_written).

I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?

I've changed the view to have one row for each backend type for which we
would like to report stats and one column for each buffer action type.

To make the code easier to write, I record buffer actions for all
backend types -- even if we don't have any buffer actions we care about
for that backend type. I thought it was okay because when I actually
aggregate the counters across backends, I only do so for the backend
types we care about -- thus there shouldn't be much accessing of shared
memory by multiple different processes.

Also, I copy-pasted most of the code in pg_stat_get_buffer_actions() to
set up the result tuplestore from pg_stat_get_activity() without totally
understanding all the parts of it, so I'm not sure if all of it is
required here.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is, does it make sense to track all extends -- those to
extend the fsm and visimap and when making a new relation or index? Is
that information useful? If so, is it different than the extends done
through shared buffers? Should it be tracked separately?

I don't fully understand what you mean with "extends done through shared
buffers"?

By "extends done through shared buffers", I just mean when an extend of
a relation is done and the data that will be written to the new block is
written into a shared buffer (as opposed to a local one or local memory
or a strategy buffer).

Random note:
I added a length member to the BackendType enum (BACKEND_NUM_TYPES),
which led to this compiler warning:

miscinit.c: In function ‘GetBackendTypeDesc’:
miscinit.c:236:2: warning: enumeration value ‘BACKEND_NUM_TYPES’ not
handled in switch [-Wswitch]
236 | switch (backendType)
| ^~~~~~

I tried using pg_attribute_unused() for BACKEND_NUM_TYPES, but it
didn't seem to have the desired effect. As such, I just threw a case
into GetBackendTypeDesc() which does nothing (as opposed to erroring
out); since backendDesc is already initialized to "unknown process
type", erroring out doesn't seem to be expected.

- Melanie

Attachments:

v2-0001-Add-system-view-tracking-shared-buffer-actions.patch (+341 −84)
#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Melanie Plageman (#7)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On 2021-Apr-12, Melanie Plageman wrote:

As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).

I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it. So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.

The main thing is to *not* let these writes end up in regular
buffers_backend (or whatever you call these now). I didn't read your
patch, but the way you have described it seems okay to me.

--
Álvaro Herrera 39°49'30"S 73°17'W

#11Melanie Plageman
melanieplageman@gmail.com
In reply to: Alvaro Herrera (#10)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Fri, Jun 4, 2021 at 5:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2021-Apr-12, Melanie Plageman wrote:

As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).

I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it. So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.

The main thing is to *not* let these writes end up in regular
buffers_backend (or whatever you call these now). I didn't read your
patch, but the way you have described it seems okay to me.

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

On another note, I've updated the patch with more correct concurrency
control mechanisms (it had some data races and other problems
before). Now, I am using atomics for the buffer action counters, though
the code includes several #TODO questions around the correctness of what
I have now too.

I also wrapped the buffer action types in a struct to make them easier
to work with.

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

It feels a bit weird to me to wedge the buffer actions stats into the
stats collector code--since the stats collector isn't receiving and
aggregating the buffer action stats.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

I am registering the patch for September commitfest but plan to update
the stats persistence before then (and docs, etc).

-- Melanie

Attachments:

v3-0001-Add-system-view-tracking-shared-buffer-actions.patch (+337 −84)
#12Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#11)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

What do you mean with "no clean shared buffers ... are available"?

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

Why would backends need to read that data back?

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.

@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)

LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request.  But before forcing
@@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
+		/* TODO: should we count fsyncs for all types of procs? */
if (!AmBackgroundWriterProcess())
-			CheckpointerShmem->num_backend_fsync++;
+			pgstat_increment_buffer_action(BA_Fsync);
+

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see a point of having a process-type
filter here. We just lose data...

/* don't set checksum for all-zero page */
@@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
+						/*
+						 * Unset the strat write flag, as we will not be writing
+						 * this particular buffer from our ring out and may end
+						 * up having to find a buffer from main shared buffers,
+						 * which, if it is dirty, we may have to write out, which
+						 * could have been prevented by checkpointing and background
+						 * writing
+						 */
+						StrategyUnChooseBufferFromRing(strategy);
+
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}

Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?

+
+					/*
+					 * TODO: there is certainly a better way to write this
+					 * logic
+					 */
+
+					/*
+					 * The dirty buffer that will be written out was selected
+					 * from the ring and we did not bother checking the
+					 * freelist or doing a clock sweep to look for a clean
+					 * buffer to use, thus, this write will be counted as a
+					 * strategy write -- one that may be unnecessary without a
+					 * strategy
+					 */
+					if (StrategyIsBufferFromRing(strategy))
+					{
+						pgstat_increment_buffer_action(BA_Write_Strat);
+					}
+
+						/*
+						 * If the dirty buffer was one we grabbed from the
+						 * freelist or through a clock sweep, it could have been
+						 * written out by bgwriter or checkpointer, thus, we will
+						 * count it as a regular write
+						 */
+					else
+						pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().

@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+	/*
+	 * TODO: consider that if we did not need to distinguish between a buffer
+	 * flushed that was grabbed from the ring buffer and written out as part
+	 * of a strategy which was not from main Shared Buffers (and thus
+	 * preventable by bgwriter or checkpointer), then we could move all calls
+	 * to pgstat_increment_buffer_action() here except for the one for
+	 * extends, which would remain in ReadBuffer_common() before smgrextend()
+	 * (unless we decide to start counting other extends). That includes the
+	 * call to count buffers written by bgwriter and checkpointer which go
+	 * through FlushBuffer() but not BufferAlloc(). That would make it
+	 * simpler. Perhaps instead we can find somewhere else to indicate that
+	 * the buffer is from the ring of buffers to reuse.
+	 */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?

@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption.  Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
-	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+	pgstat_increment_buffer_action(BA_Alloc);

/*
* First check, without acquiring the lock, whether there's buffers in the

@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
-	if (num_buf_alloc)
-	{
-		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-	}
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.

+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+	volatile PgBackendStatus *beentry   = MyBEEntry;
+
+	if (!beentry || !pgstat_track_activities)
+		return;
+
+	if (ba_type == BA_Alloc)
+		pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
+	else if (ba_type == BA_Extend)
+		pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
+	else if (ba_type == BA_Fsync)
+		pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
+	else if (ba_type == BA_Write)
+		pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
+	else if (ba_type == BA_Write_Strat)
+		pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
+}

I don't think we want to use atomic increments here - they're *slow*. And
there only ever can be a single writer to a backend's stats. So just doing
something like
pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.

+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+	volatile PgBackendBufferActionStats *ba_stats;
+	volatile	PgBackendStatus *beentry = MyBEEntry;
+
+	if (beentry->st_procpid != 0)
+		return;
+
+	// TODO: is this correct? could there be a data race? do I need a lock?
+	ba_stats = &BufferActionStatsArray[beentry->st_backendType];
+	pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
+	pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
+	pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
+	pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
+	pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
+}

I don't see a race, FWIW.

This is where I propose that we instead report the values up to the stats
collector, instead of having a separate array that we need to persist

+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
+{
+	int			i;
+	volatile PgBackendStatus *beentry;
+
+	/*
+	 * Add stats from all exited backends
+	 */
+	values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
+	values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
+	values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
+	values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
+	values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
+
+	/*
+	 * Loop through all live backends and count their buffer actions
+	 */
+	// TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
+
+	beentry = BackendStatusArray;
+	for (i = 1; i <= MaxBackends; i++)
+	{
+		/* Don't count dead backends. They should already be counted */
+		if (beentry->st_procpid == 0)
+			continue;
+		if (beentry->st_backendType != backend_type)
+			continue;
+
+		values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+		values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+		values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+		values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+		values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+		beentry++;
+	}
+}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.

+		/*
+		 * Currently, the only supported backend types for stats are the following.
+		 * If this were to change, pg_proc.dat would need to be changed as well
+		 * to reflect the new expected number of rows.
+		 */
+		Datum values[BUFFER_ACTION_NUM_TYPES];
+		bool nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)

Greetings,

Andres Freund

#13Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#12)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

What do you mean with "no clean shared buffers ... are available"?

I think I was talking about the scenario in which a backend using a
strategy does not find a clean buffer in the strategy ring and goes to
look in the freelist for a clean shared buffer and doesn't find one.

I was probably talking in circles up there. I think the current
patch counts the right writes in the right way, though.

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.

New version uses the stats collector.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

Why would backends need to read that data back?

To get totals across restarts, but, doesn't matter now that I am using
stats collector.

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)

LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

-     /* Count all backend writes regardless of if they fit in the queue */
-     if (!AmBackgroundWriterProcess())
-             CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request.  But before forcing
@@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
+             /* TODO: should we count fsyncs for all types of procs? */
if (!AmBackgroundWriterProcess())
-                     CheckpointerShmem->num_backend_fsync++;
+                     pgstat_increment_buffer_action(BA_Fsync);
+

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see a point of having a process-type
filter here. We just lose data...

Done

/* don't set checksum for all-zero page */
@@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
+                                             /*
+                                              * Unset the strat write flag, as we will not be writing
+                                              * this particular buffer from our ring out and may end
+                                              * up having to find a buffer from main shared buffers,
+                                              * which, if it is dirty, we may have to write out, which
+                                              * could have been prevented by checkpointing and background
+                                              * writing
+                                              */
+                                             StrategyUnChooseBufferFromRing(strategy);
+
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}

Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?

+
+                                     /*
+                                      * TODO: there is certainly a better way to write this
+                                      * logic
+                                      */
+
+                                     /*
+                                      * The dirty buffer that will be written out was selected
+                                      * from the ring and we did not bother checking the
+                                      * freelist or doing a clock sweep to look for a clean
+                                      * buffer to use, thus, this write will be counted as a
+                                      * strategy write -- one that may be unnecessary without a
+                                      * strategy
+                                      */
+                                     if (StrategyIsBufferFromRing(strategy))
+                                     {
+                                             pgstat_increment_buffer_action(BA_Write_Strat);
+                                     }
+
+                                             /*
+                                              * If the dirty buffer was one we grabbed from the
+                                              * freelist or through a clock sweep, it could have been
+                                              * written out by bgwriter or checkpointer, thus, we will
+                                              * count it as a regular write
+                                              */
+                                     else
+                                             pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().
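For concreteness, the out-parameter approach could look something like the following self-contained sketch. The names (strategy_get_buffer(), count_dirty_victim_write(), buffer_action_counts) are hypothetical stand-ins for the patch's StrategyGetBuffer() and pgstat_increment_buffer_action(), not committed PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the patch's BufferActionType counters. */
enum { BA_WRITE, BA_WRITE_STRAT, BA_NUM_TYPES };

static uint64_t buffer_action_counts[BA_NUM_TYPES];

/*
 * Stand-in for StrategyGetBuffer(): reports through *from_ring whether the
 * victim buffer came from the strategy ring, so the caller needs no second
 * call into freelist.c.  Here we simply pretend every other victim is a
 * ring buffer.
 */
static int
strategy_get_buffer(bool *from_ring)
{
    static int calls = 0;

    *from_ring = (calls % 2 == 0);
    return ++calls;             /* fake buffer id */
}

/* Caller attributes the write based on where the victim came from. */
static void
count_dirty_victim_write(bool from_ring)
{
    buffer_action_counts[from_ring ? BA_WRITE_STRAT : BA_WRITE]++;
}
```

The point is that the source of the buffer is decided once, inside the freelist code, and merely reported outward.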

I've addressed both of these in the new version.

@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+     /*
+      * TODO: consider that if we did not need to distinguish between a buffer
+      * flushed that was grabbed from the ring buffer and written out as part
+      * of a strategy which was not from main Shared Buffers (and thus
+      * preventable by bgwriter or checkpointer), then we could move all calls
+      * to pgstat_increment_buffer_action() here except for the one for
+      * extends, which would remain in ReadBuffer_common() before smgrextend()
+      * (unless we decide to start counting other extends). That includes the
+      * call to count buffers written by bgwriter and checkpointer which go
+      * through FlushBuffer() but not BufferAlloc(). That would make it
+      * simpler. Perhaps instead we can find somewhere else to indicate that
+      * the buffer is from the ring of buffers to reuse.
+      */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.

@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption.  Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
-     pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+     pgstat_increment_buffer_action(BA_Alloc);

/*
* First check, without acquiring the lock, whether there's buffers in the

@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
-     if (num_buf_alloc)
-     {
-             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-     }
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions(), but one that took a BufferActionType and a
BackendType as parameters and returned a single value: the count of that
buffer action for that backend type.

Let's say I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set num_buf_alloc:
compute val = pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc) - StrategyControl->numBufferAllocs, return val as num_buf_alloc,
and then add val to StrategyControl->numBufferAllocs.

I think that would have the same behavior as the current code, though I'm
not sure whether the performance would end up better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it would
do a few more atomic operations in StrategySyncStart() than before. Also,
we would do all the work done by pg_stat_get_buffer_actions() in
StrategySyncStart().

But that is called comparatively infrequently, right?
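To make the arithmetic above concrete, here is a toy model of the delta scheme. total_allocs() stands in for the proposed pg_stat_get_backend_buffer_actions_stats(..., BA_Alloc), and allocs_at_last_sync models the running total that would be kept in StrategyControl; none of this is committed PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>

/* Cumulative allocation count, as the stats machinery would report it. */
static uint64_t cumulative_allocs;

/* Models the running total kept in StrategyControl. */
static uint64_t allocs_at_last_sync;

static void
note_alloc(void)
{
    cumulative_allocs++;
}

static uint64_t
total_allocs(void)
{
    return cumulative_allocs;
}

/*
 * Models StrategySyncStart()'s *num_buf_alloc output: the number of
 * allocations since the previous call, derived from the cumulative stats
 * counter instead of an atomically swapped dedicated counter.
 */
static uint64_t
strategy_sync_start_delta(void)
{
    uint64_t total = total_allocs();
    uint64_t delta = total - allocs_at_last_sync;

    allocs_at_last_sync = total;
    return delta;
}
```

Each call hands back only the increment since the last call, which is exactly what the bgwriter pacing logic consumes today via numBufferAllocs.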

+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+     volatile PgBackendStatus *beentry   = MyBEEntry;
+
+     if (!beentry || !pgstat_track_activities)
+             return;
+
+     if (ba_type == BA_Alloc)
+             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
+     else if (ba_type == BA_Extend)
+             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
+     else if (ba_type == BA_Fsync)
+             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
+     else if (ba_type == BA_Write)
+             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
+     else if (ba_type == BA_Write_Strat)
+             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
+}

I don't think we want to use atomic increments here - they're *slow*. And
there can only ever be a single writer to a backend's stats. So just doing
something like
pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.
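As a minimal model of that suggestion (plain C stand-ins for pg_atomic_read_u64/pg_atomic_write_u64; the names here are illustrative, and the pattern is safe only because each backend is the sole writer of its own counters):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for pg_atomic_u64; real code would use the PostgreSQL type. */
typedef struct { uint64_t value; } model_atomic_u64;

static inline uint64_t
model_atomic_read_u64(const model_atomic_u64 *v)
{
    return v->value;
}

static inline void
model_atomic_write_u64(model_atomic_u64 *v, uint64_t n)
{
    v->value = n;
}

/*
 * The suggested pattern: an unlocked read-modify-write instead of an atomic
 * fetch-add.  Correct only under the invariant that exactly one process ever
 * writes this counter; concurrent readers may see a slightly stale value,
 * which is acceptable for statistics.
 */
static inline void
single_writer_increment(model_atomic_u64 *var)
{
    model_atomic_write_u64(var, model_atomic_read_u64(var) + 1);
}
```

The atomic type is still needed so that readers in other processes get tear-free 64-bit loads; only the expensive locked add is avoided.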

Done

+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+     volatile PgBackendBufferActionStats *ba_stats;
+     volatile        PgBackendStatus *beentry = MyBEEntry;
+
+     if (beentry->st_procpid != 0)
+             return;
+
+     // TODO: is this correct? could there be a data race? do I need a lock?
+     ba_stats = &BufferActionStatsArray[beentry->st_backendType];
+     pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
+     pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
+     pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
+     pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
+     pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
+}

I don't see a race, FWIW.

This is where I propose that we report the values up to the stats
collector instead of keeping a separate array that we need to persist

Changed

+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
+{
+     int                     i;
+     volatile PgBackendStatus *beentry;
+
+     /*
+      * Add stats from all exited backends
+      */
+     values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
+     values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
+     values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
+     values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
+     values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
+
+     /*
+      * Loop through all live backends and count their buffer actions
+      */
+     // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
+
+     beentry = BackendStatusArray;
+     for (i = 1; i <= MaxBackends; i++)
+     {
+             /* Don't count dead backends. They should already be counted */
+             if (beentry->st_procpid == 0)
+                     continue;
+             if (beentry->st_backendType != backend_type)
+                     continue;
+
+             values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+             values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+             values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+             values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+             values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+             beentry++;
+     }
+}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.

Changed.
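Summing for all backend types in one pass over the live entries might look like the following sketch (types and array sizes are illustrative, not PostgreSQL's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_BACKEND_TYPES 3     /* illustrative; PostgreSQL has more */
#define NUM_ACTIONS       5     /* alloc, extend, fsync, write, write_strat */

typedef struct
{
    int         backend_type;
    int         procpid;        /* 0 => slot unused or backend exited */
    uint64_t    actions[NUM_ACTIONS];
} ModelBackendEntry;

/*
 * One pass over the live entries, accumulating into a per-backend-type
 * matrix pre-seeded with the totals of already-exited backends.
 */
static void
sum_all_buffer_actions(const ModelBackendEntry *entries, int nentries,
                       uint64_t exited[][NUM_ACTIONS],
                       uint64_t out[][NUM_ACTIONS])
{
    for (int t = 0; t < NUM_BACKEND_TYPES; t++)
        memcpy(out[t], exited[t], sizeof(uint64_t) * NUM_ACTIONS);

    for (int i = 0; i < nentries; i++)
    {
        if (entries[i].procpid == 0)    /* dead: already counted in exited */
            continue;
        for (int a = 0; a < NUM_ACTIONS; a++)
            out[entries[i].backend_type][a] += entries[i].actions[a];
    }
}
```

One walk of the backend status array then fills every row of the view, instead of rescanning the array once per backend type.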

+             /*
+              * Currently, the only supported backend types for stats are the following.
+              * If this were to change, pg_proc.dat would need to be changed as well
+              * to reflect the new expected number of rows.
+              */
+             Datum values[BUFFER_ACTION_NUM_TYPES];
+             bool nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)

I just went ahead and made a row for each backend type.

- Melanie

Attachments:

v4-0001-Add-system-view-tracking-shared-buffer-actions.patch (+297 -83)
#14 Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#13)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Wed, Aug 11, 2021 at 4:11 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+     /*
+      * TODO: consider that if we did not need to distinguish between a buffer
+      * flushed that was grabbed from the ring buffer and written out as part
+      * of a strategy which was not from main Shared Buffers (and thus
+      * preventable by bgwriter or checkpointer), then we could move all calls
+      * to pgstat_increment_buffer_action() here except for the one for
+      * extends, which would remain in ReadBuffer_common() before smgrextend()
+      * (unless we decide to start counting other extends). That includes the
+      * call to count buffers written by bgwriter and checkpointer which go
+      * through FlushBuffer() but not BufferAlloc(). That would make it
+      * simpler. Perhaps instead we can find somewhere else to indicate that
+      * the buffer is from the ring of buffers to reuse.
+      */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.

Actually, after moving the code around like you suggested, calling
pgstat_increment_buffer_action() before smgrwrite() in FlushBuffer() and
using a parameter to indicate if it is a strategy write or not would
only save us one other call to pgstat_increment_buffer_action() -- the
one in SyncOneBuffer(). We would end up moving the one in BufferAlloc()
to FlushBuffer() and removing the one in SyncOneBuffer().
Do you think it is still worth it?

Rebased v5 attached.

Attachments:

v5-0001-Add-system-view-tracking-shared-buffer-actions.patch (+293 -83)
#15 Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#13)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

pgstat_beshutdown_hook()

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But that
it should be a separate patch / commit...

@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
-     if (num_buf_alloc)
-     {
-             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-     }
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions(), but one that took a BufferActionType and a
BackendType as parameters and returned a single value: the count of that
buffer action for that backend type.

let's say I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set num_buf_alloc:
compute val = pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc) - StrategyControl->numBufferAllocs, return val as num_buf_alloc,
and then add val to StrategyControl->numBufferAllocs.

I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.

I think that would have the same behavior as the current code, though I'm
not sure whether the performance would end up better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it would
do a few more atomic operations in StrategySyncStart() than before. Also,
we would do all the work done by pg_stat_get_buffer_actions() in
StrategySyncStart().

I think it'd be better to separate changing the bgwriter pacing logic
(and thus numBufferAllocs) from changing the stats reporting.

But that is called comparatively infrequently, right?

Depending on the workload, not that rarely. I'm afraid this might be a
bit too expensive. It's possible we can work around that, however.

Greetings,

Andres Freund

#16 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#15)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But that
it should be a separate patch / commit...

I separated the removal of some redundant stats from pg_stat_bgwriter
into a different commit but haven't removed or clarified any additional
columns in pg_stat_bgwriter.

@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
-     if (num_buf_alloc)
-     {
-             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-     }
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions(), but one that took a BufferActionType and a
BackendType as parameters and returned a single value: the count of that
buffer action for that backend type.

let's say I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set num_buf_alloc:
compute val = pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc) - StrategyControl->numBufferAllocs, return val as num_buf_alloc,
and then add val to StrategyControl->numBufferAllocs.

I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.

I think that would have the same behavior as the current code, though I'm
not sure whether the performance would end up better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it would
do a few more atomic operations in StrategySyncStart() than before. Also,
we would do all the work done by pg_stat_get_buffer_actions() in
StrategySyncStart().

I think it'd be better to separate changing the bgwriter pacing logic
(and thus numBufferAllocs) from changing the stats reporting.

But that is called comparatively infrequently, right?

Depending on the workload, not that rarely. I'm afraid this might be a
bit too expensive. It's possible we can work around that, however.

I've restored StrategyControl->numBuffersAlloc.

Attached is v6 of the patchset.

I have made several small updates to the patch, including user docs
updates, comment clarifications, various changes related to how
structures are initialized, code simplifications, and small details like
alphabetizing of #includes.

Below are details on the remaining TODOs and open questions for this
patch and why I haven't done them yet:

1) performance testing (initial tests done, but need to do some further
investigation before sharing)

2) stats_reset
Because pg_stat_buffer_actions fields were added to the globalStats
structure, they get reset when the target RESET_BGWRITER is reset.
Depending on whether or not these commits remove columns from the
pg_stat_bgwriter view, I would approach adding stats_reset to
pg_stat_buffer_actions differently. If removing all of pg_stat_bgwriter,
I would just rename the target to apply to pg_stat_buffer_actions. If
not removing all of pg_stat_bgwriter, I would add a new target for
pg_stat_buffer_actions to reset those stats and then either remove them
from globalStats or MemSet() only the relevant parts of the struct in
pgstat_recv_resetsharedcounter().
I haven't done this yet because I want to get input on what should
happen to pg_stat_bgwriter first (all of it goes, all of it stays, some
goes, etc).

3) what to count
Currently, the patch counts allocs, extends, fsyncs and writes of shared
buffers and writes done when using a buffer access strategy. So, it is a
mix of mostly shared buffers and a few non-shared buffers. I am
wondering if it makes sense to also count extends with smgrextend()
other than those using shared buffers--for example when building an
index or when extending the free space map or visibility map. For
fsyncs, the patch does not count checkpointer fsyncs or fsyncs done from
XLogWrite().
On a related note, depending on what the view counts, the name
buffer_actions may or may not be too general.

I also feel like the BackendType B_BACKEND is a bit confusing when we
are tracking buffer actions for different backend types -- this name
makes it seem like other types of backends are not backends.

I'm not sure what the view should track and can see arguments for
excluding certain extends or separating them into another stat. I
haven't made the changes because I am looking for other peoples'
opinions.

4) Adding some sort of protection against regressions when code is added
that adds additional buffer actions but doesn't count them -- more
likely if we are counting all users of smgrextend() but not doing the
counter incrementing there.

I'm not sure how I would even do this, so, that's why I haven't done it.

5) It seems like the code to create a tuplestore used by various stats
functions like pg_stat_get_progress_info(), pg_stat_get_activity, and
pg_stat_get_slru could be refactored into a helper function since it is
quite redundant (maybe returning a ReturnSetInfo).

I haven't done this because I wasn't sure if it was a good idea, and, if
it is, if I should do it in a separate commit.

6) Cleaning up of commit message, running pgindent, and, eventually,
catalog bump (waiting until the patch is done to do this).

7) Additional testing to ensure all codepaths added are hit (one-off
testing, not added to regression test suite). I am waiting to do this
until all of the types of buffer actions that will be done are
finalized.

- Melanie

Attachments:

v6-0002-Remove-superfluous-bgwriter-stats.patch (+0 -104)
v6-0001-Add-system-view-tracking-shared-buffer-actions.patch (+425 -20)
#17 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#15)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

pgstat_beshutdown_hook()

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.

So, I realized that the patch has a problem. I added the code to send
buffer actions stats to the stats collector
(pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't
getting called when all types of backends exit.

I originally thought to add pgstat_send_buffer_actions() to
pgstat_beshutdown_hook() (as suggested), but, this is called after
pgstat_shutdown_hook(), so, we aren't able to send stats to the stats
collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown
to true and then in pgstat_beshutdown_hook() (called after), if we call
pgstat_send_buffer_actions(), it calls pgstat_send() which calls
pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.)

After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it
seems to miss checkpointer stats entirely. I did find that if I
sprinkled pgstat_send_buffer_actions() around in the various places that
pgstat_send_checkpointer() is called, I could get checkpointer stats
(see attached patch, capture_checkpointer_buffer_actions.patch), but,
that seems a little bit haphazard since pgstat_send_buffer_actions() is
supposed to capture stats for all backend types. Is there somewhere else
I can call it that is exercised by all backend types before
pgstat_shutdown_hook() is called but after they would have finished any
relevant buffer actions?

- Melanie

Attachments:

capture_checkpointer_buffer_actions.patch (+5 -4)
#18 Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#17)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

On Wed, Sep 8, 2021 at 9:28 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, I realized that the patch has a problem. I added the code to send
buffer actions stats to the stats collector
(pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't
getting called when all types of backends exit.

I originally thought to add pgstat_send_buffer_actions() to
pgstat_beshutdown_hook() (as suggested), but, this is called after
pgstat_shutdown_hook(), so, we aren't able to send stats to the stats
collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown
to true and then in pgstat_beshutdown_hook() (called after), if we call
pgstat_send_buffer_actions(), it calls pgstat_send() which calls
pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.)

After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it
seems to miss checkpointer stats entirely. I did find that if I
sprinkled pgstat_send_buffer_actions() around in the various places that
pgstat_send_checkpointer() is called, I could get checkpointer stats
(see attached patch, capture_checkpointer_buffer_actions.patch), but,
that seems a little bit haphazard since pgstat_send_buffer_actions() is
supposed to capture stats for all backend types. Is there somewhere else
I can call it that is exercised by all backend types before
pgstat_shutdown_hook() is called but after they would have finished any
relevant buffer actions?

I realized that putting these additional calls in checkpointer code
without clearing out the PgBackendStatus counters for buffer actions
results in a lot of duplicate stats. I do wonder, however, whether
pgstat_send_buffer_actions() is needed in HandleCheckpointerInterrupts()
before the proc_exit().

It does seem like additional calls to pgstat_send_buffer_actions()
shouldn't be needed since most processes register
pgstat_shutdown_hook(). However, since MyDatabaseId isn't valid for the
auxiliary processes, even though the pgstat_shutdown_hook() is
registered from BaseInit(), pgstat_report_stat() never gets called for
them, so their stats aren't persisted using the current method.

It seems like the best solution to persisting all processes' stats would
be to have all processes register pgstat_shutdown_hook() and to still
call pgstat_report_stat() even when MyDatabaseId is not valid, as long as
the process is not a regular backend (I assume an invalid MyDatabaseId is
only a problem for backends that have had it set to a valid OID at some
point). For the stats that rely on the database OID, perhaps those can be
reported based on whether or not MyDatabaseId is valid from within
pgstat_report_stat().
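A toy model of that reporting scheme, where each process flushes its local counters into the collector's totals from its shutdown hook (names here are illustrative stand-ins for pgstat_beshutdown_hook() and the collector, not PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ACTIONS 5

/* Stands in for the totals kept by the stats collector. */
static uint64_t collector_totals[NUM_ACTIONS];

typedef struct
{
    uint64_t    local[NUM_ACTIONS];     /* per-process counters */
} ModelProcess;

static void
model_count(ModelProcess *proc, int action)
{
    proc->local[action]++;
}

/*
 * Models a pgstat_beshutdown_hook()-style callback: report this process's
 * counters to the collector at exit.  If every process, including those
 * with an invalid MyDatabaseId, registers this hook, the collector already
 * has everything by the time the server itself shuts down.
 */
static void
model_shutdown_hook(ModelProcess *proc)
{
    for (int a = 0; a < NUM_ACTIONS; a++)
    {
        collector_totals[a] += proc->local[a];
        proc->local[a] = 0;     /* guard against double reporting */
    }
}
```

Zeroing the local counters after reporting is what avoids the duplicate-stats problem described above when the hook can fire more than once.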

I also realized that I am not collecting stats from live auxiliary
processes in pg_stat_get_buffer_actions(). I need to change the loop to
for (i = 1; i <= MaxBackends + NUM_AUXPROCTYPES; i++) to actually get
stats from live auxiliary processes when querying the view.

On an unrelated note, I am planning to remove buffers_clean and
buffers_checkpoint from the pg_stat_bgwriter view since those are also
redundant. When I was removing them, I noticed that buffers_checkpoint
and buffers_clean count buffers as having been written even when
FlushBuffer() "does nothing" because someone else wrote out the dirty
buffer before the bgwriter or checkpointer had a chance to do it. This
seems like it would result in an incorrect count. Am I missing
something?

- Melanie

#19 Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#18)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hi,

I've attached the v7 patch set.

Changes from v6:
- removed unnecessary global variable BufferActionsStats
- fixed the loop condition in pg_stat_get_buffer_actions()
- updated some comments
- removed buffers_checkpoint and buffers_clean from pg_stat_bgwriter
view (now pg_stat_bgwriter view is mainly checkpointer statistics,
which isn't great)
- instead of calling pgstat_send_buffer_actions() in
pgstat_report_stat(), I renamed pgstat_send_buffer_actions() to
pgstat_report_buffers() and call it directly from
pgstat_shutdown_hook() for all types of processes (including processes
with invalid MyDatabaseId [like auxiliary processes])

I began changing the code to add the stats reset timestamp to the
pg_stat_buffer_actions view, but, I realized that it will be kind of
distracting to have every row for every backend type have a stats reset
timestamp (since it will be the same timestamp over and over). If,
however, you could reset buffer stats for each backend type
individually, then, I could see having it. Otherwise, we could add a
function like pg_stat_get_stats_reset_time(viewname) where viewname
would be pg_stat_buffer_actions in our case. Though, maybe that is
annoying and not very usable--I'm not sure.

I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).

This naming convention (BufferType_BufferActionType) made me think that
it might make sense to have two enumerations: one being the current
BufferActionType (which could also be called BufferAccessType though
that might get confusing with BufferAccessStrategyType and buffer access
strategies in general) and the other being BufferType (which would be
one of shared, local, index, etc).

I attached a patch with the outline of this idea
(buffer_type_enum_addition.patch). It doesn't work because
pg_stat_get_buffer_actions() uses the BufferActionType as an index into
the values array returned. If I wanted to use a combination of the two
enums as an indexing mechanism (BufferActionType and BufferType), we
would end up with a tuple having every combination of the two
enums--some of which aren't valid. It might not make sense to implement
this. I do think it is useful to think of these stats as a combination
of a buffer action and a type of buffer.

- Melanie

Attachments:

v7-0001-Add-system-view-tracking-shared-buffer-actions.patch (text/x-patch, +426 -21)
v7-0002-Remove-superfluous-bgwriter-stats.patch (text/x-patch, +0 -157)
v7-0003-Rename-pg_stat_buffer_actions-to-pg_stat_buffers.patch (text/x-patch, +33 -34)
buffer_type_enum_addition.patch (text/x-patch, +26 -18)
#20 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Melanie Plageman (#19)
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Hello Melanie

On 2021-Sep-13, Melanie Plageman wrote:

> I also think it makes sense to rename the pg_stat_buffer_actions view to
> pg_stat_buffers and to name the columns using both the buffer action
> type and buffer type -- e.g. shared, strategy, local. This leaves open
> the possibility of counting buffer actions done on other non-shared
> buffers -- like those done while building indexes or those using local
> buffers. The third patch in the set does this (I wanted to see if it
> made sense before fixing it up into the first patch in the set).

What do you think of the idea of having the "shared/strategy/local"
attribute be a column? So you'd have up to three rows per buffer action
type. Users wishing to see an aggregate can just aggregate them, just
like they'd do with pg_buffercache. I think that leads to an easy
decision with regards to this point:

> I attached a patch with the outline of this idea
> (buffer_type_enum_addition.patch). It doesn't work because
> pg_stat_get_buffer_actions() uses the BufferActionType as an index into
> the values array returned. If I wanted to use a combination of the two
> enums as an indexing mechanism (BufferActionType and BufferType), we
> would end up with a tuple having every combination of the two
> enums--some of which aren't valid. It might not make sense to implement
> this. I do think it is useful to think of these stats as a combination
> of a buffer action and a type of buffer.

Does that seem sensible?

(It's weird to have enum values that are there just to indicate what's
the maximum value. I think that sort of thing is better done by having
a "#define LAST_THING" that takes the last valid value from the enum.
That would free you from having to handle the last value in switch
blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.)

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense." (Paul Thomas)
