PoC: history of recent vacuum/checkpoint runs (using new hooks)

Started by Tomas Vondraabout 1 year ago21 messages

tomas.vondra@2ndquadrant.com

about 1 year ago

Hi,

Our runtime stats system is great, but it only keeps a snapshot of
cumulative stats / snapshot. And while that works for user workloads, it
may not be quite sufficient when analyzing maintenance operations like
vacuum/checkpoint, etc.

For those operations it's useful to have information about individual
runs, not just the cumulative counters (or even deltas between regular
snapshots). There's also the issue that we only keep a very limited
subset of available information - just look at the info included in
VACUUM VERBOSE or with log_checkpoints=on, and how little of that is
available in pg_stats_. For vacuum we have the vacuum/analyze counts,
and timestamp of the last operation, but that's it. VACUUM VERBOSE
provides way more information, but we can only guess based on the
regular pgstat counters.

Yes, we can get this info written to server log using log_checkpoints
and log_autovacuum_min_duration (AFAIK there's no way to ensure logging
for manual VACUUM). But processing this information is not particularly
convenient, as it requires parsing the log, the message format is
suitable more for humans, etc. And it'd be very convenient to be able to
query this information, just like the other pgstat catalogs.

I wonder if we might/should do two things, to improve this:

1) Introduce hooks that allow doing some custom stuff with info about
those actions, after logging it. The attached 0001 and 0002 patches do
this for vacuum and checkpoint.

2) Allow keeping information about events. The 0003 patch does that in
an extension, leveraging the new hooks, but it'd certainly possible to
do in core too.

I realize our current pgstat collector is keeping per-object stats, not
per-event. We might add this to per-object stats (e.g. each table would
have stats about vacuum runs), but that wouldn't work for checkpoints.
There's also the question of memory consumption - I'm sure we don't want
to keep infinite history of vacuum runs, for example.

So the extension simply maintains a fixed-size circular queue, i.e. when
it gets full it starts evicting oldest entries. 1MB is enough for
storing ~4k VACUUM runs - I'm sure it can be made more compact.

I don't think there's a better way to do this. I've considered if this
might be done using emit_log_hook, but (a) that only helps when we end
up logging the event (and I'd like to do this always), and (b) it'd
require parsing the server log. So it's not much better than just doing
that, I think ...

Opinions?

--
Tomas Vondra

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#1)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

Hi Tomas
     This is a great feature.
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to store more history record ?

Thanks

On Sun, Dec 22, 2024 at 4:28 AM Tomas Vondra <tomas@vondra.me> wrote:

Show quoted text

Hi,

Our runtime stats system is great, but it only keeps a snapshot of
cumulative stats / snapshot. And while that works for user workloads, it
may not be quite sufficient when analyzing maintenance operations like
vacuum/checkpoint, etc.

For those operations it's useful to have information about individual
runs, not just the cumulative counters (or even deltas between regular
snapshots). There's also the issue that we only keep a very limited
subset of available information - just look at the info included in
VACUUM VERBOSE or with log_checkpoints=on, and how little of that is
available in pg_stats_. For vacuum we have the vacuum/analyze counts,
and timestamp of the last operation, but that's it. VACUUM VERBOSE
provides way more information, but we can only guess based on the
regular pgstat counters.

Yes, we can get this info written to server log using log_checkpoints
and log_autovacuum_min_duration (AFAIK there's no way to ensure logging
for manual VACUUM). But processing this information is not particularly
convenient, as it requires parsing the log, the message format is
suitable more for humans, etc. And it'd be very convenient to be able to
query this information, just like the other pgstat catalogs.

I wonder if we might/should do two things, to improve this:

1) Introduce hooks that allow doing some custom stuff with info about
those actions, after logging it. The attached 0001 and 0002 patches do
this for vacuum and checkpoint.

2) Allow keeping information about events. The 0003 patch does that in
an extension, leveraging the new hooks, but it'd certainly possible to
do in core too.

I realize our current pgstat collector is keeping per-object stats, not
per-event. We might add this to per-object stats (e.g. each table would
have stats about vacuum runs), but that wouldn't work for checkpoints.
There's also the question of memory consumption - I'm sure we don't want
to keep infinite history of vacuum runs, for example.

So the extension simply maintains a fixed-size circular queue, i.e. when
it gets full it starts evicting oldest entries. 1MB is enough for
storing ~4k VACUUM runs - I'm sure it can be made more compact.

I don't think there's a better way to do this. I've considered if this
might be done using emit_log_hook, but (a) that only helps when we end
up logging the event (and I'd like to do this always), and (b) it'd
require parsing the server log. So it's not much better than just doing
that, I think ...

Opinions?

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: wenhui qiu (#2)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

Hi,

On 12/23/24 07:35, wenhui qiu wrote:

Hi Tomas 
     This is a great feature.  
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to store more history record ?

Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

regards

--
Tomas Vondra

Robert Treat

xzilla@users.sourceforge.net

about 1 year ago

In reply to: Tomas Vondra (#3)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Wed, Dec 25, 2024 at 12:26 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

On 12/23/24 07:35, wenhui qiu wrote:
Hi Tomas
This is a great feature.
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to store more history record ?
Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.

Based on the above, a rough calculation is that this is enough for
holding 1 year of hourly vacuum runs for 50 tables, or a year of daily
vacuums for 1000 tables. Most folks will fall somewhere in that range
(and won't really need a year's history) but that seems like plenty
for a default.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

Yeah, workloads will vary, but it doesn't seem like they would more
than query workloads do.

Robert Treat
https://xzilla.net

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Robert Treat (#4)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

As far as I know, more than 10,000 tables of instances are often
encountered,
So I insist that the maximum can be appropriately increased to 256MB，Can be
more adaptable to many table situations

On Thu, 26 Dec 2024 at 23:23, Robert Treat <rob@xzilla.net> wrote:

Show quoted text

On Wed, Dec 25, 2024 at 12:26 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,

On 12/23/24 07:35, wenhui qiu wrote:
Hi Tomas
This is a great feature.
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to store more history record ?
Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.
Based on the above, a rough calculation is that this is enough for
holding 1 year of hourly vacuum runs for 50 tables, or a year of daily
vacuums for 1000 tables. Most folks will fall somewhere in that range
(and won't really need a year's history) but that seems like plenty
for a default.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

Yeah, workloads will vary, but it doesn't seem like they would more
than query workloads do.

Robert Treat
https://xzilla.net

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: wenhui qiu (#5)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 12/26/24 17:00, wenhui qiu wrote:

Hi

As far as I know, more than 10,000 tables of instances are often
encountered,
So I insist that the maximum can be appropriately increased to 256MB，
Can be more adaptable to many table situations

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).

regards

--
Tomas Vondra

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#6)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

Hi Tomas

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

agree

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).

yes, I mean by this is that the maximum value is not friendly to large
instances if the setting is small ，In the real production instance , many
sub-tables with the same table structure are often encountered

On Fri, Dec 27, 2024 at 1:58 AM Tomas Vondra <tomas@vondra.me> wrote:

Show quoted text

On 12/26/24 17:00, wenhui qiu wrote:

Hi

As far as I know, more than 10,000 tables of instances are often
encountered,
So I insist that the maximum can be appropriately increased to 256MB，
Can be more adaptable to many table situations

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).

regards

--
Tomas Vondra

Michael Paquier

michael@paquier.xyz

about 1 year ago

In reply to: Tomas Vondra (#6)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).

+    DefineCustomIntVariable("stats_history.size",
+                            "Sets the amount of memory available for past events.",

How about some time-based retention? Data size can be hard to think
about for the end user, while there would be a set of users that would
want to retain data for the past week, month, etc? If both size and
time upper-bound are define, then entries that match one condition or
the other are removed.

+        checkpoint_log_hook(
+          CheckpointStats.ckpt_start_t,            /* start_time */
+          CheckpointStats.ckpt_end_t,                /* end_time */
+          (flags & CHECKPOINT_IS_SHUTDOWN),        /* is_shutdown */
+          (flags & CHECKPOINT_END_OF_RECOVERY),    /* is_end_of_recovery */
+          (flags & CHECKPOINT_IMMEDIATE),            /* is_immediate */
+          (flags & CHECKPOINT_FORCE),                /* is_force */
+          (flags & CHECKPOINT_WAIT),                /* is_wait */
+          (flags & CHECKPOINT_CAUSE_XLOG),        /* is_wal */
+          (flags & CHECKPOINT_CAUSE_TIME),        /* is_time */
+          (flags & CHECKPOINT_FLUSH_ALL),            /* is_flush_all */
+          CheckpointStats.ckpt_bufs_written,        /* buffers_written */
+          CheckpointStats.ckpt_slru_written,        /* slru_written */
+          CheckpointStats.ckpt_segs_added,        /* segs_added */
+          CheckpointStats.ckpt_segs_removed,        /* segs_removed */
+          CheckpointStats.ckpt_segs_recycled,        /* segs_recycled */

That's a lot of arguments. CheckpointStatsData and the various
CHECKPOINT_* flags are exposed, why not just send these values to the
hook?

For v1-0001 as well, I'd suggest some grouping with existing
structures, or expose these structures so as they can be reused for
out-of-core code via the proposed hook. More arguments lead to more
mistakes that could be easily avoided.
--
Michael

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Michael Paquier (#8)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 12/27/24 05:00, Michael Paquier wrote:

On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).
+    DefineCustomIntVariable("stats_history.size",
+                            "Sets the amount of memory available for past events.",
How about some time-based retention? Data size can be hard to think
about for the end user, while there would be a set of users that would
want to retain data for the past week, month, etc? If both size and
time upper-bound are define, then entries that match one condition or
the other are removed.

Right. In my response [1]/messages/by-id/8df7cee1-31aa-4db3-bbb7-83157ca139da@vondra.me I suggested replacing the simple memory limit
with a time-based limit, but I haven't done anything about that yet.

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

The one case when we may need co read the data is in response to DROP of
a table, when we need to discard entries for that object. Or we could
handle that by recording OIDs of dropped objects ... not sure how
complex this would need to be.

We'd still want to enforce some limits, of course - a time-based limit
of the minimum amount of time to cover, and maximum amount of disk space
to use (more as a safety).

FWIW there's one "issue" with enforcing the time-based limit - we can
only do that for the "end time", because that's when we see the entry.
If you configure the limit to keep "1 day" history, and then a vacuum
completes after running for 2 days, we want to keep it, so that anyone
can actually see it.

[1]: /messages/by-id/8df7cee1-31aa-4db3-bbb7-83157ca139da@vondra.me
/messages/by-id/8df7cee1-31aa-4db3-bbb7-83157ca139da@vondra.me

+        checkpoint_log_hook(
+          CheckpointStats.ckpt_start_t,            /* start_time */
+          CheckpointStats.ckpt_end_t,                /* end_time */
+          (flags & CHECKPOINT_IS_SHUTDOWN),        /* is_shutdown */
+          (flags & CHECKPOINT_END_OF_RECOVERY),    /* is_end_of_recovery */
+          (flags & CHECKPOINT_IMMEDIATE),            /* is_immediate */
+          (flags & CHECKPOINT_FORCE),                /* is_force */
+          (flags & CHECKPOINT_WAIT),                /* is_wait */
+          (flags & CHECKPOINT_CAUSE_XLOG),        /* is_wal */
+          (flags & CHECKPOINT_CAUSE_TIME),        /* is_time */
+          (flags & CHECKPOINT_FLUSH_ALL),            /* is_flush_all */
+          CheckpointStats.ckpt_bufs_written,        /* buffers_written */
+          CheckpointStats.ckpt_slru_written,        /* slru_written */
+          CheckpointStats.ckpt_segs_added,        /* segs_added */
+          CheckpointStats.ckpt_segs_removed,        /* segs_removed */
+          CheckpointStats.ckpt_segs_recycled,        /* segs_recycled */

That's a lot of arguments. CheckpointStatsData and the various
CHECKPOINT_* flags are exposed, why not just send these values to the
hook?

Yes, I admit the number of parameters seemed a bit annoying to me too,
and maybe we could reduce that somewhat. Certainly for checkpoints,
where we already have a reasonable CheckpointStats struct and flags,
wrapping most of the fields.

With vacuum it's a bit more complicated, for two reasons: (a) the
counters are simply in LVRelState, mixed with all kinds of other info,
and it seems "not great" to pass it to a "log" hook, and (b) there are
some calculated values, so I guess the hook would need to do that
calculation on it's own, and it'd be easy to diverge over time.

For (a) we could introduce some "stats" struct to keep these counters
for vacuum (the nearby parallel vacuum patch actually does something
like that, I think). Not sure if (b) is actually a problem in practice,
but I guess we could add those fields to the new "stats" struct too.

regards

--
Tomas Vondra

#10

Robert Treat

xzilla@users.sourceforge.net

about 1 year ago

In reply to: Tomas Vondra (#9)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Fri, Dec 27, 2024 at 8:25 PM Tomas Vondra <tomas@vondra.me> wrote:

On 12/27/24 05:00, Michael Paquier wrote:
On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).
+    DefineCustomIntVariable("stats_history.size",
+                            "Sets the amount of memory available for past events.",
How about some time-based retention? Data size can be hard to think
about for the end user, while there would be a set of users that would
want to retain data for the past week, month, etc? If both size and
time upper-bound are define, then entries that match one condition or
the other are removed.
Right. In my response [1] I suggested replacing the simple memory limit
with a time-based limit, but I haven't done anything about that yet.

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

The one case when we may need co read the data is in response to DROP of
a table, when we need to discard entries for that object. Or we could
handle that by recording OIDs of dropped objects ... not sure how
complex this would need to be.

We'd still want to enforce some limits, of course - a time-based limit
of the minimum amount of time to cover, and maximum amount of disk space
to use (more as a safety).

FWIW there's one "issue" with enforcing the time-based limit - we can
only do that for the "end time", because that's when we see the entry.
If you configure the limit to keep "1 day" history, and then a vacuum
completes after running for 2 days, we want to keep it, so that anyone
can actually see it.

I can't say I recall all the reasoning involved in making
pg_stat_statements just be based on a fixed number of entries, but the
ability to come up with corner cases was certainly a factor. For
example, imagine the scenario where you set a max at 30 days, but you
have some tables only being vacuumed every few months. Ideally you
probably want the last entry no matter what, and honestly probably the
last 2 (in case you are troubleshooting something, having the last run
and something to compare against is ideal). In any case, it can get
complicated pretty quickly.

[1]
/messages/by-id/8df7cee1-31aa-4db3-bbb7-83157ca139da@vondra.me
+        checkpoint_log_hook(
+          CheckpointStats.ckpt_start_t,            /* start_time */
+          CheckpointStats.ckpt_end_t,                /* end_time */
+          (flags & CHECKPOINT_IS_SHUTDOWN),        /* is_shutdown */
+          (flags & CHECKPOINT_END_OF_RECOVERY),    /* is_end_of_recovery */
+          (flags & CHECKPOINT_IMMEDIATE),            /* is_immediate */
+          (flags & CHECKPOINT_FORCE),                /* is_force */
+          (flags & CHECKPOINT_WAIT),                /* is_wait */
+          (flags & CHECKPOINT_CAUSE_XLOG),        /* is_wal */
+          (flags & CHECKPOINT_CAUSE_TIME),        /* is_time */
+          (flags & CHECKPOINT_FLUSH_ALL),            /* is_flush_all */
+          CheckpointStats.ckpt_bufs_written,        /* buffers_written */
+          CheckpointStats.ckpt_slru_written,        /* slru_written */
+          CheckpointStats.ckpt_segs_added,        /* segs_added */
+          CheckpointStats.ckpt_segs_removed,        /* segs_removed */
+          CheckpointStats.ckpt_segs_recycled,        /* segs_recycled */
That's a lot of arguments. CheckpointStatsData and the various
CHECKPOINT_* flags are exposed, why not just send these values to the
hook?

For v1-0001 as well, I'd suggest some grouping with existing
structures, or expose these structures so as they can be reused for
out-of-core code via the proposed hook. More arguments lead to more
mistakes that could be easily avoided.
Yes, I admit the number of parameters seemed a bit annoying to me too,
and maybe we could reduce that somewhat. Certainly for checkpoints,
where we already have a reasonable CheckpointStats struct and flags,
wrapping most of the fields.

With vacuum it's a bit more complicated, for two reasons: (a) the
counters are simply in LVRelState, mixed with all kinds of other info,
and it seems "not great" to pass it to a "log" hook, and (b) there are
some calculated values, so I guess the hook would need to do that
calculation on it's own, and it'd be easy to diverge over time.

For (a) we could introduce some "stats" struct to keep these counters
for vacuum (the nearby parallel vacuum patch actually does something
like that, I think). Not sure if (b) is actually a problem in practice,
but I guess we could add those fields to the new "stats" struct too.

At the risk of increasing scope, since you already are working on
checkpoints along with vacuums, I'm curious if there was a reason not
to do analyze stats retention as well? It seems pretty correlated in
the same area/problems as vacuum history.

Robert Treat
https://xzilla.net

#11

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Robert Treat (#10)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 12/29/24 16:39, Robert Treat wrote:

On Fri, Dec 27, 2024 at 8:25 PM Tomas Vondra <tomas@vondra.me> wrote:
On 12/27/24 05:00, Michael Paquier wrote:
On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:

If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
make a fundamental difference ...

Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
limit, or possibly removing it entirely (and accepting anything the
system can handle).
+    DefineCustomIntVariable("stats_history.size",
+                            "Sets the amount of memory available for past events.",
How about some time-based retention? Data size can be hard to think
about for the end user, while there would be a set of users that would
want to retain data for the past week, month, etc? If both size and
time upper-bound are define, then entries that match one condition or
the other are removed.
Right. In my response [1] I suggested replacing the simple memory limit
with a time-based limit, but I haven't done anything about that yet.

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

The one case when we may need co read the data is in response to DROP of
a table, when we need to discard entries for that object. Or we could
handle that by recording OIDs of dropped objects ... not sure how
complex this would need to be.

We'd still want to enforce some limits, of course - a time-based limit
of the minimum amount of time to cover, and maximum amount of disk space
to use (more as a safety).

FWIW there's one "issue" with enforcing the time-based limit - we can
only do that for the "end time", because that's when we see the entry.
If you configure the limit to keep "1 day" history, and then a vacuum
completes after running for 2 days, we want to keep it, so that anyone
can actually see it.
I can't say I recall all the reasoning involved in making
pg_stat_statements just be based on a fixed number of entries, but the
ability to come up with corner cases was certainly a factor. For
example, imagine the scenario where you set a max at 30 days, but you
have some tables only being vacuumed every few months. Ideally you
probably want the last entry no matter what, and honestly probably the
last 2 (in case you are troubleshooting something, having the last run
and something to compare against is ideal). In any case, it can get
complicated pretty quickly.

I really don't want to get into such complicated retention policies,
because there's no "right" solution, and it adds complexity both to
implementation (e.g. we'd need to keep per-relation counters for all the
various events), and processing (How would you know if there were no
other vacuum runs on a relation or if those other events were removed?)

I think the best solution to use a trivial retention policy (e.g. keep
everything for X days), and if you want to keep a longer history, you
need to read and archive that regularly (and the retention period
ensures you don't miss any events).

After all, that's what we assume for most other runtime stats anyway -
it's not very useful to know the current values, if you can't calculate
deltas. So you have to sample the stats.

There's also the trouble that this is not crash safe, so I'd hesitate to
rely on this very much if it can disappear.

...

At the risk of increasing scope, since you already are working on
checkpoints along with vacuums, I'm curious if there was a reason not
to do analyze stats retention as well? It seems pretty correlated in
the same area/problems as vacuum history.

I agree tracking ANALYZE would be useful. I ignored it mostly to keep
the scope as limited as possible, so two separate events were enough.
Adding another hook to do_analyze_rel() would be fairly trivial, but
I'll leave that for later.

regards

--
Tomas Vondra

#12

Jim Nasby

Jim.Nasby@BlueTreble.com

about 1 year ago

In reply to: Tomas Vondra (#3)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Dec 25, 2024, at 11:25 AM, Tomas Vondra <tomas@vondra.me> wrote:

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

Assuming this doesn’t add significant complexity I think file-based would be the best way to go. It allows for more meaningful retention policies (ie, “1 month”) that don’t equate to storage size. More importantly, it removes the need to try and scale these limits based on hardware. While 128MB isn’t a huge amount on modern hardware, it’s also not nothing (especially if it can’t be swapped). This is especially true in an environment with a lot of small database instances (ie, at work our default instance has 8GB of memory, and it’d be nice to reduce that even more).

Speaking of retention, it would be nice if this feature allowed users to DELETE from the view that presented the data. That would allow for any kind of custom config that someone could dream up.

#13

Michael Paquier

michael@paquier.xyz

about 1 year ago

In reply to: Tomas Vondra (#9)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Sat, Dec 28, 2024 at 02:25:16AM +0100, Tomas Vondra wrote:

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

Right, I was looking a bit at 0003 that introduces the extension. I
am wondering if we are not repeating the errors of pgss by using a
different file, and if we should not just use pgstats and its single
file instead to store this data through an extension. You are right
that as an append-only pattern using the dshash of pgstat does not fit
well into this picture. How about the second type of stats kinds:
the fixed-numbered stats kind? These allocate a fixed amount of
shared memory, meaning that you could allocate N entries of history
and just manage a queue of them, then do a memcpy() of the whole set
if adding new history at the head of the queue, or just append new
ones at the tail of the queue in shmem, memcpy() once the queue is
full. The extension gets simpler:
- No need to manage a new file, flush of the stats is controlled by
pgstats itself.
- The extension could register a fixed-numbered custom stats kind.
- Push the stats with the new hooks.
- Perhaps less useful, but it is possible to control the timing where
the data is pushed.
- Add SQL wrappers on top to fetch the data from pgstat.
--
Michael

#14

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Michael Paquier (#13)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 12/31/24 02:06, Michael Paquier wrote:

On Sat, Dec 28, 2024 at 02:25:16AM +0100, Tomas Vondra wrote:

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

Right, I was looking a bit at 0003 that introduces the extension. I
am wondering if we are not repeating the errors of pgss by using a
different file, and if we should not just use pgstats and its single
file instead to store this data through an extension. You are right
that as an append-only pattern using the dshash of pgstat does not fit
well into this picture. How about the second type of stats kinds:
the fixed-numbered stats kind? These allocate a fixed amount of
shared memory, meaning that you could allocate N entries of history
and just manage a queue of them, then do a memcpy() of the whole set
if adding new history at the head of the queue, or just append new
ones at the tail of the queue in shmem, memcpy() once the queue is
full. The extension gets simpler:
- No need to manage a new file, flush of the stats is controlled by
pgstats itself.
- The extension could register a fixed-numbered custom stats kind.

I'm not against leveraging some of the existing pstat infrastructure,
but as I explained earlier I don't like the "fixed amount of shmem"
approach. It's either wasteful (on machines with few vacuum runs) or
difficult to configure to keep enough history.

FWIW I'm not sure what you mean by "pgss errors" - I'm not saying it's
flawless, but OTOH reading/writing a flat file with entries is not
particularly hard. (It'd need to be a bit more complex to evict stuff
for dropped objects, but not by much).

- Push the stats with the new hooks.
- Perhaps less useful, but it is possible to control the timing where
the data is pushed.

Not sure what you mean by "pushed". Pushed where, or how would that matter?

- Add SQL wrappers on top to fetch the data from pgstat.

OK, although I think this is not particularly complicated part of the
code. It could even be simplified further.

regards

--
Tomas Vondra

#15

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jim Nasby (#12)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 12/30/24 22:40, Jim Nasby wrote:

On Dec 25, 2024, at 11:25 AM, Tomas Vondra <tomas@vondra.me> wrote:

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

Assuming this doesn’t add significant complexity I think file-based
would be the best way to go. It allows for more meaningful retention
policies (ie, “1 month”) that don’t equate to storage size. More
importantly, it removes the need to try and scale these limits based on
hardware. While 128MB isn’t a huge amount on modern hardware, it’s also
not nothing (especially if it can’t be swapped). This is especially true
in an environment with a lot of small database instances (ie, at work
our default instance has 8GB of memory, and it’d be nice to reduce that
even more).

True, although I'd point out the 128MB was meant to be a maximum for a
GUC, not the default. The default in the patch was 1MB, and with a lots
of instances on the same machine, those can't be very active so this
would probably be enough.

Anyway, with the simple file-based solution this would go away.

Speaking of retention, it would be nice if this feature allowed users to
DELETE from the view that presented the data. That would allow for any
kind of custom config that someone could dream up.

I really don't intend / want to do that. That significantly increases
the complexity, turning an append-only file into something that's
essentially almost a regular table. Yes, even with the append-only file
there's need to be a way to deal with dropped objects, but supporting
arbitrary deletes seems much more demanding.

There should be a way to reset the stats, of course. But only a fairly
coarse one.

regards

--
Tomas Vondra

#16

Jim Nasby

Jim.Nasby@BlueTreble.com

about 1 year ago

In reply to: Tomas Vondra (#15)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Dec 31, 2024, at 9:20 AM, Tomas Vondra <tomas@vondra.me> wrote:

Speaking of retention, it would be nice if this feature allowed users to
DELETE from the view that presented the data. That would allow for any
kind of custom config that someone could dream up.

I really don't intend / want to do that. That significantly increases
the complexity, turning an append-only file into something that's
essentially almost a regular table. Yes, even with the append-only file
there's need to be a way to deal with dropped objects, but supporting
arbitrary deletes seems much more demanding.

There should be a way to reset the stats, of course. But only a fairly
coarse one.

My thought was that you’d just accumulate the pkey of everything being deleted until end of transaction and then use that to rewrite the table. On the surface that doesn’t seem too terrible, especially depending on how dropped objects as well as aging out old data is handled.

Either way I just wanted to put the idea out there as something I think would be useful, especially since it might influence parts of the simple design. I certainly agree it’s not a mandatory feature.

#17

Cédric Villemain

cedric.villemain@data-bene.io

about 1 year ago

In reply to: Tomas Vondra (#14)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 31/12/2024 16:06, Tomas Vondra wrote:

On 12/31/24 02:06, Michael Paquier wrote:

On Sat, Dec 28, 2024 at 02:25:16AM +0100, Tomas Vondra wrote:

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

Right, I was looking a bit at 0003 that introduces the extension. I
am wondering if we are not repeating the errors of pgss by using a
different file, and if we should not just use pgstats and its single
file instead to store this data through an extension. You are right
that as an append-only pattern using the dshash of pgstat does not fit
well into this picture. How about the second type of stats kinds:
the fixed-numbered stats kind? These allocate a fixed amount of
shared memory, meaning that you could allocate N entries of history
and just manage a queue of them, then do a memcpy() of the whole set
if adding new history at the head of the queue, or just append new
ones at the tail of the queue in shmem, memcpy() once the queue is
full. The extension gets simpler:
- No need to manage a new file, flush of the stats is controlled by
pgstats itself.
- The extension could register a fixed-numbered custom stats kind.

I'm not against leveraging some of the existing pstat infrastructure,
but as I explained earlier I don't like the "fixed amount of shmem"
approach. It's either wasteful (on machines with few vacuum runs) or
difficult to configure to keep enough history.

A bit late on this topic, I wrote StatsMgr which run stats snapshots
based on interval, it covers only fixed stats at the moment though, it's
ring buffer too, not perfect but does the job. It has many similarities
and I though it'll be interesting to share what I was looking at to
improve further.

I think it'll be interesting to have 2 hooks in PgStat_KindInfo
functions call back:

* "flush_fixed_cb()": a hook inside to be able to execute other action
while flushing, like generating aggregates ("counter summary")

* a new member to do the snapshot directly (or whatever action relevant
when the stats are in an interesting state), maybe snapshot_fixed_cb(),
with a hook inside.

This way, with 2 hooks we can cover both usages, for all stats (fixed
here, but probably the same hooks for variable stats), and have actions
on events instead of only on time interval.

I don't have strict opinion about files management but having a facility
from PostgreSQL to store *metrics* (as opposed to "stats" which are
required for PostgreSQL) will be very convenient. Maybe like SLRU or
similar (the thing used to keep commits timestamp) ? I didn't checked
all already available options in this area.

[1]: https://codeberg.org/Data-Bene/StatsMgr

---
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D

#18

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#3)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

Hi,

On Wed, Dec 25, 2024 at 06:25:50PM +0100, Tomas Vondra wrote:

Hi,

On 12/23/24 07:35, wenhui qiu wrote:
Hi Tomasï¿½
ï¿½ ï¿½ ï¿½This is a great feature. ï¿½
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to storeï¿½more history record ?
Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.

Thanks for working on this!

Another idea regarding the storage of those metrics: I think that one would
want to see "precise" data for recent metrics but would probably be fine with some
level of aggregation for historical ones. Something like being able to retrieve
"1 day of raw data" and say one year of data aggregated by day (average, maximum,
minimum , standard deviation and maybe some percentiles) could be fine too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19

Robert Treat

xzilla@users.sourceforge.net

about 1 year ago

In reply to: Bertrand Drouvot (#18)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On Tue, Jan 7, 2025 at 10:44 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Wed, Dec 25, 2024 at 06:25:50PM +0100, Tomas Vondra wrote:
Hi,

On 12/23/24 07:35, wenhui qiu wrote:
Hi Tomas
This is a great feature.
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
increase to store more history record ?
Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.
Thanks for working on this!

Another idea regarding the storage of those metrics: I think that one would
want to see "precise" data for recent metrics but would probably be fine with some
level of aggregation for historical ones. Something like being able to retrieve
"1 day of raw data" and say one year of data aggregated by day (average, maximum,
minimum , standard deviation and maybe some percentiles) could be fine too.

While I'm sure some people are ok with it, I would say that most of
the observability/metrics community has moved away from aggregated
data storage towards raw time series data in tools like prometheus,
tsdb, and timescale in order to avoid the problems that misleading /
lossy / low-resolution data can create.

Robert Treat
https://xzilla.net

#20

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Robert Treat (#19)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

On 1/7/25 21:42, Robert Treat wrote:

On Tue, Jan 7, 2025 at 10:44 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

...

Another idea regarding the storage of those metrics: I think that one would
want to see "precise" data for recent metrics but would probably be fine with some
level of aggregation for historical ones. Something like being able to retrieve
"1 day of raw data" and say one year of data aggregated by day (average, maximum,
minimum , standard deviation and maybe some percentiles) could be fine too.

While I'm sure some people are ok with it, I would say that most of
the observability/metrics community has moved away from aggregated
data storage towards raw time series data in tools like prometheus,
tsdb, and timescale in order to avoid the problems that misleading /
lossy / low-resolution data can create.

That's how I see it too. My primary goal is to provide the raw data,
even if it covers only a limited amount of time, so that it can be
either queried directly, or ingested regularly into something like
prometheus.

I can imagine a more complicated system, aggregating the data after into
a lower resolution (e.g. per day). But that's not a complete solution,
because e.g. what if there are many relations that happen to be related
only once per day?

regards

--
Tomas Vondra

#21

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#20)

PoC: history of recent vacuum/checkpoint runs (using new hooks)

Attachments: