Flush pgstats file during checkpoints
Hi all,
On HEAD, xlog.c has the following comment, which has been on my own
TODO list for a couple of weeks now:
* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.
Please find a patch series to implement that, giving us the possibility
to keep statistics after a crash rather than discarding them. I have
been looking at the code for a while before settling on the following:
- Forcing the flush of the pgstats file to happen during non-shutdown
checkpoint and restart points, after updating the control file's redo
LSN and the critical sections in the area.
- Leaving the current before_shmem_exit() callback around, as a matter
of delaying the flush of the stats for as long as possible in a
shutdown sequence. This also makes the single-user mode shutdown
simpler.
- Adding the redo LSN to the pgstats file, with a bump of
PGSTAT_FILE_FORMAT_ID, so that it can be cross-checked against the
control file's redo LSN when the file is loaded. This change is
independently useful when loading stats after a clean startup as well.
- The crash recovery case is simplified, as there is no more need for
the "discard" code path.
- Not using a scheme where an LSN would be stuck into the stats file
name, which would imply a lookup at the contents of pg_stat/ to clean
up past files at crash recovery. Such lookups should not be costly,
but I'd rather not add more of them.
7ff23c6d277d, which removed the last call of CreateCheckPoint() from
the startup process, is older than 5891c7a8ed8f, yet it seems to me
that pgstats still relies on some assumptions that don't make sense on
HEAD (see the locking mentioned at the top of the write routine, for
example). The logic becomes straightforward to reason about, as
restart points and checkpoints always run from the checkpointer,
implying that pgstat_write_statsfile() is only called from the
postmaster in single-user mode or from the checkpointer itself at
shutdown.
Attached is a patch set, with the last one being the actual feature,
preceded by some preparatory work:
- 0001 adds the redo LSN to the pgstats file flushed.
- 0002 adds an assertion in pgstat_write_statsfile(), to check from
where it is called.
- 0003 with more debugging.
- 0004 is the meat of the thread.
I am adding that to the next CF. Thoughts and comments are welcome.
Thanks,
--
Michael
Attachments:
0001-Add-redo-LSN-to-pgstats-file.patch
0002-Add-assertion-in-pgstat_write_statsfile.patch
0003-Add-some-DEBUG2-information-about-the-redo-LSN-of-th.patch
0004-Flush-pgstats-file-during-checkpoints.patch
On 18/06/2024 9:01 am, Michael Paquier wrote:
Please find a patch series to implement that, giving the possibility
to keep statistics after a crash rather than discard them. [...]
Hi Michael.
I am working on mostly the same problem - persisting pgstat state in
Neon (because of the separation of storage and compute, it has no local files).
I have two questions concerning this PR and the whole strategy for
saving pgstat state between sessions.
1. Right now pgstat.stat is discarded after abnormal Postgres
termination. And in your PR we are storing the LSN in pgstat.stat to
check that it matches the checkpoint redo LSN. I wonder if having an
outdated version of pgstat.stat is worse than not having it at all?
A comment in xlog.c says: "When starting with crash recovery, reset
pgstat data - it might not be valid." But why might it be invalid? We
write it first to a temp file and then rename it. Maybe fsync() should
be added here and durable_rename() used instead of rename(). But that
seems better than losing statistics, and it should not add significant
overhead (because it happens only at shutdown). In your case we are
checking the LSN of the pgstat.stat file. But once again - why is it
better to discard the file than to load the version from the previous
checkpoint?
2. The second question is also related to 1). So we saved pgstat.stat
at a checkpoint, then did some updates, and then Postgres crashed. As
far as I understand, with your patch we will successfully restore the
pgstat.stat file. But it is not actually up-to-date: it doesn't
reflect information about the recent updates. If it was so critical to
keep pgstat.stat up-to-date, then why do we allow restoring state from
the most recent checkpoint?
Thanks,
Konstantin
On 6/28/24 09:32, Konstantin Knizhnik wrote:
[...] But once again - why it is better to discard file than load
version from previous checkpoint?
I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.
2. Second question is also related with 1). So we saved pgstat.stat on
checkpoint, then did some updates and then Postgres crashed. As far as I
understand with your patch we will successfully restore pgstats.stat
file. But it is not actually up-to-date: it doesn't reflect information
about recent updates. If it was so critical to keep pgstat.stat
up-to-date, then why do we allow to restore state on most recent
checkpoint?
Yeah, I was wondering about the same thing - can't this mean we fail to
start autovacuum? Let's say we delete a significant fraction of a huge
table, but then we crash before the next checkpoint. With this patch we
restore the last stats snapshot, which can easily have
n_dead_tup | 0
n_mod_since_analyze | 0
for the table. And thus we fail to notice the table needs autovacuum.
AFAIK we run autovacuum on all tables with missing stats (or am I
wrong?). That's what's happening on replicas after switchover/failover
too, right?
It'd not be such an issue if we updated stats during recovery, but I
don't think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.
Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.
Yeah, I was wondering about the same thing - can't this mean we fail to
start autovacuum? Let's say we delete a significant fraction of a huge
table, but then we crash before the next checkpoint. With this patch we
restore the last stats snapshot, which can easily have
n_dead_tup | 0
n_mod_since_analyze | 0
for the table. And thus we fail to notice the table needs autovacuum.
AFAIK we run autovacuum on all tables with missing stats (or am I
wrong?). That's what's happening on replicas after switchover/failover
too, right?
That's the opposite, please look at relation_needs_vacanalyze(). If a
table does not have any stats found in pgstats, like on HEAD after a
crash, then autoanalyze is skipped and autovacuum happens only for the
anti-wrap case.
For the database case, rebuild_database_list() uses
pgstat_fetch_stat_dbentry() three times, and entirely discards
databases that have no stats. Again, there should be a stats entry
post-crash upon a reconnection.
So there's an argument in saying that the situation does not get worse
here and that we actually may improve odds by keeping a trace of the
previous stats, no? In most cases, there would be a stats entry
post-crash as an INSERT or a scan would have re-created it, but the
stats would reflect the state registered since the last crash recovery
(even on HEAD, a bulk delete bumping the dead tuple counter would not
be detected post-crash). The case of spiky workloads may impact the
decision-making, of course, but at least we'd allow autovacuum to take
some decision rather than giving up entirely based on some previous
state of the stats. That sounds like a win to me for steady workloads,
less so for bursty ones.
It'd not be such an issue if we updated stats during recovery, but I
don't think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.
Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries). If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery. Not sure if that's worth the complication. First,
the stats exist at node-level.
--
Michael
On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:
On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.
Please find attached a rebased patch set with the durability point
addressed in 0001. There were also some conflicts.
Note that I have applied the previous 0002, adding an assert in
pgstat_write_statsfile(), as 734c057a8935, as I've managed to break
this assumption again while hacking more on this area.
--
Michael
Attachments:
v2-0001-Make-write-of-pgstats-file-durable.patch
v2-0002-Add-redo-LSN-to-pgstats-file.patch
v2-0003-Add-some-DEBUG2-information-about-the-redo-LSN-of.patch
v2-0004-Flush-pgstats-file-during-checkpoints.patch
Hi,
On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:
So there's an argument in saying that the situation does not get worse
here and that we actually may improve odds by keeping a trace of the
previous stats, no?
I agree; we could still get autoanalyze/autovacuum skipped due to
obsolete/stale stats being restored from the last checkpoint, but
that's better than having them always skipped (current HEAD).
In most cases, there would be a stats entry
post-crash as an INSERT or a scan would have re-created it,
Yeap.
but the
stats would reflect the state registered since the last crash recovery
(even on HEAD, a bulk delete bumping the dead tuple counter would not
be detected post-crash).
Right.
The case of spiky workloads may impact the
decision-making, of course, but at least we'd allow autovacuum to take
some decision rather than giving up entirely based on some previous
state of the stats. That sounds like a win to me for steady workloads,
less so for bursty ones.
I agree, and it is not worse (though not ideal) than what we currently have on HEAD.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Fri, Jul 12, 2024 at 03:42:21PM +0900, Michael Paquier wrote:
Please find attached a rebased patch set with the durability point
addressed in 0001. There were also some conflicts.
Thanks!
Looking at 0001:
+ /* error logged already */
Maybe mention it's already logged by durable_rename() (like it's done in
InstallXLogFileSegment(), BaseBackup() for example).
Except this nit, 0001 LGTM.
Need to spend more time and thoughts on 0002+.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Fri, Jul 12, 2024 at 12:10:26PM +0000, Bertrand Drouvot wrote:
Need to spend more time and thoughts on 0002+.
I think there is a corner case, say:
1. shutdown checkpoint at LSN1
2. startup -> reads the stat file (contains LSN1) -> all good -> the stat file
is removed after being read
3. crash (no checkpoint occurred between 2. and 3.)
4. startup (last checkpoint is still LSN1) -> no stat file (as removed in 2.)
In that case we start with empty stats.
Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:
Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?
Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.
--
Michael
On Fri, Jul 12, 2024 at 12:10:26PM +0000, Bertrand Drouvot wrote:
Looking at 0001:
+ /* error logged already */
Maybe mention it's already logged by durable_rename() (like it's done in
InstallXLogFileSegment(), BaseBackup() for example).

Except this nit, 0001 LGTM.
Tweaked the comment, and applied 0001 for durable_rename(). Thanks
for the review.
--
Michael
On Tue, Jul 16, 2024 at 10:37:39AM +0900, Michael Paquier wrote:
On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:
Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.
The durable_rename() part has been applied. Please find attached a
rebase of the patch set with all the other comments addressed.
--
Michael
Hi,
On Wed, Jul 17, 2024 at 12:52:12PM +0900, Michael Paquier wrote:
On Tue, Jul 16, 2024 at 10:37:39AM +0900, Michael Paquier wrote:
On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:
Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.

The durable_rename() part has been applied. Please find attached a
rebase of the patch set with all the other comments addressed.
Thanks!
Looking at 0001:
1 ===
- pgstat_write_statsfile();
+ pgstat_write_statsfile(GetRedoRecPtr());
Not related to your patch, but this comment in the GetRedoRecPtr() function:
* grabbed a WAL insertion lock to read the authoritative value in
* Insert->RedoRecPtr
sounds weird. Shouldn't that be s/Insert/XLogCtl/?
2 ===
+ /* Write the redo LSN, used to cross check the file loaded */
Nit: s/loaded/read/?
3 ===
+ /*
+ * Read the redo LSN stored in the file.
+ */
+ if (!read_chunk_s(fpin, &file_redo) ||
+ file_redo != redo)
+ goto error;
I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That
would make it easier to diagnose why the stat file is discarded.
Looking at 0002:
LGTM
Looking at 0003:
4 ===
@@ -5638,10 +5634,7 @@ StartupXLOG(void)
* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.
*/
- if (didCrash)
- pgstat_discard_stats();
- else
- pgstat_restore_stats(checkPoint.redo);
+ pgstat_restore_stats(checkPoint.redo);
remove the TODO comment?
5 ===
+ * process) if the stats file has a redo LSN that matches with the .
unfinished sentence?
6 ===
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
*/
void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
{
Would it make sense to add an Assert in pgstat_flush_stats()?
(checking what the above comment states)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 22, 2024 at 07:01:41AM +0000, Bertrand Drouvot wrote:
1 ===
Not related to your patch, but this comment in the GetRedoRecPtr() function:
* grabbed a WAL insertion lock to read the authoritative value in
* Insert->RedoRecPtr
sounds weird. Shouldn't that be s/Insert/XLogCtl/?
No, the comment is right. We are retrieving a copy of
Insert->RedoRecPtr here.
2 ===
+ /* Write the redo LSN, used to cross check the file loaded */
Nit: s/loaded/read/?
WFM.
3 ===
+ /*
+ * Read the redo LSN stored in the file.
+ */
+ if (!read_chunk_s(fpin, &file_redo) ||
+ file_redo != redo)
+ goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That
would make it easier to diagnose why the stat file is discarded.
Yep. This has been itching me quite a bit, and that's a bit more than
just the format ID or the redo LSN: it relates to all the read_chunk()
callers. I've taken a shot at this with patch 0001, implemented on
top of the rest. Adjusted as well the redo LSN read to have more
error context, now in 0002.
Looking at 0003:
4 ===
@@ -5638,10 +5634,7 @@ StartupXLOG(void)
* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.
*/
- if (didCrash)
- pgstat_discard_stats();
- else
- pgstat_restore_stats(checkPoint.redo);
+ pgstat_restore_stats(checkPoint.redo);

remove the TODO comment?
Pretty sure I've removed that more than once already, and that this is
a rebase accident. Thanks for noticing.
5 ===
+ * process) if the stats file has a redo LSN that matches with the .
unfinished sentence?
This is missing a reference to the control file.
6 ===
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
*/
void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
{

Would it make sense to add an Assert in pgstat_flush_stats()?
(checking what the above comment states)
There is one in pgstat_write_statsfile(), not sure there is a point in
duplicating the assertion in both.
Attaching a new v4 series, with all these comments addressed.
--
Michael
Attachments:
v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
v4-0002-Add-more-debugging-information-when-reading-stats.patch
v4-0003-Add-redo-LSN-to-pgstats-file.patch
v4-0004-Flush-pgstats-file-during-checkpoints.patch
Hi,
On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:
On Mon, Jul 22, 2024 at 07:01:41AM +0000, Bertrand Drouvot wrote:
3 ===
+ /*
+ * Read the redo LSN stored in the file.
+ */
+ if (!read_chunk_s(fpin, &file_redo) ||
+ file_redo != redo)
+ goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That
would make it easier to diagnose why the stat file is discarded.

Yep. This has been itching me quite a bit, and that's a bit more than
just the format ID or the redo LSN: it relates to all the read_chunk()
callers. I've taken a shot at this with patch 0001, implemented on
top of the rest.
Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attach the right one.
Attaching a new v4 series, with all these comments addressed.
Thanks!
Looking at 0002:
1 ===
if (!read_chunk(fpin, ptr, info->shared_data_len))
+ {
+ elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+ kind, t);
Nit: what about including the "info->shared_data_len" value in the WARNING?
2 ===
if (!read_chunk_s(fpin, &name))
+ {
+ elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+ kind, t);
goto error;
+ }
if (!pgstat_is_kind_valid(kind))
+ {
+ elog(WARNING, "invalid stats kind %d for entry of type %c",
+ kind, t);
goto error;
+ }
Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?
if (!read_chunk_s(fpin, &kind))
+ {
+ elog(WARNING, "could not read stats kind for entry of type %c", t);
goto error;
+ }
Looking at 0003: LGTM
Looking at 0004: LGTM
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 29, 2024 at 04:46:17AM +0000, Bertrand Drouvot wrote:
Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attached the right one.
I did attach the right set of patches, please ignore 0001 entirely:
the patch series is made of three patches, beginning with 0002 :)
Looking at 0002:
if (!read_chunk(fpin, ptr, info->shared_data_len))
+ {
+ elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+ kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?
Good idea, so added.
if (!read_chunk_s(fpin, &name))
+ {
+ elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+ kind, t);
goto error;
+ }
if (!pgstat_is_kind_valid(kind))
+ {
+ elog(WARNING, "invalid stats kind %d for entry of type %c",
+ kind, t);
goto error;
+ }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?
Hmm. We could, but this order is not buggy either. So I've left it
as-is for now, just adding the WARNINGs.
By the way, I have noticed an extra path where a WARNING would not be
logged while playing with corrupted pgstats files: when the entry type
itself is incorrect. I have added an extra elog() in this case, and
applied 0001. Err.. 0002, sorry ;)
Looking at 0003: LGTM
Looking at 0004: LGTM
Thanks. Attached are the two remaining patches, for now. I'm
planning to get back to these in a few days.
--
Michael
Hi,
On Tue, Jul 30, 2024 at 03:25:31PM +0900, Michael Paquier wrote:
On Mon, Jul 29, 2024 at 04:46:17AM +0000, Bertrand Drouvot wrote:
Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attach the right one.

I did attach the right set of patches, please ignore 0001 entirely:
the patch series is made of three patches, beginning with 0002 :)
Yeah, saw that ;-)
Looking at 0002:
if (!read_chunk(fpin, ptr, info->shared_data_len))
+ {
+ elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+ kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?

Good idea, so added.
Thanks!
Looking at 0003: LGTM
Looking at 0004: LGTM

Thanks. Attached are the two remaining patches, for now. I'm
planning to get back to these in a few days.
Did a quick check and still LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 30, 2024 at 08:53:48AM +0000, Bertrand Drouvot wrote:
Did a quick check and still LGTM.
Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)
Attached is the last one.
--
Michael
Attachments:
v6-0001-Flush-pgstats-file-during-checkpoints.patch
On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:
Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.
The CF bot has been complaining in injection_points as an effect of
the stats remaining after a crash, so rebased to adapt to that.
--
Michael
Attachments:
v7-0001-Flush-pgstats-file-during-checkpoints.patch
Hi,
On Mon, Aug 26, 2024 at 01:56:40PM +0900, Michael Paquier wrote:
On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:
Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.
The CF bot has been complaining in injection_points as an effect of
the stats remaining after a crash, so rebased to adapt to that.
Thanks!
Checking the V7 diffs as compared to V4:
1. In pgstat_write_statsfile():
- elog(DEBUG2, "writing stats file \"%s\"", statfile);
+ elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+ LSN_FORMAT_ARGS(redo));
2. and the ones in injection_points/t/001_stats.pl:
+# On crash the stats are still there.
$node->stop('immediate');
$node->start;
$numcalls = $node->safe_psql('postgres',
"SELECT injection_points_stats_numcalls('stats-notice');");
-is($numcalls, '', 'number of stats after crash');
+is($numcalls, '3', 'number of stats after crash');
$fixedstats = $node->safe_psql('postgres',
"SELECT * FROM injection_points_stats_fixed();");
-is($fixedstats, '0|0|0|0|0', 'fixed stats after crash');
+is($fixedstats, '1|0|2|1|1', 'fixed stats after crash');
They both LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
It'd not be such an issue if we updated stats during recovery, but I
don't think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries). If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery. Not sure if that's worth the complication. First,
the stats exist at node-level.
Hmm, I'm a bit disappointed this doesn't address replication. It makes
sense that scans are counted separately on a standby, but it would be
nice if stats like last_vacuum were propagated from primary to standbys.
I guess that can be handled separately later.
Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:
There are various race conditions where a stats entry can be leaked in
the pgstats file, i.e. a relation is dropped, but its stats entry is
retained in the stats file after a crash. In the worst case, such
leaked entries can accumulate until the stats file is manually removed,
which resets all stats again. Perhaps that's acceptable - it's still
possible to leak the actual relation file for a new relation on crash,
after all, which is much worse (I'm glad Horiguchi-san is working on
that [1]).
For example:
1. BEGIN; CREATE TABLE foo (); ANALYZE foo;
2. CHECKPOINT;
3. pg_ctl restart -m immediate
This is the same scenario where we leak the relfile, but now you can
have it with e.g. function statistics too.
Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or
can we make it more watertight so that there are no leaks?
If you do this:
pg_ctl start -D data
pg_ctl stop -D data -m immediate
pg_ctl start -D data
pg_ctl stop -D data -m immediate
You get this in the log:
2024-09-02 16:28:37.874 EEST [1397281] WARNING: found incorrect redo
LSN 0/160A3C8 (expected 0/160A440)
I think it's failing to flush the stats file at the end of recovery
checkpoint.
[1]: /messages/by-id/20240901.010925.656452225144636594.horikyota.ntt@gmail.com
--
Heikki Linnakangas
Neon (https://neon.tech)