Changing the state of data checksums in a running cluster
After some off-list discussion about the desirability of this feature, where
several hackers opined that it's something that we should have, I've decided to
rebase this patch and submit it one more time. There are several (long)
threads covering the history of this patch [0][1], related work stemming from
this [2] as well as earlier attempts and discussions [3][4]. Below I try to
respond to a summary of points raised in those threads.
The mechanics of the patch haven't changed since the last posted version; it has
mainly been polished slightly. A high-level overview of the processing is:
It's using a launcher/worker model where the launcher will spawn a worker per
database which will traverse all pages and dirty them in order to calculate and
set the checksum on them. During this inprogress state all backends calculate
and write checksums but don't verify them on read. Once all pages have been
checksummed the state of the cluster will switch over to "on" synchronized
across all backends with a procsignalbarrier. At this point checksums are
verified and processing is equal to checksums having been enabled at initdb. When
a user disables checksums the cluster enters a state where all backends still
write checksums until all backends have acknowledged that they have stopped
verifying checksums (again using a procsignalbarrier). At this point the
cluster switches to "off" and checksums are neither written nor verified. In
case the cluster is restarted, voluntarily or via a crash, processing will have
to be restarted (more on that further down).
The user-facing controls for this are two SQL-level functions, for enabling and
disabling. The existing data_checksums GUC remains but is expanded with more
possible states (with on/off retained).
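To make the intended workflow concrete, a minimal sketch of a session (the
function and state names are those used in this patch, exact signatures may
still change):
    -- start the background processing; the cluster enters "inprogress-on"
    SELECT pg_enable_data_checksums();
    SHOW data_checksums;  -- transition state while pages are checksummed,
                          -- "on" once all pages have been processed
    -- disabling is the reverse transition, via "inprogress-off" to "off"
    SELECT pg_disable_data_checksums();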
Complaints against earlier versions
===================================
Seasoned hackers might remember that this patch has been on -hackers before.
There has been a lot of review, and AFAICT all specific comments have been
addressed. There are however a few larger more generic complaints:
* Restartability - the initial version of the patch did not support stateful
restarts, a shutdown performed (or crash) before checksums were enabled would
result in a need to start over from the beginning. This was deemed the safe
orchestration method. The lack of this feature was seen as a serious drawback,
so it was added. Subsequent review instead found the patch to be too
complicated with a too large featureset. I think there is merit to both of
these arguments: being able to restart is a great feature; and being able to
reason about the correctness of a smaller patch is also great. As of this
submission I have removed the ability to restart to keep the scope of the patch
small (which is where the previous version was, which received no review after
the removal). The way I prefer to frame this is to first add scaffolding and
infrastructure (this patch) and leave refinements and add-on features
(restartability, but also others like parallel workers, optimizing rare cases,
etc) for follow-up patches.
* Complexity - it was brought up that this is a very complex patch for a niche
feature, and there is a lot of truth to that. It is inherently complex to
change a pg_control level state of a running cluster. There might be ways to
make the current patch less complex, while not sacrificing stability, and if so
that would be great. A lot of the complexity came from being able to
restart processing, and that's now removed for this version, but it's clearly
not close to a one-line-diff even without it.
Other complaints were addressed, in part by the invention of procsignalbarriers
which makes this synchronization possible. In re-reading the threads I might
have missed something which is still left open, and if so I do apologize for
that.
Open TODO items:
================
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner on it. This is overly
aggressive and dialling it back while still being able to run fast tests is a
TODO. Not sure what the best option is there.
* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored. So far I've been using pg_stat_activity but I
don't disagree that it's not a very sharp tool for this. Maybe we need a
specific function or view or something? There clearly needs to be a way for a
user to query state and progress of a transition; a rough pg_stat_activity
sketch follows this list.
* Throttling - right now the patch uses the vacuum access strategy, with the
same cost options as vacuum, in order to implement throttling. This is in part
due to the patch starting out modelled around autovacuum as a worker, but it
may not be the right match for throttling checksums.
* Naming - the in-between states when data checksums are enabled or disabled
are called inprogress-on and inprogress-off. The reason for this is simply
that early on there were only three states: inprogress, on and off, and the
process of disabling wasn't labeled with a state. When this transition state
was added it seemed like a good idea to tack the end-goal onto the transition.
These state names make the code easily greppable but might not be the most
obvious choices for anything user facing. Is "Enabling" and "Disabling" better
terms to use (across the board or just user facing) or should we stick to the
current?
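As a stopgap for the monitoring item above, the kind of ad-hoc query I've been
running against pg_stat_activity looks roughly like the below; note that the
backend_type pattern is an assumption about how the launcher and workers
present themselves:
    -- rough progress check; the backend_type match is a guess
    SELECT pid, backend_type, wait_event_type, wait_event, query
      FROM pg_stat_activity
     WHERE backend_type ILIKE '%checksum%';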
There are ways in which this processing can be optimized to achieve better
performance, but in order to keep goalposts in sight and the patch size down they
are left as future work.
--
Daniel Gustafsson
[0]: /messages/by-id/CABUevExz9hUUOLnJVr2kpw9Cx=o4MCr1SVKwbupzuxP7ckNutA@mail.gmail.com
[1]: /messages/by-id/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh+ziCSa5YLLU_BA@mail.gmail.com
[2]: /messages/by-id/20181030051643.elbxjww5jjgnjaxg@alap3.anarazel.de
[3]: /messages/by-id/FF393672-5608-46D6-9224-6620EC532693@endpoint.com
[4]: /messages/by-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp=-7OJWBbcg@mail.gmail.com
Attachments:
v1-0001-Support-checksum-enable-disable-in-a-running-clus.patch (+2691 -49)
Hi Daniel,
Thanks for rebasing the patch and submitting it again!
On 7/3/24 08:41, Daniel Gustafsson wrote:
After some off-list discussion about the desirability of this feature, where
several hackers opined that it's something that we should have, I've decided to
rebase this patch and submit it one more time. There are several (long)
threads covering the history of this patch [0][1], related work stemming from
this [2] as well as earlier attempts and discussions [3][4]. Below I try to
respond to a summary of points raised in those threads.
The mechanics of the patch haven't changed since the last posted version; it has
mainly been polished slightly. A high-level overview of the processing is:
It's using a launcher/worker model where the launcher will spawn a worker per
database which will traverse all pages and dirty them in order to calculate and
set the checksum on them. During this inprogress state all backends calculate
and write checksums but don't verify them on read. Once all pages have been
checksummed the state of the cluster will switch over to "on" synchronized
across all backends with a procsignalbarrier. At this point checksums are
verified and processing is equal to checksums having been enabled at initdb. When
a user disables checksums the cluster enters a state where all backends still
write checksums until all backends have acknowledged that they have stopped
verifying checksums (again using a procsignalbarrier). At this point the
cluster switches to "off" and checksums are neither written nor verified. In
case the cluster is restarted, voluntarily or via a crash, processing will have
to be restarted (more on that further down).
The user-facing controls for this are two SQL-level functions, for enabling and
disabling. The existing data_checksums GUC remains but is expanded with more
possible states (with on/off retained).
Complaints against earlier versions
===================================
Seasoned hackers might remember that this patch has been on -hackers before.
There has been a lot of review, and AFAICT all specific comments have been
addressed. There are however a few larger more generic complaints:
* Restartability - the initial version of the patch did not support stateful
restarts, a shutdown performed (or crash) before checksums were enabled would
result in a need to start over from the beginning. This was deemed the safe
orchestration method. The lack of this feature was seen as a serious drawback,
so it was added. Subsequent review instead found the patch to be too
complicated with a too large featureset. I think there is merit to both of
these arguments: being able to restart is a great feature; and being able to
reason about the correctness of a smaller patch is also great. As of this
submission I have removed the ability to restart to keep the scope of the patch
small (which is where the previous version was, which received no review after
the removal). The way I prefer to frame this is to first add scaffolding and
infrastructure (this patch) and leave refinements and add-on features
(restartability, but also others like parallel workers, optimizing rare cases,
etc) for follow-up patches.
I 100% support this approach.
Sure, I'd like to have a restartable tool, but clearly that didn't go
particularly well, and we still have nothing to enable checksums online.
That doesn't seem to benefit anyone - to me it seems reasonable to get
the non-restartable tool in, and then maybe later someone can improve
this to make it restartable. Thanks to the earlier work we know it's
doable, even if it was too complex.
This way it's at least possible to enable checksums online with some
additional care (e.g. to make sure no one restarts the cluster etc.).
I'd bet that for the vast majority of systems this will work just fine. Huge
systems with some occasional / forced restarts may not be able to make
this work - but then again, that's no worse than now.
* Complexity - it was brought up that this is a very complex patch for a niche
feature, and there is a lot of truth to that. It is inherently complex to
change a pg_control level state of a running cluster. There might be ways to
make the current patch less complex, while not sacrificing stability, and if so
that would be great. A lot of the complexity came from being able to
restart processing, and that's now removed for this version, but it's clearly
not close to a one-line-diff even without it.
I'd push back on this a little bit - the patch looks like this:
50 files changed, 2691 insertions(+), 48 deletions(-)
and if we ignore the docs / perl tests, then the two parts that stand
out are
src/backend/access/transam/xlog.c | 455 +++++-
src/backend/postmaster/datachecksumsworker.c | 1353 +++++++++++++++++
I don't think the worker code is exceptionally complex. Yes, it's not
trivial, but a lot of the 1353 inserts is comments (which is good) or
generic infrastructure to start the worker etc.
Other complaints were addressed, in part by the invention of procsignalbarriers
which makes this synchronization possible. In re-reading the threads I might
have missed something which is still left open, and if so I do apologize for
that.
Open TODO items:
================
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner on it. This is overly
aggressive and dialling it back while still being able to run fast tests is a
TODO. Not sure what the best option is there.
Why not add a parameter to pg_enable_data_checksums(), specifying
whether to do immediate checkpoint or wait for the next one? AFAIK
that's what we do in pg_backup_start, for example.
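For comparison, a sketch of that existing precedent, where the second
parameter of pg_backup_start chooses between an immediate and a spread
checkpoint:
    -- fast => true requests an immediate checkpoint
    SELECT pg_backup_start('label', fast => true);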
* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored. So far I've been using pg_stat_activity but I
don't disagree that it's not a very sharp tool for this. Maybe we need a
specific function or view or something? There clearly needs to be a way for a
user to query state and progress of a transition.
Yeah, I think a view like pg_stat_progress_checksums would work.
* Throttling - right now the patch uses the vacuum access strategy, with the
same cost options as vacuum, in order to implement throttling. This is in part
due to the patch starting out modelled around autovacuum as a worker, but it
may not be the right match for throttling checksums.
IMHO it's reasonable to reuse the vacuum throttling. Even if it's not
perfect, it does not seem great to invent something new and end up with
two different ways to throttle stuff.
* Naming - the in-between states when data checksums are enabled or disabled
are called inprogress-on and inprogress-off. The reason for this is simply
that early on there were only three states: inprogress, on and off, and the
process of disabling wasn't labeled with a state. When this transition state
was added it seemed like a good idea to tack the end-goal onto the transition.
These state names make the code easily greppable but might not be the most
obvious choices for anything user facing. Is "Enabling" and "Disabling" better
terms to use (across the board or just user facing) or should we stick to the
current?
I think the naming is fine. In the worst case we can rename that later,
seems more like a detail.
There are ways in which this processing can be optimized to achieve better
performance, but in order to keep goalposts in sight and the patch size down they
are left as future work.
+1
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote:
* Restartability - the initial version of the patch did not support stateful
restarts, a shutdown performed (or crash) before checksums were enabled would
result in a need to start over from the beginning. This was deemed the safe
orchestration method. The lack of this feature was seen as a serious drawback,
so it was added. Subsequent review instead found the patch to be too
complicated with a too large featureset. I think there is merit to both of
these arguments: being able to restart is a great feature; and being able to
reason about the correctness of a smaller patch is also great. As of this
submission I have removed the ability to restart to keep the scope of the patch
small (which is where the previous version was, which received no review after
the removal). The way I prefer to frame this is to first add scaffolding and
infrastructure (this patch) and leave refinements and add-on features
(restartability, but also others like parallel workers, optimizing rare cases,
etc) for follow-up patches.
I 100% support this approach.
Yes, I was very disappointed when restartability sunk the patch, and I
saw this as another case where saying "yes" to every feature improvement
can lead to failure.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On 3 Jul 2024, at 13:20, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Thanks for rebasing the patch and submitting it again!
Thanks for review, sorry for being so slow to pick this up again.
The attached version is a rebase with some level of cleanup and polish all
around, and most importantly it addresses the two points raised below.
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner on it. This is overly
aggressive and dialling it back while still being able to run fast tests is a
TODO. Not sure what the best option is there.
Why not add a parameter to pg_enable_data_checksums(), specifying
whether to do immediate checkpoint or wait for the next one? AFAIK
that's what we do in pg_backup_start, for example.
That's a good idea, pg_enable_data_checksums now accepts a third parameter
"fast" (defaults to false) which will enable immediate checkpoints when true.
* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored. So far I've been using pg_stat_activity but I
don't disagree that it's not a very sharp tool for this. Maybe we need a
specific function or view or something? There clearly needs to be a way for a
user to query state and progress of a transition.
Yeah, I think a view like pg_stat_progress_checksums would work.
Added in the attached version. It probably needs some polish (the docs for
sure do) but it's at least a start.
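Polling it during a transition is then simply (view name as spelled in this
version of the patch):
    -- repeat (or \watch) while checksums are being enabled
    SELECT * FROM pg_stat_progress_datachecksums;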
--
Daniel Gustafsson
Attachments:
v2-0001-Support-checksum-enable-disable-in-a-running-clus.patch (+3001 -50)
Hi,
On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
Yeah, I think a view like pg_stat_progress_checksums would work.
Added in the attached version. It probably needs some polish (the docs for
sure do) but it's at least a start.
Just a nitpick, but we call it data_checksums about everywhere, but the
new view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite
long.
Michael
On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
Hi,
On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
Yeah, I think a view like pg_stat_progress_checksums would work.
Added in the attached version. It probably needs some polish (the docs for
sure do) but it's at least a start.
Just a nitpick, but we call it data_checksums about everywhere, but the
new view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite
long.
That's a fair point, I'll make sure to switch for the next version of the
patch.
--
Daniel Gustafsson
On 1 Oct 2024, at 20:55, Daniel Gustafsson <daniel@yesql.se> wrote:
On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
Hi,
On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
Yeah, I think a view like pg_stat_progress_checksums would work.
Added in the attached version. It probably needs some polish (the docs for
sure do) but it's at least a start.
Just a nitpick, but we call it data_checksums about everywhere, but the
new view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite
long.
That's a fair point, I'll make sure to switch for the next version of the
patch.
A rebased v3 attached with that change.
--
Daniel Gustafsson
Attachments:
v3-0001-Support-checksum-enable-disable-in-a-running-clus.patch (+3001 -50)
Hi,
I did a quick review today. First a couple minor comments:
1) monitoring.sgml
typos: number of database -> number of databases
calcuated -> calculated
2) unnecessary newline in heapam.c (no other changes)
3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
shadowing earlier variable
4) unnecessary comment change in bufmgr.c (no other changes)
5) unnecessary include and newline in bulk_write.c (no other changes)
6) unnecessary newline in pgstat.c (no other changes)
7) controldata.c - maybe this
if (oldctrl->data_checksum_version == 2)
should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.
8) xlog_internal.h - xl_checksum_state should be added to typedefs
9) system_functions.sql - Isn't it weird that this only creates the new
pg_enable_data_checksums function, but not pg_disable_data_checksums? It
also means it doesn't revoke EXECUTE from public on it, which I guess it
probably should? Or why should this be different for the two functions?
Also the error message seems to differ:
test=> select pg_enable_data_checksums();
ERROR: permission denied for function pg_enable_data_checksums
test=> select pg_disable_data_checksums();
ERROR: must be superuser
Probably harmless, but seems a bit strange.
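For the record, this is the kind of statement I'd expect system_functions.sql
to gain for the disable function, mirroring the enable one (a sketch, not the
actual patch contents):
    REVOKE EXECUTE ON FUNCTION pg_disable_data_checksums() FROM public;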
But there also seems to be a more serious problem with recovery. I did a
simple script that does a loop of
* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster
And this unfortunately hits this assert:
bool
AbsorbChecksumsOnBarrier(void)
{
Assert(LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
return true;
}
Based on our short discussion about this, the controlfile gets updated
right after pg_enable_data_checksums() completes. The immediate stop
however forces a recovery since the last checkpoint, which means we see
the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
exit recovery, try to start checkpointer and it trips over this, because
the control file already has the "on" value :-(
I'm not sure what's the best way to fix this. Would it be possible to
remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
noop in that case? Or not set the barrier when exiting recovery. I'm not
sure the relaxed assert would remain meaningful, though. What would it
check on standbys, for example?
Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning? Possibly even with
the same "fast=true/false" logic. That would prevent us from seeing the
XLOG_CHECKSUMS wal record with the updated flag. It would extend the
"window" where a crash would mean we have to redo the checksums, but I
don't think that matters much. For small databases who cares, and for
large databases it should not be a meaningful difference (setting the
checksums already ran over multiple checkpoints, so one checkpoint is
not a big difference).
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
shadowing earlier variable
All the ListCell variables can be eliminated by using the foreach_ptr
and foreach_oid macros instead of plain foreach.
- ilmari
On 7 Oct 2024, at 16:46, Tomas Vondra <tomas@vondra.me> wrote:
I did a quick review today. First a couple minor comments:
Thanks for looking! 1-6 are all fixed.
7) controldata.c - maybe this
if (oldctrl->data_checksum_version == 2)
should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.
It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
we should move (or mirror) the checksum versions to storage/checksum_impl.h to
make them available to frontend and backend tools?
8) xlog_internal.h - xl_checksum_state should be added to typedefs
Fixed.
9) system_functions.sql - Isn't it weird that this only creates the new
pg_enable_data_checksums function, but not pg_disable_data_checksums?
We don't need any DEFAULT values for pg_disable_data_checksums so it doesn't
need to be created there.
It
also means it doesn't revoke EXECUTE from public on it, which I guess it
probably should? Or why should this be different for the two functions?
That should however be done, so fixed.
But there also seems to be a more serious problem with recovery. I did a
simple script that does a loop of
* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster
And this unfortunately hits this assert:
bool
AbsorbChecksumsOnBarrier(void)
{
Assert(LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
return true;
}
Based on our short discussion about this, the controlfile gets updated
right after pg_enable_data_checksums() completes. The immediate stop
however forces a recovery since the last checkpoint, which means we see
the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
exit recovery, try to start checkpointer and it trips over this, because
the control file already has the "on" value :-(
I'm not sure what's the best way to fix this. Would it be possible to
remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
noop in that case? Or not set the barrier when exiting recovery. I'm not
sure the relaxed assert would remain meaningful, though. What would it
check on standbys, for example?
Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning? Possibly even with
the same "fast=true/false" logic. That would prevent us from seeing the
XLOG_CHECKSUMS wal record with the updated flag. It would extend the
"window" where a crash would mean we have to redo the checksums, but I
don't think that matters much. For small databases who cares, and for
large databases it should not be a meaningful difference (setting the
checksums already ran over multiple checkpoints, so one checkpoint is
not a big difference).
The more I think about it the more I think that updating the control file is
the wrong thing to do for this patch; it should only change the state in memory
and let the checkpoints update the controlfile. The attached fixes that and I
can no longer reproduce the assertion failure you hit.
The attached version also contains updates to the documentation, the aux proc
counter and other smaller bits of polish.
I did remove parts of the progress reporting for now since it can't be used
from the dynamic backgroundworker it seems. I need to regroup and figure out a
better way there, but I wanted to address your above find sooner rather than
wait for that.
--
Daniel Gustafsson
Attachments:
v4-0001-Online-enabling-and-disabling-of-data-checksums.patch (+2991 -50)
On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
Tomas Vondra <tomas@vondra.me> writes:
3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
shadowing earlier variable
All the ListCell variables can be eliminated by using the foreach_ptr
and foreach_oid macros instead of plain foreach.
Fair point, done in the v4 attached upthread.
--
Daniel Gustafsson
On 10/8/24 22:38, Daniel Gustafsson wrote:
7) controldata.c - maybe this
if (oldctrl->data_checksum_version == 2)
should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.
It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
we should move (or mirror) the checksum versions to storage/checksum_impl.h to
make them available to frontend and backend tools?
+1 to have checksum_impl.h
But there also seems to be a more serious problem with recovery. I did a
simple script that does a loop of
* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster
And this unfortunately hits this assert:
bool
AbsorbChecksumsOnBarrier(void)
{
Assert(LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
return true;
}
Based on our short discussion about this, the controlfile gets updated
right after pg_enable_data_checksums() completes. The immediate stop
however forces a recovery since the last checkpoint, which means we see
the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
exit recovery, try to start checkpointer and it trips over this, because
the control file already has the "on" value :-(
I'm not sure what's the best way to fix this. Would it be possible to
remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
noop in that case? Or not set the barrier when exiting recovery. I'm not
sure the relaxed assert would remain meaningful, though. What would it
check on standbys, for example?
Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning? Possibly even with
the same "fast=true/false" logic. That would prevent us from seeing the
XLOG_CHECKSUMS wal record with the updated flag. It would extend the
"window" where a crash would mean we have to redo the checksums, but I
don't think that matters much. For small databases who cares, and for
large databases it should not be a meaningful difference (setting the
checksums already ran over multiple checkpoints, so one checkpoint is
not a big difference).
The more I think about it the more I think that updating the control file is
the wrong thing to do for this patch; it should only change the state in memory
and let the checkpoints update the controlfile. The attached fixes that and I
can no longer reproduce the assertion failure you hit.
I think leaving the update of controlfile to checkpointer is correct,
and probably the only way to make this correct (without race
conditions). We need to do that atomically with the checkpoint (which
updates the redo LSN, guaranteeing we won't see the XLOG_CHECKSUMS
record again).
I ran the tests with this new patch, and I haven't reproduced the
crashes. I'll let it run a bit longer, and improve it to test some more
stuff, but it looks good.
regards
--
Tomas Vondra
On 9 Oct 2024, at 12:41, Tomas Vondra <tomas@vondra.me> wrote:
should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.
It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
we should move (or mirror) the checksum versions to storage/checksum_impl.h to
make them available to frontend and backend tools?
+1 to have checksum_impl.h
I tried various different ways of breaking out the checksum version into
another header file but all of them ended up messier than the current state due
to how various tools include the checksum code. In the end I opted for doing
the bufpage include to keep it simple. This patch is big enough as it is
without additional refactoring of checksum (header) code; that can be done
separately from this.
I ran the tests with this new patch, and I haven't reproduced the
crashes. I'll let it run a bit longer, and improve it to test some more
stuff, but it looks good.
Thanks for testing, I too am unable to reproduce that error.
The attached v5 has the above include fix as well as pgindent and pgperltidy
runs and some tweaking to the commit message to make it concise. It's also
rebased to handle a recent conflict in the makefiles.
--
Daniel Gustafsson
Attachments:
v5-0001-Online-enabling-and-disabling-of-data-checksums.patch (+3001 -50)
Attached is a rebased v6 fixing the tests to handle that checksums are now on
by default, no other changes are made as no outstanding review comments or
identified bugs exist.
Does anyone object to going ahead with this?
--
Daniel Gustafsson
Attachments:
v6-0001-Online-enabling-and-disabling-of-data-checksums.patch (+3002 -50)
On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
Attached is a rebased v6 fixing the tests to handle that checksums are now on
by default, no other changes are made as no outstanding review comments or
identified bugs exist.
Does anyone object to going ahead with this?
And a new rebase to cope with recent changes.
--
Daniel Gustafsson
Attachments:
v7-0001-Online-enabling-and-disabling-of-data-checksums.patch (+3002 -50)
On 5 Nov 2024, at 13:51, Daniel Gustafsson <daniel@yesql.se> wrote:
On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
Attached is a rebased v6 fixing the tests to handle that checksums are now on
by default, no other changes are made as no outstanding review comments or
identified bugs exist.
Does anyone object to going ahead with this?
And a new rebase to cope with recent changes.
..and one more since I forgot to git add the new expected output testfile.
--
Daniel Gustafsson
Attachments:
v8-0001-Online-enabling-and-disabling-of-data-checksums.patch (+3030 -50)
Hi,
Unfortunately it seems we're not out of the woods yet :-(
I started doing some more testing on the v8 patch. My plan was to do
some stress testing with physical replication, random restarts and stuff
like that. But I ran into issues before that.
Attached is a reproducer script, that does this:
1) initializes an instance with a small (scale 10) pgbench database
2) runs a pgbench in the background, and flips checksums
3) restarts the database with fast or immediate mode
4) watches for checksums state until it reaches expected value
5) restarts the instance
Of course, the restart interrupts the checksum enable, with this message
in the log:
WARNING: data checksums are being enabled, but no worker is running
1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:] [672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT: If checksums were being enabled during shutdown then processing must be manually restarted.
That's expected, of course. So I did
SELECT pg_enable_data_checksums()
and "datachecksumsworker launcher" appeared in pg_stat_activity, but
nothing else was happening. It also says:
Waiting for worker in database template0 (pid 258442)
But there are no workers with that PID. Not in the OS, not in the view,
not in the server log. Seems a bit weird. Maybe it already completed,
but then why is there a launcher waiting for it?
Ultimately I tried running CHECKPOINT, and that apparently did the
trick, and the instance restarted. But then on start it hits an assert that:
(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
But this only happens in the final stop is -m immediate. If I change it
to "-m fast" it works.
I haven't looked into the details, but I guess it's related to the issue
with controlfile update we dealt with about a month ago.
Attached is the test.sh file (make sure to tweak the paths), and an
example of the backtraces. I've seen various processes hitting that.
Two more comments:
* It's a bit surprising that pg_disable_data_checksums() flips the state
right away, while pg_enable_data_checksums() waits for a checkpoint. I
guess it's correct, but maybe the docs should mention this difference?
* The docs currently say:
<para>
If the cluster is stopped while in <literal>inprogress-on</literal> mode,
for any reason, then this process must be restarted manually. To do this,
re-execute the function <function>pg_enable_data_checksums()</function>
once the cluster has been restarted. The background worker will attempt
to resume the work from where it was interrupted.
</para>
I believe that's incorrect/misleading. There's no attempt to resume work
from where it was interrupted.
regards
--
Tomas Vondra
Hi,
I spent a bit more time doing some testing on the last version of the
patch from [1]. And I ran into this assert in PostmasterStateMachine()
when stopping the cluster:
/* All types should be included in targetMask or remainMask */
Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
At first I was puzzled as this happens on every shutdown, but that's
because these checks were introduced by a78af0427015 a week ago. So it's
more a matter of rebasing.
However, I also noticed the progress monitoring does not really work. I
get stuff like this:
+ psql -x test -c 'select * from pg_stat_progress_data_checksums'
-[ RECORD 1 ]---------------------+---------
pid | 56811
datid | 0
datname |
phase | enabling
databases_total | 4
relations_total |
databases_processed | 0
relations_processed | 0
databases_current | 16384
relation_current | 0
relation_current_blocks | 0
relation_current_blocks_processed | 0
But I've never seen any of the "processed" fields be non-zero (and
relations is even NULL), and the same thing applies to the relation_* fields.
Also, what is the datid/datname about? It's empty, not mentioned in the sgml docs,
and we already have databases_current ...
The message [2] from 10/08 says:
I did remove parts of the progress reporting for now since it can't be
used from the dynamic backgroundworker it seems. I need to regroup
and figure out a better way there, but I wanted to address your above
find sooner rather than wait for that.
And I guess that would explain why some of the fields are not updated.
But then the later patch versions seem to imply there are no outstanding
issues / missing stuff.
regards
[1]: /messages/by-id/CA226DE1-DC9A-4675-A83C-32270C473F0B@yesql.se
[2]: /messages/by-id/DD25705F-E75F-4DCA-B49A-5578F4F55D94@yesql.se
--
Tomas Vondra
On Tue, Nov 26, 2024 at 11:07:12PM +0100, Tomas Vondra wrote:
I spent a bit more time doing some testing on the last version of the
patch from [1]. And I ran into this assert in PostmasterStateMachine()
when stopping the cluster:
/* All types should be included in targetMask or remainMask */
Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
At first I was puzzled as this happens on every shutdown, but that's
because these checks were introduced by a78af0427015 a week ago. So it's
more a matter of rebasing.
Looks like the CI is not really happy about this point.. (Please make
sure to refresh the patch status after a review.)
--
Michael
Here's a rebased version of the patch series, addressing the issues I've
pointed out in the last round of reviews. I've kept the changes in
separate patches for clarity, but it should be squashed into a single
patch in the end.
1) v20250309-0001-Online-enabling-and-disabling-of-data-chec.patch
------------------------------------------------------------------
Original patch, rebased, resolving merge conflicts.
2) v20250309-0002-simple-post-rebase-fixes.patch
------------------------------------------------
A couple minor fixes, addressing test failures due to stuff committed
since the previous patch version. Mostly mechanical; the main change is
that I don't think the pgstat_bestart() call is necessary. Or is it?
3) v20250309-0003-sync-the-data_checksums-GUC-with-the-local.patch
------------------------------------------------------------------
This is the main change, fixing failures in 002_actions.pl - the short
version is that test does "-C data_checksums", but AFAICS that does not
quite work because it does not call show_data_checksums() that early,
and instead just grabs the variable backing the GUC. Which may be out of
sync, so this patch fixes that by updating them both.
That fixes the issue, but isn't it a bit strange that we now have three places
tracking the state of data checksums? We have data_checksum_version in
the control file, and then data_checksums and LocalDataChecksumVersion
in the backends.
Would it be possible to "unify" the latter two? That would also mean we
don't have the duplicate constants like PG_DATA_CHECKSUM_VERSION and
DATA_CHECKSUM_VERSION. Or why do we need that?
4) v20250309-0004-make-progress-reporting-work.patch
----------------------------------------------------
The progress reporting got "mostly disabled" in an earlier version, due
to issues with the bgworkers. AFAICS the issue is the "status" row can
be updated only by a single process, which does not quite work with the
launcher + per-db workers architecture.
I've considered a couple different approaches:
a) updating the status only from the launcher
This is mostly what CREATE INDEX does with parallel builds, and there
it's mostly sufficient. But for checksums it'd mean we only have the
number of databases to process/done, and that seems unsatisfactory,
considering large clusters often have only a single large database. So
not good enough, IMHO.
b) allowing workers to update the status row, created by the launcher
I guess this would be better, we'd know the relations/blocks counts. And
I haven't tried coding this, but there would need to be some locking so
that the workers don't overwrite increments from other workers, etc.
But I don't think it'd work nicely with parallel per-db workers (which
we don't do now, but we might).
c) having one status entry per worker
I ended up doing this, it didn't require any changes to the progress
infrastructure, and it will work naturally even with multiple workers.
There will always be one row for launcher status (which only has the
number of databases total/done), and then one row per worker, with
database-level info (datid, datname, #relations, #blocks).
I removed the "DONE" phase, because that's right before the launcher
exits, and I don't think we have that for similar cases. And I added
"waiting on checkpoint" state, because that's often a long wait when the
launcher seems to do nothing, so it seems useful to communicate the
reason for that wait.
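With this, a plain select on the view should return one row for the launcher
(database counts only) plus one row per active worker, queried along these
lines (a sketch reusing the column names from the output quoted earlier):
    -- launcher row carries database totals, worker rows per-relation progress
    SELECT pid, phase, datname,
           relation_current_blocks_processed, relation_current_blocks
      FROM pg_stat_progress_data_checksums;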
5) v20250309-0005-update-docs.patch
-----------------------------------
Minor tweaks to the docs, to reflect the progress reporting changes, and
also some corrections (no resume after restart, ...).
So far this passed all my tests - both check-world and stress testing
(no failures / assert crashes, ...). One thing that puzzles me is I
wasn't able to reproduce the failures reported in [1] - not even with
just the rebase + minimal fixes (0001 + 0002). My best theory is this
is somehow machine-specific, and my laptop is too slow or something.
I'll try with the machine I used before once it gets available.
regards
[1]: /messages/by-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e@vondra.me
--
Tomas Vondra