pgsql: Allow on-line enabling and disabling of data checksums

Started by Magnus Hagander about 8 years ago, 5 messages, committers
#1 Magnus Hagander
magnus@hagander.net

Allow on-line enabling and disabling of data checksums

This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).

Enabling checksums starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
restarted.
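
As an illustration only, the enable flow described above can be modelled as a small state machine. This is a minimal sketch, not the code in checksumhelper.c: the `Cluster` class, the state names and the `rewrite_page` callback are all hypothetical stand-ins chosen for the example.

```python
from enum import Enum


class ChecksumState(Enum):
    # State names are assumptions made for this sketch.
    OFF = "off"
    IN_PROGRESS = "inprogress"
    ON = "on"


class Cluster:
    """Toy model of a cluster whose pages are checksummed one by one."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.state = ChecksumState.OFF

    def enable_checksums(self, rewrite_page):
        """Checksum every page; switch to ON only if all of them succeed.

        `rewrite_page` is a hypothetical callable that checksums one page
        and raises an exception on failure.
        """
        self.state = ChecksumState.IN_PROGRESS
        try:
            for page in self.pages:
                rewrite_page(page)
        except Exception:
            # Any failure reverts to off; the run has to be started again.
            self.state = ChecksumState.OFF
            return self.state
        self.state = ChecksumState.ON
        return self.state
```

A failed run leaves the model in `OFF`, and calling `enable_checksums` again starts over from the first page, which mirrors the "has to be restarted" behaviour in the commit message.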

This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.
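
To make the replication point concrete, here is a hedged sketch of how a standby could follow the primary's checksum state by replaying state-change records in order. `ChecksumStateRecord` and `replay` are hypothetical names for this example and do not reflect the actual WAL record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecksumStateRecord:
    """Hypothetical stand-in for the new WAL record; carries only a state."""
    new_state: str  # "off", "inprogress" or "on" (names assumed)


def replay(records, initial_state="off"):
    """Apply state-change records in order, as a standby would on replay."""
    state = initial_state
    for record in records:
        if isinstance(record, ChecksumStateRecord):
            state = record.new_state
        # Any other record type is ignored in this sketch.
    return state
```

Because the standby only ever applies what the primary logged, a failed run on the primary (an "inprogress" record followed by an "off" record) leaves the standby at "off" as well.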

Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/1fde38beaa0c3e66c340efc7cc0dc272d6254bb0

Modified Files
--------------
doc/src/sgml/func.sgml | 65 ++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/initdb.sgml | 6 +-
doc/src/sgml/ref/pg_verify_checksums.sgml | 112 +++
doc/src/sgml/reference.sgml | 1 +
doc/src/sgml/wal.sgml | 81 ++
src/backend/access/rmgrdesc/xlogdesc.c | 16 +
src/backend/access/transam/xlog.c | 124 +++-
src/backend/access/transam/xlogfuncs.c | 59 ++
src/backend/catalog/system_views.sql | 5 +
src/backend/postmaster/Makefile | 5 +-
src/backend/postmaster/bgworker.c | 7 +
src/backend/postmaster/checksumhelper.c | 855 ++++++++++++++++++++++
src/backend/postmaster/pgstat.c | 5 +
src/backend/replication/basebackup.c | 2 +-
src/backend/replication/logical/decode.c | 1 +
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/page/README | 3 +-
src/backend/storage/page/bufpage.c | 6 +-
src/backend/utils/misc/guc.c | 37 +-
src/bin/Makefile | 1 +
src/bin/pg_upgrade/controldata.c | 9 +
src/bin/pg_upgrade/pg_upgrade.h | 2 +-
src/bin/pg_verify_checksums/.gitignore | 1 +
src/bin/pg_verify_checksums/Makefile | 36 +
src/bin/pg_verify_checksums/pg_verify_checksums.c | 315 ++++++++
src/include/access/xlog.h | 10 +-
src/include/access/xlog_internal.h | 7 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_control.h | 1 +
src/include/catalog/pg_proc.h | 5 +
src/include/pgstat.h | 4 +-
src/include/postmaster/checksumhelper.h | 31 +
src/include/storage/bufpage.h | 1 +
src/include/storage/checksum.h | 7 +
src/test/Makefile | 3 +-
src/test/checksum/.gitignore | 2 +
src/test/checksum/Makefile | 24 +
src/test/checksum/README | 22 +
src/test/checksum/t/001_standby_checksum.pl | 101 +++
src/test/isolation/expected/checksum_cancel.out | 27 +
src/test/isolation/expected/checksum_enable.out | 27 +
src/test/isolation/isolation_schedule | 4 +
src/test/isolation/specs/checksum_cancel.spec | 47 ++
src/test/isolation/specs/checksum_enable.spec | 70 ++
45 files changed, 2118 insertions(+), 34 deletions(-)

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#1)
Re: pgsql: Allow on-line enabling and disabling of data checksums

On Fri, Apr 6, 2018 at 5:35 AM, Magnus Hagander <magnus@hagander.net> wrote:

Allow on-line enabling and disabling of data checksums

This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).

Enabling checksums starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
restarted.

This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.

This has broken the buildfarm's cross-version upgrade testing (yes, we
do it for same-version upgrade as well as previous version upgrade).

For now I have fixed crake by adding code to disable checksums in the
saved cluster. That at least will send crake green. Not sure if it's
the fix we want, though. Maybe we should test if checksums are enabled
on the upgraded cluster and if so enable them on the new cluster via
initdb. When we decide on the best fix I will put out a new release.
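
The alternative floated above (check the old cluster and initialise the new one to match) could look roughly like the following. This is a sketch under assumptions, not the buildfarm module's code: it parses the "Data page checksum version" line that pg_controldata prints and adds initdb's `--data-checksums` switch when that version is nonzero. The helper names are hypothetical, and treating every nonzero version as "enabled" is an assumption.

```python
import re


def checksums_enabled(controldata_output):
    """Return True if pg_controldata output reports a nonzero checksum version."""
    match = re.search(
        r"^Data page checksum version:\s*(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Data page checksum version' line found")
    # Assumption: any nonzero version counts as enabled.
    return int(match.group(1)) != 0


def initdb_args(new_datadir, old_controldata_output):
    """Build an initdb command line matching the old cluster's checksum setting."""
    args = ["initdb", "-D", new_datadir]
    if checksums_enabled(old_controldata_output):
        args.append("--data-checksums")
    return args
```

In practice the text passed in would be the captured output of running pg_controldata against the saved data directory, and the returned list could be handed to `subprocess.run`.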

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#2)
Re: pgsql: Allow on-line enabling and disabling of data checksums

On Fri, Apr 6, 2018 at 2:03 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:

This has broken the buildfarm's cross-version upgrade testing (yes, we
do it for same-version upgrade as well as previous version upgrade).

For now I have fixed crake by adding code to disable checksums in the
saved cluster. That at least will send crake green. Not sure if it's
the fix we want, though. Maybe we should test if checksums are enabled
on the upgraded cluster and if so enable them on the new cluster via
initdb. When we decide on the best fix I will put out a new release.

I'm unsure why it actually leaves the cluster with checksums on. Which
step leaves it with checksums on? The last step of the checksum-specific
tests actually turns them *off* again. At which point in the series does it
actually pick up the cluster to upgrade?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

#4 Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#3)
Re: pgsql: Allow on-line enabling and disabling of data checksums

On Fri, Apr 6, 2018 at 7:07 PM, Magnus Hagander <magnus@hagander.net> wrote:

I'm unsure why it actually leaves the cluster with checksums on. Which
step leaves it with checksums on? The last step of the checksum-specific
tests actually turns them *off* again. At which point in the series does it
actually pick up the cluster to upgrade?

At the time the "old" datadir is copied to be upgraded by this module,
the following test sets have been run against it on crake:

'InstallCheck-C',
'RedisFDW-installcheck-C',
'FileTextArrayFDW-installcheck-C',
'IsolationCheck',
'PLCheck-C',
'ContribCheck-C',
'TestModulesCheck-C',

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#4)
Re: pgsql: Allow on-line enabling and disabling of data checksums

On Fri, Apr 6, 2018 at 12:41 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:

At the time the "old" datadir is copied to be upgraded by this module,
the following test sets have been run against it on crake:

'InstallCheck-C',
'RedisFDW-installcheck-C',
'FileTextArrayFDW-installcheck-C',
'IsolationCheck',
'PLCheck-C',
'ContribCheck-C',
'TestModulesCheck-C',

TestModulesCheck-C runs "make check" in src/test, right?

Can I actually see the output from that somehow? The buildfarm link seems
to only show TestModulesInstallCheck-C. And that one doesn't seem to run
the checksum checks at all. From the logs I can't even figure out where
they run at all, except that the *isolation checker* runs them -- that
seems wrong.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>