Enabling checksums on a streaming replica

Started by Brad Nicholsonalmost 7 years ago6 messagesgeneral

bradn@ca.ibm.com

almost 7 years ago

I'm wondering about the validity of using the pg_checksums utility to
enable checksum's on a streaming replication standby, and then promoting
that standby as a way to enable checksums on existing clusters.

I've tested the process out, and it "works" (by works I mean doesn't blow
up or log any errors). But this seems far enough outside of supported
territory that I'm curious what others think.

The process is:

- Primary with checksums off
- create a streaming replica off that primary
- stop the secondary
- enable checksums on the secondary with the pg_checksums utility
- start the replica
- promote the replica

I've thrown load at it while the it was replicating from checksums off ->
checksums on, promoted it, and verified the checksums after with
pg_checksums.

Any thoughts on this approach?

Thanks,
Brad

Grigory Smolkin

g.smolkin@postgrespro.ru

almost 7 years ago

In reply to: Brad Nicholson (#1)

Re: Enabling checksums on a streaming replica

Hello!

Very interesting idea, but what about full page writes, that comes from
master?
Can they be a problem?

On 6/26/19 6:32 PM, Brad Nicholson wrote:

I'm wondering about the validity of using the pg_checksums utility to
enable checksum's on a streaming replication standby, and then
promoting that standby as a way to enable checksums on existing clusters.

I've tested the process out, and it "works" (by works I mean doesn't
blow up or log any errors). But this seems far enough outside of
supported territory that I'm curious what others think.

The process is:

- Primary with checksums off
- create a streaming replica off that primary
- stop the secondary
- enable checksums on the secondary with the pg_checksums utility
- start the replica
- promote the replica

I've thrown load at it while the it was replicating from checksums off
-> checksums on, promoted it, and verified the checksums after with
pg_checksums.

Any thoughts on this approach?

Thanks,
Brad

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Brad Nicholson

bradn@ca.ibm.com

almost 7 years ago

In reply to: Grigory Smolkin (#2)

Re: Re: Enabling checksums on a streaming replica

Grigory Smolkin <g.smolkin@postgrespro.ru> wrote on 06/26/2019 12:37:10 PM:

From: Grigory Smolkin <g.smolkin@postgrespro.ru>
To: pgsql-general@lists.postgresql.org
Date: 06/26/2019 12:37 PM
Subject: [EXTERNAL] Re: Enabling checksums on a streaming replica

Hello!
Very interesting idea, but what about full page writes, that comes
from master?
Can they be a problem?

In testing, it doesn't appear to matter. I've ensured that I've generated
some full page writes (confirmed via pg_waldump), and those apply fine.

The one thing I'm not sure of, when verifying checksums via pg_checksums I
see blocks being skipped. I'm not sure what or why it is skipping blocks,
and if that indicates a problem or not.

pg_checksums -c /data/10_repl/

Checksum scan completed
Files scanned: 1530
Blocks scanned: 100782
Blocks skipped: 3
Bad checksums: 0
Data checksum version: 1

Brad

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Brad Nicholson (#3)

Re: Re: Enabling checksums on a streaming replica

On Wed, Jun 26, 2019 at 02:08:19PM -0400, Brad Nicholson wrote:

In testing, it doesn't appear to matter. I've ensured that I've generated
some full page writes (confirmed via pg_waldump), and those apply
fine.

Full pages writes are first written from shared buffers to WAL, where
their checksums does not actually apply. When the WAL records are
read an applied, a full page image is recovered in shared buffers.
The checksum of the page would get updated once the shared buffer page
used is evicted and written back to disk.

The one thing I'm not sure of, when verifying checksums via pg_checksums I
see blocks being skipped. I'm not sure what or why it is skipping blocks,
and if that indicates a problem or not.

Some files can get skipped entirely, but all the blocks of a file
are basically checked. Well, except for new pages of course which
have no checksums to look at yet.
--
Michael

Brad Nicholson

bradn@ca.ibm.com

almost 7 years ago

In reply to: Michael Paquier (#4)

Re: Re: Re: Enabling checksums on a streaming replica

Michael Paquier <michael@paquier.xyz> wrote on 06/27/2019 11:17:10 PM:

From: Michael Paquier <michael@paquier.xyz>
To: Brad Nicholson <bradn@ca.ibm.com>
Cc: Grigory Smolkin <g.smolkin@postgrespro.ru>, pgsql-
general@lists.postgresql.org
Date: 06/27/2019 11:17 PM
Subject: [EXTERNAL] Re: Re: Enabling checksums on a streaming replica

On Wed, Jun 26, 2019 at 02:08:19PM -0400, Brad Nicholson wrote:

In testing, it doesn't appear to matter. I've ensured that I've

generated

some full page writes (confirmed via pg_waldump), and those apply
fine.

Full pages writes are first written from shared buffers to WAL, where
their checksums does not actually apply. When the WAL records are
read an applied, a full page image is recovered in shared buffers.
The checksum of the page would get updated once the shared buffer page
used is evicted and written back to disk.

So if all the checksums are being recalculated on the replica, this
approach should be relatively safe, should it not?

Assuming pg_checksums is doing the right thing (and it looks to me like it
should be).

Brad.

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Brad Nicholson (#5)

Re: Re: Re: Enabling checksums on a streaming replica

On Fri, Jun 28, 2019 at 07:43:30AM -0400, Brad Nicholson wrote:

So if all the checksums are being recalculated on the replica, this
approach should be relatively safe, should it not?

Yep.

Assuming pg_checksums is doing the right thing (and it looks to me like it
should be).

The constraint that a cluster needs to be cleanly shut down to be able
to enable checksums with pg_checksums is the actual deal here. After
that of course comes the WAL retention on the primary or in the WAL
archives that a standby would need again to catch up while it was
offline.
--
Michael