Add checksums without --initdb

Started by David Christensen over 10 years ago, 7 messages
#1 David Christensen
david@endpoint.com

So on #postgresql, I was musing about methods of getting checksums enabled/disabled without requiring a separate initdb step and minimizing the downtime required to get such functionality enabled.

What about adapting pg_basebackup to add the following options:

-k|--checksums - build the replica with checksums enabled.
-K|--no-checksums - build the replica with checksums disabled.

The way this would work would be to have pg_basebackup's ReceiveAndUnpackTarFile() calculate and/or remove the checksums from each heap page as it is streamed and update the pg_control file to reflect the new checksums setting. After this checksum-enabled replica is created, then it could stream/process WAL and get caught up, then the user fails over to their brand-spanking-new checksum-enabled database. Obviously this would be a bit slower to calculate each page’s checksum than it would be just to write the data out from the tar stream, but it seems to me like this is a single point where the whole database would need to be processed page-by-page as it is.
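As a rough illustration of the per-page work described above, here is a minimal Python sketch, not pg_basebackup code. It assumes the default 8192-byte block size, a little-endian platform, and the 2-byte pd_checksum field sitting right after the 8-byte pd_lsn in the page header. The names stamp_pages and placeholder_checksum are hypothetical, and the checksum function is a stand-in (a truncated CRC32), not PostgreSQL's actual algorithm.

```python
import struct
import zlib

BLCKSZ = 8192            # PostgreSQL's default block size (assumed here)
PD_CHECKSUM_OFFSET = 8   # pd_checksum follows the 8-byte pd_lsn in the page header


def placeholder_checksum(page: bytes, blkno: int) -> int:
    """Stand-in 16-bit checksum; NOT PostgreSQL's real algorithm."""
    # Zero the checksum field first so the result does not depend on its old value.
    body = page[:PD_CHECKSUM_OFFSET] + b"\x00\x00" + page[PD_CHECKSUM_OFFSET + 2:]
    # Mix in the block number (here: position within this file) so that a page
    # written to the wrong block would not validate.
    return (zlib.crc32(body) ^ blkno) & 0xFFFF


def stamp_pages(data: bytes, enable: bool) -> bytes:
    """Return a copy of a relation file image with checksums set or cleared."""
    if len(data) % BLCKSZ != 0:
        raise ValueError("relation file is not a whole number of pages")
    out = bytearray(data)
    for blkno in range(len(data) // BLCKSZ):
        start = blkno * BLCKSZ
        page = bytes(out[start:start + BLCKSZ])
        if page == bytes(BLCKSZ):
            continue  # all-zero (never initialized) page: leave it untouched
        value = placeholder_checksum(page, blkno) if enable else 0
        # Little-endian uint16 is an assumption about the source platform.
        struct.pack_into("<H", out, start + PD_CHECKSUM_OFFSET, value)
    return bytes(out)
```

A real implementation would also have to handle pages that straddle tar stream chunk boundaries and block numbers that continue across 1 GB segment files, both of which this sketch ignores.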

Possible concerns here are whether checksums are included in WAL full_page_writes or whether they are independently calculated; if the latter, I think we’d be fine. If checksums are all handled at the layer below WAL, then any streamed/processed changes should be fine to get us to the point where we could come up as a master.

We’d also need to be careful to add checksums only to heap files, but that could be handled via the filename prefixes (base|global). (I’m not sure whether the relation forks are in standard Page format; if not, we could exclude those as well.) Obviously this bakes quite a bit of awareness of cluster structure into pg_basebackup and may tie it more strongly to a specific major version, but it seems to me the tradeoff would be worth it for anyone who wants that option, and the existing code paths could remain to preserve the current behavior.
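To make the filename-based filtering concrete, here is a small Python sketch of how paths in the tar stream might be classified. It assumes the usual on-disk naming (a numeric relfilenode, an optional _fsm/_vm/_init fork suffix, an optional .N segment suffix) and only looks under base/ and global/; tablespaces are ignored. The function name classify is hypothetical.

```python
import re

# Matches e.g. base/16384/16385, base/16384/16385_fsm, base/16384/16385.1, global/1262
RELATION_FILE = re.compile(
    r"^(?:base/\d+|global)/"         # per-database or shared directory
    r"(?P<relfilenode>\d+)"          # numeric file name
    r"(?:_(?P<fork>fsm|vm|init))?"   # optional fork suffix; the main fork has none
    r"(?:\.(?P<segment>\d+))?$"      # optional segment number for large relations
)


def classify(path):
    """Return (relfilenode, fork, segment) for relation files, else None."""
    m = RELATION_FILE.match(path)
    if m is None:
        return None  # pg_control, PG_VERSION, temp files, etc.
    return (int(m["relfilenode"]), m["fork"] or "main", int(m["segment"] or 0))
```

For example, classify("base/16384/16385_vm.2") yields (16385, "vm", 2), while classify("global/pg_control") yields None. Whether a given fork should actually be checksummed is a separate policy decision that this sketch leaves to the caller.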

Andres suggested a separate tool that would basically rewrite the existing data directory heap files in place, which I can also see a use case for, but I also think there’s some benefit to be found in having it happen while the replica is being streamed/built.

Ideas/thoughts/reasons this wouldn’t work?

David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Josh Berkus
josh@agliodbs.com
In reply to: David Christensen (#1)
Re: Add checksums without --initdb

On 07/02/2015 12:39 PM, David Christensen wrote:

So on #postgresql, I was musing about methods of getting checksums enabled/disabled without requiring a separate initdb step and minimizing the downtime required to get such functionality enabled.

Funny, I was thinking just yesterday about how nobody is using checksums
because of the dump/reload requirement. And that OS packagers aren't
packaging PostgreSQL with checksums in the initial initdb because of the
upgrade barrier.

I'm not so sure about your solution, but it seems like we need *something*.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: David Christensen (#1)
Re: Add checksums without --initdb

On 07/02/2015 10:39 PM, David Christensen wrote:

Possible concerns here are whether checksums are included in WAL
full_page_writes or if they are independently calculated; if the
latter I think we’d be fine. If checksums are all handled at the
layer below WAL then any streamed/processed changes should be fine to
get us to the point where we could come up as a master.

It's not full_page_writes that's the problem, but the server would not
WAL-log hint bit updates, unless you also have wal_log_hints enabled.
But that would be simple to just check - wal_log_hints can be enabled
with a server restart so that's not too onerous.

Andres suggested a separate tool that would basically rewrite the
existing data directory heap files in place, which I can also see a
use case for, but I also think there’s some benefit to be found in
having it happen while the replica is being streamed/built.

Ideas/thoughts/reasons this wouldn’t work?

You probably could make this work, but it seems like a pretty
complicated way to enable checksums. There are also interesting
corner cases with replication: is it possible to connect a streaming
replica that's been restored from the checksums-enabled backup to a
checksums-disabled master? The enable-in-place approach seems a lot more
straightforward to me. In a nutshell:

Add an "enabling-checksums" mode to the server where it calculates
checksums for anything it writes, but doesn't check or complain about
incorrect checksums on reads. Put the server into that mode, and then
have a background process that reads through all data in the cluster,
calculates the checksum for every page, and writes all the data back.
Once that's completed, checksums can be fully enabled.

- Heikki


#4 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#3)
Re: Add checksums without --initdb

On 2015-07-02 22:53:40 +0300, Heikki Linnakangas wrote:

On 07/02/2015 10:39 PM, David Christensen wrote:

Possible concerns here are whether checksums are included in WAL
full_page_writes or if they are independently calculated; if the
latter I think we’d be fine. If checksums are all handled at the
layer below WAL then any streamed/processed changes should be fine to
get us to the point where we could come up as a master.

It's not full_page_writes that's the problem, but the server would not
WAL-log hint bit updates, unless you also have wal_log_hints enabled. But
that would be simple to just check - wal_log_hints can be enabled with a
server restart so that's not too onerous.

Hm. Since hint bit writes in that case aren't transported to the
standby I don't see the problem right now? The effect of a hint bit
write on the primary won't have an effect on the standby. And on
standbys themselves we "solve" that problem by not dirtying the buffer
on hint bit writes...

Andres suggested a separate tool that would basically rewrite the
existing data directory heap files in place, which I can also see a
use case for, but I also think there’s some benefit to be found in
having it happen while the replica is being streamed/built.

Ideas/thoughts/reasons this wouldn’t work?

We don't have, afaik, an easy way to know what kind of page format
individual relations in the base backup are going to have. I think
you'll need access to the catalog for that.

Now, we don't have many non-standard pages that aren't recognizable by
their fork. So you could possibly somehow get away with it. But I don't
think that'll be very robust.

Add an "enabling-checksums" mode to the server where it calculates checksums
for anything it writes, but doesn't check or complain about incorrect
checksums on reads. Put the server into that mode, and then have a
background process that reads through all data in the cluster, calculates
the checksum for every page, and writes all the data back. Once that's
completed, checksums can be fully enabled.

You'd need, afaics, a bgworker that connects to every database to read
pg_class, to figure out what type of page a relfilenode has. And this
probably should call back into the relevant AM or such.

Generally having an infrastructure to do this would be very, very good:
It'll allow us to clean up old data in the background. We could much
more realistically reclaim infomask fields etc.

Sounds like it rather conceivably could be integrated into
autovacuum. The launcher recognizes that a database isn't yet in the new
format/checksum state, launches a worker, etc.

Greetings,

Andres Freund


#5 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#4)
Re: Add checksums without --initdb

On 07/02/2015 11:28 PM, Andres Freund wrote:

On 2015-07-02 22:53:40 +0300, Heikki Linnakangas wrote:

Add an "enabling-checksums" mode to the server where it calculates checksums
for anything it writes, but doesn't check or complain about incorrect
checksums on reads. Put the server into that mode, and then have a
background process that reads through all data in the cluster, calculates
the checksum for every page, and writes all the data back. Once that's
completed, checksums can be fully enabled.

You'd need, afaics, a bgworker that connects to every database to read
pg_class, to figure out what type of page a relfilenode has. And this
probably should call back into the relevant AM or such.

Nah, we already assume that every relation data file follows the
standard page format, at least enough to have the checksum field at the
right location. See FlushBuffer() - it just unconditionally calculates
the checksum before writing out the page. (I'm not totally happy about
that, but that ship has sailed)
- Heikki


#6 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#5)
Re: Add checksums without --initdb

On 2015-07-02 23:43:17 +0300, Heikki Linnakangas wrote:

You'd need, afaics, a bgworker that connects to every database to read
pg_class, to figure out what type of page a relfilenode has. And this
probably should call back into the relevant AM or such.

Nah, we already assume that every relation data file follows the standard
page format, at least enough to have the checksum field at the right
location. See FlushBuffer() - it just unconditionally calculates the
checksum before writing out the page. (I'm not totally happy about that, but
that ship has sailed)

Ugh, I'd forgotten about that.


#7 David Christensen
david@endpoint.com
In reply to: Heikki Linnakangas (#5)
Re: Add checksums without --initdb

On Jul 2, 2015, at 3:43 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/02/2015 11:28 PM, Andres Freund wrote:

On 2015-07-02 22:53:40 +0300, Heikki Linnakangas wrote:

Add an "enabling-checksums" mode to the server where it calculates checksums
for anything it writes, but doesn't check or complain about incorrect
checksums on reads. Put the server into that mode, and then have a
background process that reads through all data in the cluster, calculates
the checksum for every page, and writes all the data back. Once that's
completed, checksums can be fully enabled.

You'd need, afaics, a bgworker that connects to every database to read
pg_class, to figure out what type of page a relfilenode has. And this
probably should call back into the relevant AM or such.

Nah, we already assume that every relation data file follows the standard page format, at least enough to have the checksum field at the right location. See FlushBuffer() - it just unconditionally calculates the checksum before writing out the page. (I'm not totally happy about that, but that ship has sailed)
- Heikki

So thinking some more about the necessary design to support enabling checksums post-initdb, what about the following?:

Introduce a new field in pg_control, data_checksum_state, with values 0 (disabled), 1 (enabling in progress), and 2 (enabled). This could be set via (say) a pg_upgrade flag, --enable-checksums, when creating a new cluster, or by a standalone program that adjusts that field in pg_control. Checksum enforcement would depend on that setting: 0 neither writes nor verifies checksums, 1 computes checksums on buffer write but ignores them on read, and 2 is the normal mode that enforces them on both read and write. Disabling checksums could be done with this tool as well, and would trivially just cause the server to ignore the checksums (or alternatively set them to 0 on page write, depending on whether we think it matters).
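The three states and what each one implies for buffer reads and writes can be summarized in a short Python sketch. The enum and helper names here are hypothetical and exist only to illustrate the proposed semantics; nothing like this exists in the server today.

```python
from enum import IntEnum


class DataChecksumState(IntEnum):
    """Proposed pg_control.data_checksum_state values (names are illustrative)."""
    DISABLED = 0   # checksums neither written nor verified
    ENABLING = 1   # checksums written, but not verified on read
    ENABLED = 2    # checksums written and verified


def compute_on_write(state: DataChecksumState) -> bool:
    """Should a checksum be calculated when a buffer is written out?"""
    return state in (DataChecksumState.ENABLING, DataChecksumState.ENABLED)


def verify_on_read(state: DataChecksumState) -> bool:
    """Should a checksum mismatch be reported when a page is read in?"""
    return state is DataChecksumState.ENABLED
```

The asymmetry in the ENABLING state is the point of the design: pages written during the transition are already valid, while pages not yet rewritten cannot raise false alarms.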

Add new catalog fields pg_database.dathaschecksum, pg_class.relhaschecksum; initially set to 0, or 1 if checksums were enabled at initdb time.

Augment autovacuum to check if we are currently enabling checksums based on the value in pg_control; if so, loop over any database with !pg_database.dathaschecksum.

For each such database, look for relations with !pg_class.relhaschecksum; for each one found, read/dirty/write (however) each block to force a checksum to be written out for every page. Once every page of a relation has been checksummed, update relhaschecksum = t. When no relations remain, set pg_database.dathaschecksum = t. (There may need to be some additional considerations for storing the checksum state of global relations, or of anything else using the standard page format that lives outside a specific database, i.e., all shared catalogs and quite possibly some things I haven't considered yet.)

If the data_checksum_state is "enabling" and no databases remain to be converted, then we can set data_checksum_state to "enabled"; everything then works as expected for the normal enforcing state.
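The bookkeeping for that conversion pass might look roughly like the following Python simulation. It models the cluster as plain dictionaries rather than real catalogs, ignores shared catalogs and concurrency, and uses a hypothetical rewrite_relation callback in place of the read/dirty/write of each block.

```python
def run_checksum_pass(cluster, rewrite_relation):
    """Simulate one full conversion pass; returns the resulting state.

    cluster looks like:
      {"state": "enabling",
       "databases": {dbname: {"dathaschecksum": bool,
                              "relations": {relname: relhaschecksum}}}}
    """
    if cluster["state"] != "enabling":
        return cluster["state"]  # nothing to do unless we are mid-transition

    for db in cluster["databases"].values():
        if db["dathaschecksum"]:
            continue  # this database was already converted
        for relname in list(db["relations"]):
            if not db["relations"][relname]:
                rewrite_relation(relname)          # force every page to be rewritten
                db["relations"][relname] = True    # relhaschecksum = t
        db["dathaschecksum"] = True                # no unconverted relations remain

    cluster["state"] = "enabled"  # every database is done: start enforcing on read
    return cluster["state"]
```

In a real server the per-relation and per-database flags would be what makes the pass restartable: after a crash or shutdown, a later pass would skip whatever is already marked as done.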

External programs needing to be adjusted:
- pg_resetxlog -- preserve the data_checksum_state
- pg_controldata -- display the data_checksum_state
- pg_upgrade -- add an --enable-checksums flag so the new cluster and its data pages come up with checksums enabled. May need some adjustments for the data_checksum_state field

Possible new tool:
- pg_enablechecksums -- basic tool to set the data_checksum_state flag of pg_control

Other thoughts:
Do we need a periodic CRC-scanning background worker just to check buffers periodically?
- if so, does this cause any interference with frozen relations?

What additional changes would be required or what wrinkles would we have to work out?

David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171
