[DESIGN] Incremental checksums
pgsql-hackers,
So I’ve put some time into a design for the incremental checksum feature and wanted to get some feedback from the group:
* Incremental Checksums
PostgreSQL users should have a way of upgrading their cluster to use data checksums without having to do a costly pg_dump/pg_restore; in particular, checksums should be able to be enabled/disabled at will, with the database enforcing the logic of whether the pages considered for a given database are valid.
One approach considered was adding flags to pg_upgrade to set up the new cluster to use checksums where the old one did not (or optionally turning them off). That is a nice tool to have, but it cannot keep the database online while the cluster goes through the initial checksum process; the design below is meant to support exactly that.
In order to support the idea of incremental checksums, this design adds the following things:
** pg_control:
Keep "data_checksum_version", but have it indicate *only* the algorithm version for checksums. i.e., it's no longer used for the data_checksum enabled/disabled state.
Add "data_checksum_state", an enum with multiple states: "disabled", "enabling", "enforcing" (perhaps "revalidating" too; something to indicate that we are reprocessing a database that purports to have been completely checksummed already)
An explanation of the states, as well as the behavior of checksums in each:
- disabled => not in a checksum cycle; no read validation, no checksums written. This is the current behavior for Postgres *without* checksums.
- enabling => in a checksum cycle; no read validation, write checksums. Any page that gets written to disk will have a valid checksum. This state is required when transitioning a cluster which has never had checksums, since read validation would otherwise fail on pages whose checksum fields are uninitialized.
- enforcing => not in a checksum cycle; read validation, write checksums. This is the current behavior of Postgres *with* checksums.
(caveat: I'm not certain the following state is needed (and the current version of this patch doesn't have it)):
- revalidating => in a checksum cycle; read validation, write checksums. The difference between this and "enabling" is that we care if page reads fail validation, since by definition every page should already have a valid checksum, and we want to verify that.
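A minimal sketch of the proposed state machine (the enum and function names here are invented for illustration; the patch's actual symbols may differ), showing the read-validation / write-checksum behavior of each state described above:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical spellings; the actual pg_control enum is part of the
 * proposal, not committed code. */
typedef enum DataChecksumState
{
    DATA_CHECKSUMS_DISABLED,
    DATA_CHECKSUMS_ENABLING,
    DATA_CHECKSUMS_ENFORCING,
    DATA_CHECKSUMS_REVALIDATING     /* tentative fourth state */
} DataChecksumState;

/* Should a page's checksum be verified when it is read in? */
static bool
checksum_verify_on_read(DataChecksumState s)
{
    return s == DATA_CHECKSUMS_ENFORCING || s == DATA_CHECKSUMS_REVALIDATING;
}

/* Should a checksum be computed when a page is flushed to disk? */
static bool
checksum_write_on_flush(DataChecksumState s)
{
    return s != DATA_CHECKSUMS_DISABLED;
}
```

The key asymmetry is "enabling": writes are checksummed but reads are not validated, since pre-existing pages have never been checksummed.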
Add "data_checksum_cycle", a counter that gets incremented with every checksum cycle change. This is used as a flag to verify when new checksum actions take place, for instance if we wanted to upgrade/change the checksum algorithm, or if we just want to support periodic checksum validation.
This variable will be compared against new values in the system tables to keep track of which relations still need to be checksummed in the cluster.
** pg_database:
Add a field "datlastchecksum" which will be the last checksum cycle which has completed for all relations in that database.
** pg_class:
Add a field "rellastchecksum" which stores the last successful checksum cycle for each relation.
** The checksum bgworker:
Something needs to proactively checksum any relations which still need to be processed, and that something is the checksum bgworker. It will operate similarly to the autovacuum daemons; in fact, this initial pass hooks into the autovacuum launcher, due to the similarities in catalog-reading functionality and to balance the work against other maintenance activity.
If autovacuum does not need to do any vacuuming work, it will check whether the cluster has requested a checksum cycle, i.e. whether the state is "enabling" (or "revalidating"). If so, it reads the current value of the data_checksum_cycle counter and looks for any database with "datlastchecksum < data_checksum_cycle".
When all databases have "datlastchecksum == data_checksum_cycle", we initiate checksumming of the cluster's global heap files. Once those have been checksummed, the checksum cycle is complete: we change pg_control's "data_checksum_state" to "enforcing" and consider things fully up-to-date.
If it finds a database needing work, it iterates through that database's relations looking for "rellastchecksum < data_checksum_cycle". If it finds none (i.e., every record has rellastchecksum == data_checksum_cycle) then it marks the containing database as up-to-date by updating "datlastchecksum = data_checksum_cycle".
For any relation in the database which is not up-to-date, it starts an actual worker to handle the checksum process for that table. Since the cluster state is already either "enabling" or "revalidating", any block writes will get checksums added automatically, so the only thing the bgworker needs to do is load each block in the relation and explicitly mark it dirty (unless that's not required for FlushBuffer() to do its thing). After every block in the relation has been visited and checksummed this way, the relation's pg_class record gets "rellastchecksum" updated.
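The launcher-side selection logic above could be sketched roughly as follows, with mock structs standing in for pg_database/pg_class catalog access (all names here are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the relevant catalog columns. */
typedef struct { uint32_t datlastchecksum; } DatabaseEntry;
typedef struct { uint32_t rellastchecksum; } RelationEntry;

/* Return the index of the first database still behind the current
 * checksum cycle, or -1 if all databases are caught up. */
static int
next_database_needing_checksums(DatabaseEntry *dbs, size_t n, uint32_t cycle)
{
    for (size_t i = 0; i < n; i++)
        if (dbs[i].datlastchecksum < cycle)
            return (int) i;
    return -1;
}

/* Within one database: true iff every relation is caught up, meaning
 * "datlastchecksum" can be advanced to the current cycle. */
static bool
database_fully_checksummed(RelationEntry *rels, size_t n, uint32_t cycle)
{
    for (size_t i = 0; i < n; i++)
        if (rels[i].rellastchecksum < cycle)
            return false;
    return true;
}
```

When next_database_needing_checksums() returns -1, the launcher would move on to the global heap files and then flip the state to "enforcing", per the description above.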
** Function API:
Interface to the functionality will be via the following Utility functions:
- pg_enable_checksums(void) => turn checksums on for a cluster. Will error if the state is anything but "disabled". If this is the first time this cluster has run this, it will initialize ControlFile->data_checksum_version to the preferred built-in algorithm (since there's only one currently, we just set it to 1). It then increments the ControlFile->data_checksum_cycle variable and sets the state to "enabling"; the next time the bgworker checks whether there is anything to do, it will see that state, scan all the databases' "datlastchecksum" fields, and start kicking off bgworker processes to handle checksumming the actual relation files.
- pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to "disabled", which means the bgworker will not do anything.
- pg_request_checksum_cycle(void) => if checksums are enabled (state "enforcing"), increment the data_checksum_cycle counter and set the state to "enabling". (Alternately, if we use the "revalidating" state here we could ensure that existing checksums are validated on read, to alert us of any blocks with problems. This could also be made "smart", i.e., interrupt an existing running checksum cycle to kick off another one (not sure of the use case), effectively call pg_enable_checksums() if the cluster has not been explicitly enabled before, etc.; depends on how pedantic we want to be.)
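A sketch of the pg_enable_checksums() transition against a mocked-up pg_control (the field names come from this proposal; the struct, function name, and return convention are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the relevant pg_control fields proposed above. */
typedef enum { CS_DISABLED, CS_ENABLING, CS_ENFORCING } ChecksumState;

typedef struct
{
    uint32_t      data_checksum_version;    /* 0 = never initialized */
    uint32_t      data_checksum_cycle;
    ChecksumState data_checksum_state;
} ControlFileMock;

/* Returns 0 on success, -1 if the state is anything but "disabled". */
static int
pg_enable_checksums_sketch(ControlFileMock *cf)
{
    if (cf->data_checksum_state != CS_DISABLED)
        return -1;
    if (cf->data_checksum_version == 0)
        cf->data_checksum_version = 1;      /* only built-in algorithm */
    cf->data_checksum_cycle++;              /* start a new cycle */
    cf->data_checksum_state = CS_ENABLING;  /* bgworker picks this up */
    return 0;
}
```

The actual SQL-callable function would of course also need to make the pg_control update crash-safe, which this sketch ignores.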
** Design notes/implications:
When the system is in one of the modes which write checksums (currently everything but "disabled"), any new relations/databases will have their "rellastchecksum"/"datlastchecksum" counters prepopulated with the current value of "data_checksum_cycle", since we know that any space used for these relations will be checksummed, and hence valid. By pre-setting this, we remove the need for the checksum bgworker to explicitly visit these new relations and force checksums which would already be valid.
Since with checksums on any full-heap-rewriting operation will be properly checksummed, we may be able to pre-set rellastchecksum after operations such as ALTER TABLEs which trigger a full rewrite, *without* having the checksum bgworker run on the relation at all. I suspect a number of other places lend themselves to this kind of optimization. (Say, if we could somehow force a checksum operation on any full SeqScan and update the state after the fact, we'd avoid paying the penalty a second time.)
** pg_upgrade:
Milestone 2 in this patch is adding support for pg_upgrade.
With this additional complexity, we need to consider pg_upgrade, both now and in future versions. For one thing, we need to transfer settings from pg_control, plus make sure that pg_upgrade accepts deviances in any of the data_checksum-related settings in pg_control.
4 scenarios to consider if/what to allow:
*** non-checksummed -> non-checksummed
exactly as it stands now
*** checksummed -> non-checksummed
pretty trivial; since the system tables will be non-checksummed, this is just equivalent to resetting the checksum_cycle and pg_control fields. User data files will be copied or linked into place with their checksums intact, but since checksums are disabled they will be ignored.
*** non-checksummed -> checksummed
For the major version this patch makes it into, this will likely be the primary use case. Add an --enable-checksums option to `pg_upgrade` which initially sets the new cluster's state to "enabling" and pre-initializes the system databases with the correct state and checksum cycle flag.
*** checksummed -> checksummed
The potentially tricky case (but likely to be more common going forward as incremental checksums are supported).
Since the old cluster may have had a checksum cycle in progress, or may otherwise have advanced the checksum counter, we need to do the following:
- propagate data_checksum_state, data_checksum_cycle, and data_checksum_version. If we wanted to support a different checksum algorithm, we could pre-set data_checksum_version to a different version here, increment data_checksum_cycle, and set data_checksum_state to either "enabling" or "revalidating", depending on the state of the old cluster (i.e., "enabling" if we were still in the middle of an initial checksum cycle).
- new cluster's system tables may need to have the "rellastchecksum" and "datlastchecksum" settings saved from the previous system, if that's easy, to avoid a fresh checksum run if there is no need.
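Ignoring the mid-cycle subtleties above, the initial state chosen for the new cluster can be summarized in a small decision function (a simplification for illustration with invented names; the real "checksummed -> checksummed" case would carry the old state forward, including an in-progress "enabling" state):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { UP_DISABLED, UP_ENABLING, UP_ENFORCING } NewClusterState;

/* Initial data_checksum_state for the new cluster, given whether the
 * old cluster had checksums and whether --enable-checksums was given. */
static NewClusterState
new_cluster_checksum_state(bool old_had_checksums, bool enable_requested)
{
    if (!enable_requested)
        return UP_DISABLED;     /* scenarios 1 and 2: checksums off */
    if (!old_had_checksums)
        return UP_ENABLING;     /* scenario 3: kick off a fresh cycle */
    return UP_ENFORCING;        /* scenario 4: already fully checksummed */
}
```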
** Handling checksums on a standby:
How to handle checksums on a standby is a bit trickier, since checksums are inherently local cluster state and not WAL-logged, yet we are storing state in the system tables of each database.
In order to manage this discrepancy, we WAL-log a few additional pieces of information; specifically:
- new events to capture/propagate any of the pg_control fields: checksum version changes, checksum cycle increases, enabling/disabling actions
- checksum background worker block ranges.
Some notes on the block ranges: these would effectively be a series of records containing (datid, relid, start block, end block) for explicit checksum ranges, generated by the checksum bgworker as it checksums individual relations. Rather than having the granularity be per-relation, these records could be generated periodically (say in groups of 10K blocks or whatever; number to be determined) so that standby checksum recalculation is incremental and does not delay replay unnecessarily while checksums are being created.
Since the block range WAL records will be replayed before any of the pg_class/pg_database catalog records are replayed, we are guaranteed to have the checksums calculated on the standby by the time the system state claims they are valid.
We may also be able to use the WAL records to speed up the processing of existing heap files if they are interrupted for some reason, this remains to be seen.
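A sketch of the proposed block-range record and its periodic chunking (the struct and constant names are invented; the 10K group size is just the provisional number from the text):

```c
#include <assert.h>
#include <stdint.h>

#define CHECKSUM_RANGE_CHUNK 10000  /* blocks per WAL record; number TBD */

/* Hypothetical WAL record payload for a checksummed block range. */
typedef struct
{
    uint32_t datid;
    uint32_t relid;
    uint32_t start_block;
    uint32_t end_block;     /* inclusive */
} XlChecksumRange;

/* Walk a relation of nblocks blocks and emit one record per chunk via
 * the callback, so standby replay can recompute checksums incrementally
 * instead of stalling on one huge per-relation record. */
static int
emit_checksum_ranges(uint32_t datid, uint32_t relid, uint32_t nblocks,
                     void (*emit)(const XlChecksumRange *, void *), void *arg)
{
    int nrecords = 0;

    for (uint32_t start = 0; start < nblocks; start += CHECKSUM_RANGE_CHUNK)
    {
        XlChecksumRange r;
        uint32_t end = start + CHECKSUM_RANGE_CHUNK - 1;

        if (end >= nblocks)
            end = nblocks - 1;
        r.datid = datid;
        r.relid = relid;
        r.start_block = start;
        r.end_block = end;
        emit(&r, arg);
        nrecords++;
    }
    return nrecords;
}
```

For a 25000-block relation this produces three records, the last covering blocks 20000 through 24999.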
** Testing changes:
We need to add separate initdb checksum regression tests which live outside the normal pg_regress framework.
** Roadmap:
- Milestone 1 (master support) [0/7]
- [ ] pg_control updates for new data_checksum_cycle, data_checksum_state
- [ ] pg_class changes
- [ ] pg_database changes
- [ ] function API
- [ ] autovac launcher modifications
- [ ] checksum bgworker
- [ ] doc updates
- Milestone 2 (pg_upgrade support) [0/4]
- [ ] no checksum -> no checksum
- [ ] checksum -> no checksum
- [ ] no checksum -> checksum
- [ ] checksum -> checksum
- Milestone 3 (standby support) [0/4]
- [ ] WAL log checksum cycles
- [ ] WAL log enabling/disabling checksums
- [ ] WAL log checksum block ranges
- [ ] Add standby WAL replay
I look forward to any feedback; thanks!
David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 7/13/15 3:26 PM, David Christensen wrote:
> * Incremental Checksums
> PostgreSQL users should have a way up upgrading their cluster to use data checksums without having to do a costly pg_dump/pg_restore; in particular, checksums should be able to be enabled/disabled at will, with the database enforcing the logic of whether the pages considered for a given database are valid.
> Considered approaches for this are having additional flags to pg_upgrade to set up the new cluster to use checksums where they did not before (or optionally turning these off). This approach is a nice tool to have, but in order to be able to support this process in a manner which has the database online while the database is going throught the initial checksum process.
It would be really nice if this could be extended to handle different
page formats as well, something that keeps rearing its head. Perhaps
that could be done with the cycle idea you've described.
Another possibility is some kind of a page-level indicator of what
binary format is in use on a given page. For checksums maybe a single
bit would suffice (indicating that you should verify the page checksum).
Another use case is using this to finally ditch all the old VACUUM FULL
code in HeapTupleSatisfies*().
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Jul 13, 2015, at 3:50 PM, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:
> On 7/13/15 3:26 PM, David Christensen wrote:
>> * Incremental Checksums
>> PostgreSQL users should have a way up upgrading their cluster to use data checksums without having to do a costly pg_dump/pg_restore; in particular, checksums should be able to be enabled/disabled at will, with the database enforcing the logic of whether the pages considered for a given database are valid.
>> Considered approaches for this are having additional flags to pg_upgrade to set up the new cluster to use checksums where they did not before (or optionally turning these off). This approach is a nice tool to have, but in order to be able to support this process in a manner which has the database online while the database is going throught the initial checksum process.
> It would be really nice if this could be extended to handle different page formats as well, something that keeps rearing its head. Perhaps that could be done with the cycle idea you've described.
I had had this thought too, but the main issues I saw were that new page formats were not guaranteed to take up the same space/storage, so there was an inherent limitation on the ability to restructure things out *arbitrarily*; that being said, there may be a use-case for the types of modifications that this approach *would* be able to handle.
> Another possibility is some kind of a page-level indicator of what binary format is in use on a given page. For checksums maybe a single bit would suffice (indicating that you should verify the page checksum). Another use case is using this to finally ditch all the old VACUUM FULL code in HeapTupleSatisfies*().
There’s already a page version field, no? I assume that would be sufficient for the page format indicator. I’m not sure about using a flag to indicate the checksum should be verified: setting it modifies the very page being checksummed, and we want to avoid rewriting a bunch of pages unnecessarily. You’d also presumably need to clear that state again, which would be an additional write. This is the issue the checksum cycle was meant to handle, since we store this information in the system catalogs and the types of modifications there are idempotent.
David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171
On 2015-07-13 15:50:44 -0500, Jim Nasby wrote:
> Another possibility is some kind of a page-level indicator of what binary
> format is in use on a given page. For checksums maybe a single bit would
> suffice (indicating that you should verify the page checksum). Another use
> case is using this to finally ditch all the old VACUUM FULL code in
> HeapTupleSatisfies*().
That's a bad idea, because that bit would then not be protected by the
checksum.
On 7/13/15 4:02 PM, David Christensen wrote:
> On Jul 13, 2015, at 3:50 PM, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:
>> On 7/13/15 3:26 PM, David Christensen wrote:
>>> * Incremental Checksums
>>> PostgreSQL users should have a way up upgrading their cluster to use data checksums without having to do a costly pg_dump/pg_restore; in particular, checksums should be able to be enabled/disabled at will, with the database enforcing the logic of whether the pages considered for a given database are valid.
>>> Considered approaches for this are having additional flags to pg_upgrade to set up the new cluster to use checksums where they did not before (or optionally turning these off). This approach is a nice tool to have, but in order to be able to support this process in a manner which has the database online while the database is going throught the initial checksum process.
>> It would be really nice if this could be extended to handle different page formats as well, something that keeps rearing it's head. Perhaps that could be done with the cycle idea you've described.
> I had had this thought too, but the main issues I saw were that new page formats were not guaranteed to take up the same space/storage, so there was an inherent limitation on the ability to restructure things out *arbitrarily*; that being said, there may be a use-case for the types of modifications that this approach *would* be able to handle.
After some discussion on IRC, I think there are two main points to consider.
First, we're currently unhappy with how relfrozenxid works, and this
proposal follows the same pattern of having essentially a counter field
in pg_class. Perhaps this is OK because things like checksum really
shouldn't change that often. (My inclination is that fields in pg_class
are OK for now.)
Second, there are 4 use cases here that are very similar. We should
*consider* them now, while designing this. That doesn't mean the first
patch needs to support anything other than checksums.
1) Page layout changes
2) Page run-time changes (currently only checksums)
3) Tuple layout changes (ie: HEAP_MOVED_IN)
4) Tuple run-time changes (ie: DROP COLUMN)
1 is currently handled in pg_upgrade by forcing a page-by-page copy
during upgrade. Doing this online would require the same kind of
conversion plugin pg_upgrade uses. If we want to support conversions
that need extra free space on a page we'd also need support for that.
2 is similar to 1, except this can change via GUC or similar. Checksums
are an example of this, as is creating extra free space on a page to
support an upgrade.
3 & 4 are tuple-level equivalents to 1 & 2.
I think the bigger challenge to these things is how to track the status
of a conversion (as opposed to the conversion function itself).
- Do we want each of these to have a separate counter in pg_class?
(rellastchecksum, reloldestpageversion, etc)
- Should that info be combined?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jul 14, 2015 at 1:56 AM, David Christensen <david@endpoint.com> wrote:
> For any relation that it finds in the database which is not checksummed,
> it starts an actual worker to handle the checksum process for this table.
> Since the state of the cluster is already either "enforcing" or
> "revalidating", any block writes will get checksums added automatically, so
> the only thing the bgworker needs to do is load each block in the relation
> and explicitly mark as dirty (unless that's not required for FlushBuffer()
> to do its thing). After every block in the relation is visited this way
> and checksummed, its pg_class record will have "rellastchecksum" updated.
If the system crashes during the scan of a relation, after checksumming half of its blocks, then under the above scheme a restart would need to read all of the blocks again, even though some of them were already checksummed before the crash. This is okay if it happens for a few small or medium size relations, but if multiple large relations are in that state (half their blocks checksummed) when the crash occurs, it could lead to much more I/O than required.
> ** Function API:
> Interface to the functionality will be via the following Utility
> functions:
> - pg_enable_checksums(void) => turn checksums on for a cluster. Will
> error if the state is anything but "disabled". If this is the first time
> this cluster has run this, this will initialize
> ControlFile->data_checksum_version to the preferred built-in algorithm
> (since there's only one currently, we just set it to 1). This increments
> the ControlFile->data_checksum_cycle variable, then sets the state to
> "enabling", which means that the next time the bgworker checks if there is
> anything to do it will see that state, scan all the databases'
> "datlastchecksum" fields, and start kicking off the bgworker processes to
> handle the checksumming of the actual relation files.
> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets
> the state to "disabled", which means bg_worker will not do anything.
> - pg_request_checksum_cycle(void) => if checksums are "enabled",
> increment the data_checksum_cycle counter and set the state to "enabling".
If the cluster is already enabled for checksums, then what is
the need for any other action?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-07-15 12:48:40 +0530, Amit Kapila wrote:
> If during scan of a relation, after doing checksum for half of the
> blocks in relation, system crashes, then in the above scheme a
> restart would need to again read all the blocks even though some
> of the blocks are already checksummed in previous cycle, this is
> okay if it happens for few small or medium size relations, but assume
> it happens when multiple large size relations are at same state
> (half blocks are checksummed) when the crash occurs, then it could
> lead to much more IO than required.
I don't think this is worth worrying about. If you crash frequently
enough for this to be a problem you should fix that. Adding complexity
for such an uncommon case spreads the cost to many more people.
On Jul 15, 2015, at 3:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to "disabled", which means bg_worker will not do anything.
>> - pg_request_checksum_cycle(void) => if checksums are "enabled", increment the data_checksum_cycle counter and set the state to "enabling".
> If the cluster is already enabled for checksums, then what is
> the need for any other action?
You are assuming this is a one-way action. Some people may decide that checksums end up costing too much overhead or similar; we should support disabling the feature, and with this proposed patch the disable action is fairly trivial to handle.
Requesting an explicit checksum cycle would be desirable in the case where you want to proactively verify there is no cluster corruption to be found.
David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171
On Wed, Jul 15, 2015 at 9:13 PM, David Christensen <david@endpoint.com> wrote:
> On Jul 15, 2015, at 3:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to "disabled", which means bg_worker will not do anything.
>>> - pg_request_checksum_cycle(void) => if checksums are "enabled", increment the data_checksum_cycle counter and set the state to "enabling".
>> If the cluster is already enabled for checksums, then what is the need for any other action?
> You are assuming this is a one-way action.
No, I was confused by the state (enabling) this function will set.
> Requesting an explicit checksum cycle would be desirable in the case
> where you want to proactively verify there is no cluster corruption to be
> found.
Sure, but I think that is different from setting the state to "enabling". In your proposal above, in the "enabling" state the cluster needs to write checksums, whereas for such a feature you only need read validation.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to "disabled", which means bg_worker will not do anything.
>>>> - pg_request_checksum_cycle(void) => if checksums are "enabled", increment the data_checksum_cycle counter and set the state to "enabling".
>>> If the cluster is already enabled for checksums, then what is the need for any other action?
>> You are assuming this is a one-way action.
> No, I was confused by the state (enabling) this function will set.
Okay.
>> Requesting an explicit checksum cycle would be desirable in the case where you want to proactively verify there is no cluster corruption to be found.
> Sure, but I think that is different from setting the state to enabling.
> In your proposal above, in enabling state cluster needs to write
> checksums, where for such a feature you only need read validation.
With “revalidating”, since your database is still actively making changes, you need to checksum writes too (think new tables, etc.). “Enabling” leaves reads unvalidated because you’re starting from an unknown state (i.e., pages not already checksummed).
Thanks,
David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
david@endpoint.com
785-727-1171