[Patch] Checksums for SLRU files
Hello, I`d like to show my implementation of SLRU file protection with
checksums.
It only has effect if checksums on database are enabled.
Detection of a checksum failure during a read normally causes PostgreSQL
to report an error. Setting ignore_slru_checksum_failure to on causes
the system to ignore the failure (but still report a warning), and
continue processing. It is similary like ignore_checksum_failure but for
some slru files. The default setting is true, and it can only be changed
by a database admin. For changes, use ALTER SYSTEM and reload
configuration:
ALTER SYSTEM SET ignore_slru_checksum_failure = off;
SELECT pg_reload_conf();
Impementation:
1) Checksum writes in last 2 bytes of every page
2) Checksum calculates and write down in page happens when page writes
on disk (same as relation does it)
3) Checking checksum happens same as relation does it, on
Read\PhysicalRead when GUC ignore_slru_checksum_failure = false
{default = true}.
I would like to hear your thoughts over my patch.
--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
slruchecksums.patchtext/x-diff; name=slruchecksums.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aeda826..8501dd2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8394,6 +8394,29 @@ LOG: CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
</listitem>
</varlistentry>
+ <varlistentry id="guc-ignore-slru-checksum-failure" xreflabel="ignore_slru_checksum_failure">
+ <term><varname>ignore_slru_checksum_failure</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>ignore_slru_checksum_failure</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Only has effect if <xref linkend="app-initdb-data-checksums"> are enabled.
+ </para>
+ <para>
+ Detection of a checksum failure during a read normally causes
+ <productname>PostgreSQL</> to report an error. Setting <varname>ignore_slru_checksum_failure</>
+ to on causes the system to ignore the failure (but still report a warning), and
+ continue processing. Similary like <varname>ignore_checksum_failure</> but for
+ not relation files. The default setting is <literal>true</>, and it can only be
+ changed by a superuser. For changes, use ALTER SYSTEM and reload configuration:
+ ALTER SYSTEM SET ignore_slru_checksum_failure = off;
+ SELECT pg_reload_conf();
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-zero-damaged-pages" xreflabel="zero_damaged_pages">
<term><varname>zero_damaged_pages</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index a3e2b12..87ed09c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -58,7 +58,7 @@
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
-#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+#define CLOG_XACTS_PER_PAGE ((BLCKSZ - CHKSUMSZ) * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7142ece..35c317e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -106,7 +106,7 @@
*/
/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+#define MULTIXACT_OFFSETS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / sizeof(MultiXactOffset))
#define MultiXactIdToOffsetPage(xid) \
((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
@@ -2039,7 +2039,6 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
-
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9dd7719..b7a154c 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -58,7 +58,18 @@
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
+#include "utils/guc.h"
+#include "storage/checksum.h"
+#include "utils/memutils.h"
+
+/*
+ * GUC variable
+ * Set from backend:
+ * alter system set ignore_slru_checksum_failure = on/off;
+ * select pg_reload_conf();
+ */
+bool ignore_slru_checksum_failure = true;
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -376,6 +387,8 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid)
{
SlruShared shared = ctl->shared;
+ int checksum;
+ static char *pageCopy = NULL;
/* Outer loop handles restart if we must wait for someone else's I/O */
for (;;)
@@ -426,6 +439,22 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
+ if (pageCopy == NULL)
+ pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+
+ memcpy(pageCopy, shared->page_buffer[slotno], BLCKSZ);
+
+ checksum = pg_getchecksum_slru_page(pageCopy);
+
+ if ( DataChecksumsEnabled() && (checksum != pg_checksum_slru_page(pageCopy) ))
+ {
+ elog(LOG, "CHECKSUM: Page Is not Verified.");
+ if (!ignore_slru_checksum_failure)
+ {
+ elog(ERROR, "CHECKSUM: ERROR ignore_slru_checksum_failure turned off.");
+ }
+ }
+
/* Set the LSNs for this newly read-in page to zero */
SimpleLruZeroLSNs(ctl, slotno);
@@ -539,6 +568,14 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Release control lock while doing I/O */
LWLockRelease(shared->ControlLock);
+ /*
+ * Calculate checksum of the page and write in the last 2 bytes of the page
+ *
+ * NOTE: it`s all done in checksum_impl.h : pg_checksum_slru_page()
+ * you give there page and get page with updated checksum
+ */
+ pg_checksum_slru_page(shared->page_buffer[slotno]);
+
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ae22185..8d62e7e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -116,6 +116,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
+extern bool ignore_slru_checksum_failure;
extern bool synchronize_seqscans;
#ifdef TRACE_SYNCSCAN
@@ -1006,6 +1007,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"ignore_slru_checksum_failure", PGC_SUSET, DEVELOPER_OPTIONS,
+ gettext_noop("Continues processing after SLRU checksum failure."),
+ gettext_noop("Detection of a SLRU checksum failure normally causes PostgreSQL to "
+ "report an error. Setting ignore_slru_checksum_failure to true"
+ "causes the system to ignore the failure (but still report a warning), and"
+ "continue processing. This behavior could cause crashes or other serious"
+ "problems. Only has an effect if checksums are enabled."),
+ GUC_NOT_IN_SAMPLE
+ },
+ &ignore_slru_checksum_failure,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing past damaged page headers."),
gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0f2b8bd..57f7a1e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -292,6 +292,9 @@ extern void assign_checkpoint_completion_target(double newval, void *extra);
* Routines to start, stop, and get status of a base backup.
*/
+/* Size of checksum in bytes default 2 bytes (uint16) */
+#define CHKSUMSZ 2
+
/*
* Session-level status of base backups
*
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index b85f714..67535b0 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -21,4 +21,8 @@
*/
extern uint16 pg_checksum_page(char *page, BlockNumber blkno);
+extern uint16 pg_checksum_slru_page(char *page);
+
+extern uint16 pg_getchecksum_slru_page(char *page);
+
#endif /* CHECKSUM_H */
diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h
index bffd061..8e340db 100644
--- a/src/include/storage/checksum_impl.h
+++ b/src/include/storage/checksum_impl.h
@@ -106,6 +106,8 @@
#define N_SUMS 32
/* prime multiplier of FNV-1a hash */
#define FNV_PRIME 16777619
+/* Size of checksum in bytes default 2 bytes (uint16) */
+#define CHKSUMSZ 2
/*
* Base offsets to initialize each of the parallel FNV hashes into a
@@ -205,3 +207,51 @@ pg_checksum_page(char *page, BlockNumber blkno)
*/
return (checksum % 65535) + 1;
}
+
+/*
+ * Compute the checksum for a Postgres SLRU page. The page must be aligned on a
+ * 4-byte boundary.
+ *
+ * The checksum save itself to the last 2 bytes (CHKSUMSZ = 2 bytes) of the page
+ */
+uint16
+pg_checksum_slru_page(char *page)
+{
+ uint16 checksum;
+ uint8 bytes[2];
+
+ /* Set last 2 bytes to 0 */
+ page[BLCKSZ - 1] = 0;
+ page[BLCKSZ - 2] = 0;
+
+ checksum = (pg_checksum_block(page, BLCKSZ) % 65535) + 1;
+
+ bytes[0] = (checksum & 0X00FF); /* Least significant bit */
+ bytes[1] = (checksum & 0XFF00) >> 8; /* Most significant bit */
+
+
+ /* Set last 2 bytes to calculated checksum */
+ page[BLCKSZ - 1] = bytes[0];
+ page[BLCKSZ - 2] = bytes[1];
+
+ return checksum;
+}
+
+
+/*
+ * Get the checksum for a Postgres SLRU page. The page must be aligned on a
+ * 4-byte boundary.
+ *
+ * The checksum located at last 2 bytes (CHKSUMSZ = 2 bytes) of the page
+ */
+uint16
+pg_getchecksum_slru_page(char *page)
+{
+ uint16 checksum = 0X00000000;
+
+ checksum = checksum | page[BLCKSZ - 2];
+ checksum = checksum << 8 ;
+ checksum = checksum | (page[BLCKSZ - 1] & 0X00FF);
+
+ return checksum;
+}
Hi, Ivan!
31 дек. 2017 г., в 22:30, Ivan Kartyshov <i.kartyshov@postgrespro.ru> написал(а):
Hello, I`d like to show my implementation of SLRU file protection with checksums.
.....
I would like to hear your thoughts over my patch.
As far as I can see, the patch solves problem of hardware corruption in SLRU.
This seems a valid concern. I've tried to understand your patch and few questions arose which I could not answer myself.
1. Why do you propose different GUC besides ignore_checksum_failure? What is scenario of it's use which is not covered by general GUC switch?
2. What is performance penalty of having this checksums?
Besides this, some things seems suspicious to me:
1. This comment seems excessive. I'd leave just first one first line.
+/*
+ * GUC variable
+ * Set from backend:
+ * alter system set ignore_slru_checksum_failure = on/off;
+ * select pg_reload_conf();
+ */
2. Functions pg_checksum_slru_page() and pg_getchecksum_slru_page() operate with two bytes instead of uint16. This seems strange.
3. This line
checksum = (pg_checksum_block(page, BLCKSZ) % 65535) + 1;
Need to share comment with previous function (pg_checksum_page()). +1 was a tough thing for me to understand before looking around and reading those comments.
4. I could not understand purpose of this expression
page[BLCKSZ - 1] & 0X00FF
Happy New Year :) Keep up good work.
Best regards, Andrey Borodin.
On Mon, Jan 1, 2018 at 9:19 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
31 дек. 2017 г., в 22:30, Ivan Kartyshov <i.kartyshov@postgrespro.ru>
написал(а):
Hello, I`d like to show my implementation of SLRU file protection with
checksums.
.....
I would like to hear your thoughts over my patch.As far as I can see, the patch solves problem of hardware corruption in
SLRU.
This seems a valid concern. I've tried to understand your patch and few
questions arose which I could not answer myself.1. Why do you propose different GUC besides ignore_checksum_failure? What
is scenario of it's use which is not covered by general GUC switch?
2. What is performance penalty of having this checksums?Besides this, some things seems suspicious to me: 1. This comment seems excessive. I'd leave just first one first line. +/* + * GUC variable + * Set from backend: + * alter system set ignore_slru_checksum_failure = on/off; + * select pg_reload_conf(); + */ 2. Functions pg_checksum_slru_page() and pg_getchecksum_slru_page() operate with two bytes instead of uint16. This seems strange. 3. This line checksum = (pg_checksum_block(page, BLCKSZ) % 65535) + 1; Need to share comment with previous function (pg_checksum_page()). +1 was a tough thing for me to understand before looking around and reading those comments. 4. I could not understand purpose of this expression page[BLCKSZ - 1] & 0X00FF
Andrey, thank you for review.
Ivan, thank you for submitting this patch. I also have some notes on it
from the first glance.
1. It seems that you need to define some macro for (BLCKSZ - CHKSUMSZ) in
order to evade typing it multiple times.
2. You also didn't modify all the SLRU macros depending on useful size of
page. You missed at least COMMIT_TS_XACTS_PER_PAGE,
MULTIXACT_MEMBERGROUPS_PER_PAGE and SUBTRANS_XACTS_PER_PAGE.
3. pg_upgrade isn't considered. This patch should provide upgrading SLRUs
to adopt changed useful size of page. That seems to be hardest patch of
this patch to be written.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Jan 3, 2018 at 11:21 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Mon, Jan 1, 2018 at 9:19 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
31 дек. 2017 г., в 22:30, Ivan Kartyshov <i.kartyshov@postgrespro.ru>
написал(а):Hello, I`d like to show my implementation of SLRU file protection with
checksums.
.....
I would like to hear your thoughts over my patch.As far as I can see, the patch solves problem of hardware corruption in
SLRU.
This seems a valid concern.
+1
It doesn't make sense to have a checksum feature that protects
relation files, but doesn't protect these super important meta-data
files that affect our interpretation of the relation files.
[...]
3. pg_upgrade isn't considered. This patch should provide upgrading SLRUs
to adopt changed useful size of page. That seems to be hardest patch of
this patch to be written.
+1
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.
--
Thomas Munro
http://www.enterprisedb.com
On 2018-02-02 11:37:34 +1300, Thomas Munro wrote:
3. pg_upgrade isn't considered. This patch should provide upgrading SLRUs
to adopt changed useful size of page. That seems to be hardest patch of
this patch to be written.+1
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.
This patch is in the 2018-03 CF, but I don't see any progress since the
last comments. As it has been Waiting on author since the last CF, I
think we should mark this as returned with feedback.
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-02-02 11:37:34 +1300, Thomas Munro wrote:
3. pg_upgrade isn't considered. This patch should provide upgrading SLRUs
to adopt changed useful size of page. That seems to be hardest patch of
this patch to be written.+1
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.This patch is in the 2018-03 CF, but I don't see any progress since the
last comments. As it has been Waiting on author since the last CF, I
think we should mark this as returned with feedback.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-03-02 11:49:05 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-02-02 11:37:34 +1300, Thomas Munro wrote:
3. pg_upgrade isn't considered. This patch should provide upgrading SLRUs
to adopt changed useful size of page. That seems to be hardest patch of
this patch to be written.+1
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.This patch is in the 2018-03 CF, but I don't see any progress since the
last comments. As it has been Waiting on author since the last CF, I
think we should mark this as returned with feedback.+1.
Done.
Hi, Ivan!
I've found that there are few more places with SLRU items per page, where you have to update usable page size. Please find the diff attached.
I agree that there is a little chance to get this commitable quickly, but still, the feature worth working on, from my point of view.
Best regards, Andrey Borodin.
Attachments:
moar_slru.diffapplication/octet-stream; name=moar_slru.diff; x-unix-mode=0644Download
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 7b7bf2b2bf..1c19ed28ec 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -64,7 +64,7 @@ typedef struct CommitTimestampEntry
sizeof(RepOriginId))
#define COMMIT_TS_XACTS_PER_PAGE \
- (BLCKSZ / SizeOfCommitTimestampEntry)
+ (BLCKSZ - CHKSUMSZ / SizeOfCommitTimestampEntry)
#define TransactionIdToCTsPage(xid) \
((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 35a3d36a7a..f5b0a7782b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -138,7 +138,7 @@
/* size in bytes of a complete group */
#define MULTIXACT_MEMBERGROUP_SIZE \
(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f640661130..80420cb7a4 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -49,7 +49,7 @@
*/
/* We need four bytes per xact */
-#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId))
+#define SUBTRANS_XACTS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / sizeof(TransactionId))
#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)
#define TransactionIdToEntry(xid) ((xid) % (TransactionId) SUBTRANS_XACTS_PER_PAGE)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f7de742a56..8f672635e3 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,7 +150,7 @@
* than that, so changes in that data structure won't affect user-visible
* restrictions.
*/
-#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128 - CHKSUMSZ)
/*
* Struct representing an entry in the global notify queue
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 251a359bff..de8b74820e 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -315,7 +315,7 @@ static SlruCtlData OldSerXidSlruCtlData;
#define OLDSERXID_PAGESIZE BLCKSZ
#define OLDSERXID_ENTRYSIZE sizeof(SerCommitSeqNo)
-#define OLDSERXID_ENTRIESPERPAGE (OLDSERXID_PAGESIZE / OLDSERXID_ENTRYSIZE)
+#define OLDSERXID_ENTRIESPERPAGE ((OLDSERXID_PAGESIZE - CHKSUMSZ) / OLDSERXID_ENTRYSIZE)
/*
* Set maximum pages based on the lesser of the number needed to track all
Hi, Ivan!
5 марта 2018 г., в 20:58, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
I've found that there are few more places with SLRU items per page
I was looking into this patch mainly because I was reviewing other checksums patch in different thread. But the purpose of this patch seems viable for me.
After looking into the patch I'd like to propose some editorialization:
0. Removed GUC: ignore_checksum_failure looks good enough
1. Removed copy on read from disk. Also I do not see a reason to copy page before write
2. multis are upgraded by WAL reset, CLOG is actually upgraded
3. Updated all places where SLRU block size was used
Best regards, Andrey Borodin.
Attachments:
0001-SLRU-checksums-patch.patchapplication/octet-stream; name=0001-SLRU-checksums-patch.patch; x-unix-mode=0644Download
From 023db6aaeee8cfdbe1d89bfd6ae7c13dd3a60465 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 1 Jan 2018 20:55:14 +0500
Subject: [PATCH] SLRU checksums patch
---
src/backend/access/transam/clog.c | 2 +-
src/backend/access/transam/commit_ts.c | 2 +-
src/backend/access/transam/multixact.c | 4 +-
src/backend/access/transam/slru.c | 26 +++++
src/backend/access/transam/subtrans.c | 2 +-
src/backend/commands/async.c | 2 +-
src/backend/storage/lmgr/predicate.c | 2 +-
src/bin/pg_upgrade/pg_upgrade.c | 188 +++++++++++++++++++++++++++++++--
src/bin/pg_upgrade/pg_upgrade.h | 4 +
src/include/access/slru.h | 1 +
src/include/catalog/catversion.h | 2 +-
src/include/storage/checksum.h | 11 +-
src/include/storage/checksum_impl.h | 46 ++++++++
13 files changed, 273 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index bbf9ce1a3a..3d9dba0414 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -58,7 +58,7 @@
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
-#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+#define CLOG_XACTS_PER_PAGE ((BLCKSZ - CHKSUMSZ) * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 7b7bf2b2bf..621f6cf482 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -64,7 +64,7 @@ typedef struct CommitTimestampEntry
sizeof(RepOriginId))
#define COMMIT_TS_XACTS_PER_PAGE \
- (BLCKSZ / SizeOfCommitTimestampEntry)
+ ((BLCKSZ - CHKSUMSZ) / SizeOfCommitTimestampEntry)
#define TransactionIdToCTsPage(xid) \
((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ba01e94328..f04c23c649 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -106,7 +106,7 @@
*/
/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+#define MULTIXACT_OFFSETS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / sizeof(MultiXactOffset))
#define MultiXactIdToOffsetPage(xid) \
((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
@@ -138,7 +138,7 @@
/* size in bytes of a complete group */
#define MULTIXACT_MEMBERGROUP_SIZE \
(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 94b6e6612a..651afd51e9 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -55,9 +55,15 @@
#include "access/transam.h"
#include "access/xlog.h"
#include "pgstat.h"
+#include "storage/checksum.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+/* GUC variable */
+extern bool ignore_checksum_failure;
#define SlruFileName(ctl, path, seg) \
@@ -376,6 +382,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid)
{
SlruShared shared = ctl->shared;
+ int16 checksum;
/* Outer loop handles restart if we must wait for someone else's I/O */
for (;;)
@@ -426,6 +433,19 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
+ if (DataChecksumsEnabled() && ok)
+ {
+ checksum = pg_getchecksum_slru_page(shared->page_buffer[slotno]);
+ if (checksum != pg_checksum_slru_page(shared->page_buffer[slotno]))
+ {
+ elog(LOG, "CHECKSUM: Page Is not Verified.");
+ if (!ignore_checksum_failure)
+ {
+ elog(ERROR, "CHECKSUM: ERROR ignore_checksum_failure turned off.");
+ }
+ }
+ }
+
/* Set the LSNs for this newly read-in page to zero */
SimpleLruZeroLSNs(ctl, slotno);
@@ -539,6 +559,12 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Release control lock while doing I/O */
LWLockRelease(shared->ControlLock);
+ /*
+ * Update checksum on the page. We do not need to copy the page since page
+ * contents cannot be modified under the lock.
+ */
+ pg_setchecksum_slru_page(shared->page_buffer[slotno]);
+
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f640661130..80420cb7a4 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -49,7 +49,7 @@
*/
/* We need four bytes per xact */
-#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId))
+#define SUBTRANS_XACTS_PER_PAGE ((BLCKSZ - CHKSUMSZ) / sizeof(TransactionId))
#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)
#define TransactionIdToEntry(xid) ((xid) % (TransactionId) SUBTRANS_XACTS_PER_PAGE)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f7de742a56..8f672635e3 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,7 +150,7 @@
* than that, so changes in that data structure won't affect user-visible
* restrictions.
*/
-#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128 - CHKSUMSZ)
/*
* Struct representing an entry in the global notify queue
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 251a359bff..de8b74820e 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -315,7 +315,7 @@ static SlruCtlData OldSerXidSlruCtlData;
#define OLDSERXID_PAGESIZE BLCKSZ
#define OLDSERXID_ENTRYSIZE sizeof(SerCommitSeqNo)
-#define OLDSERXID_ENTRIESPERPAGE (OLDSERXID_PAGESIZE / OLDSERXID_ENTRYSIZE)
+#define OLDSERXID_ENTRIESPERPAGE ((OLDSERXID_PAGESIZE - CHKSUMSZ) / OLDSERXID_ENTRYSIZE)
/*
* Set maximum pages based on the lesser of the number needed to track all
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index c10103f0bf..5751285b3f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -404,17 +404,183 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
check_ok();
}
+#include "storage/checksum.h"
+
+#include <dirent.h>
+#include <math.h>
+#include <fcntl.h>
+
+#define SLRU_PAGES_PER_SEGMENT 32
+#define SLRU_SEGMENT_SIZE (BLCKSZ * SLRU_PAGES_PER_SEGMENT)
+
+#define CLOG_BYTES_PER_PAGE_NEW (BLCKSZ - CHKSUMSZ)
+#define CLOG_BYTES_PER_SEGMENT_NEW (BLCKSZ - CHKSUMSZ) * SLRU_PAGES_PER_SEGMENT
+
+static void write_xact_data_to_file(char *file_name, uint32 local_start, char *data, uint32 length)
+{
+ int dest_fd;
+ int local_end = local_start + length;
+ char *buffer = pg_malloc(SLRU_SEGMENT_SIZE);
+
+ Assert(length <= CLOG_BYTES_PER_SEGMENT_NEW);
+
+ if ((dest_fd = open(file_name, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+ S_IRUSR | S_IWUSR)) < 0)
+ pg_fatal("could not create file \"%s\": %s\n", file_name, strerror(errno));
+
+ if (ftruncate(dest_fd, SLRU_SEGMENT_SIZE) < 0)
+ pg_fatal("could not set size of file \"%s\": %s\n", file_name, strerror(errno));
+
+ while (local_start < local_end)
+ {
+ int nbytes;
+ int page = local_start / CLOG_BYTES_PER_PAGE_NEW;
+ int page_start = local_start - (page * CLOG_BYTES_PER_PAGE_NEW);
+ int delta = Min((page + 1) * CLOG_BYTES_PER_PAGE_NEW, local_end) - local_start;
+
+ if (lseek(dest_fd, page * BLCKSZ, SEEK_SET) < 0)
+ pg_fatal("could not set position in file \"%s\": %s\n", file_name, strerror(errno));
+
+ nbytes = read(dest_fd, buffer, BLCKSZ);
+
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %s\n", file_name, strerror(errno));
+
+ memmove(buffer + page_start, data, delta);
+
+ pg_setchecksum_slru_page(buffer);
+
+ if (lseek(dest_fd, page * BLCKSZ, SEEK_SET) < 0)
+ pg_fatal("could not set position in file \"%s\": %s\n", file_name, strerror(errno));
+
+ if (write(dest_fd, buffer, BLCKSZ) != BLCKSZ)
+ {
+ pg_fatal("could not write file \"%s\": %s\n", file_name, strerror(errno));
+ }
+
+ local_start += delta;
+ data += delta;
+ }
+
+ pg_free(buffer);
+ close(dest_fd);
+}
+
+static void
+distribute_xact_data(char *buffer, int nbytes, int oldsegno, const char *new_subdir)
+{
+ uint64 start = oldsegno * ((uint64) SLRU_SEGMENT_SIZE);
+ uint64 end = start + nbytes;
+
+ while (start < end)
+ {
+ int new_segno = start / (CLOG_BYTES_PER_SEGMENT_NEW);
+ uint64 local_start = start - new_segno * CLOG_BYTES_PER_SEGMENT_NEW;
+ uint64 local_end = Min(end, ((uint64)new_segno + 1) * CLOG_BYTES_PER_SEGMENT_NEW);
+ int64 length = local_end - start;
+ char new_file[MAXPGPATH];
+
+ Assert(length > 0);
+ Assert(length == (uint32)length);
+ Assert(local_start == (uint32)local_start);
+
+ snprintf(new_file, sizeof(new_file), "%s/%s/%04X", new_cluster.pgdata, new_subdir, new_segno);
+
+ write_xact_data_to_file(new_file, (uint32)local_start, buffer, (uint32)length);
+
+ start +=length;
+ buffer +=length;
+ }
+}
+
+static void
+upgrade_one_xact_file(const char *old_file, int segno, const char *new_subdir)
+{
+ char *buffer = pg_malloc(SLRU_SEGMENT_SIZE);
+ int src_fd;
+ ssize_t nbytes;
+
+ if ((src_fd = open(old_file, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %s\n", old_file, strerror(errno));
+
+ nbytes = read(src_fd, buffer, SLRU_SEGMENT_SIZE);
+
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %s\n", old_file, strerror(errno));
+
+ distribute_xact_data(buffer, nbytes, segno, new_subdir);
+
+ pg_free(buffer);
+ close(src_fd);
+}
+
+static void
+upgrade_xact_files(const char *old_subdir, const char *new_subdir)
+{
+ char old_path[MAXPGPATH];
+ char old_file[MAXPGPATH];
+
+ DIR *cldir;
+ struct dirent *clde;
+ int segno;
+
+ remove_new_subdir(new_subdir, false);
+
+ snprintf(old_path, sizeof(old_path), "%s/%s", old_cluster.pgdata, old_subdir);
+
+ prep_status("Upgrading old %s to new cluster", old_subdir);
+
+
+ if ((cldir = opendir(old_path)) == NULL)
+ {
+ pg_fatal("could not open dir \"%s\": %s\n", old_path, strerror(errno));
+ }
+
+ while (errno = 0, (clde = readdir(cldir)) != NULL)
+ {
+ size_t len;
+
+ len = strlen(clde->d_name);
+
+ if ((len == 4 || len == 5 || len == 6) &&
+ strspn(clde->d_name, "0123456789ABCDEF") == len)
+ {
+ segno = (int) strtol(clde->d_name, NULL, 16);
+ snprintf(old_file, sizeof(old_file), "%s/%s", old_path, clde->d_name);
+
+ upgrade_one_xact_file(old_file, segno, new_subdir);
+ }
+ }
+
+ if (errno)
+ {
+ pg_fatal("could not read dir \"%s\": %s\n", old_path, strerror(errno));
+ }
+ check_ok();
+}
+
static void
copy_xact_xlog_xid(void)
{
- /*
- * Copy old commit logs to new data dir. pg_clog has been renamed to
- * pg_xact in post-10 clusters.
- */
- copy_subdir_files(GET_MAJOR_VERSION(old_cluster.major_version) < 1000 ?
- "pg_clog" : "pg_xact",
- GET_MAJOR_VERSION(new_cluster.major_version) < 1000 ?
- "pg_clog" : "pg_xact");
+ bool slru_changed = (new_cluster.controldata.cat_ver >= SLRU_FORMAT_CHANGE_CAT_VER &&
+ old_cluster.controldata.cat_ver < SLRU_FORMAT_CHANGE_CAT_VER);
+ char *xact_old_subdir = GET_MAJOR_VERSION(old_cluster.major_version) < 1000 ?
+ "pg_clog" : "pg_xact";
+ char *xact_new_subdir = GET_MAJOR_VERSION(new_cluster.major_version) < 1000 ?
+ "pg_clog" : "pg_xact";
+
+ if (slru_changed)
+ {
+ upgrade_xact_files(xact_old_subdir, xact_new_subdir);
+ }
+ else
+ {
+ /*
+ * Copy old commit logs to new data dir. pg_clog has been renamed to
+ * pg_xact in post-10 clusters.
+ */
+ copy_subdir_files(xact_old_subdir, xact_new_subdir);
+ }
/* set the next transaction id and epoch of the new cluster */
prep_status("Setting next transaction ID and epoch for new cluster");
@@ -442,7 +608,8 @@ copy_xact_xlog_xid(void)
* server doesn't attempt to read multis older than the cutoff value.
*/
if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
- new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+ new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
+ !slru_changed)
{
copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
copy_subdir_files("pg_multixact/members", "pg_multixact/members");
@@ -462,7 +629,8 @@ copy_xact_xlog_xid(void)
new_cluster.pgdata);
check_ok();
}
- else if (new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+ else if (new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER ||
+ slru_changed)
{
/*
* Remove offsets/0000 file created by initdb that no longer matches
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a21dd48c42..2c9350d0fb 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -131,6 +131,10 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+/*
+ * change in SLRU format to add checksums
+ */
+#define SLRU_FORMAT_CHANGE_CAT_VER 201803181
/*
* Each relation is represented by a relinfo structure.
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 20114c4d44..e7b9662764 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -14,6 +14,7 @@
#define SLRU_H
#include "access/xlogdefs.h"
+#include "storage/checksum.h"
#include "storage/lwlock.h"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 3934582efc..d6b15761d8 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201712251
+#define CATALOG_VERSION_NO 201803181
#endif
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index b85f714712..f4bd19b242 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* checksum.h
- * Checksum implementation for data pages.
+ * Checksum implementation for data pages and SLRU pages.
*
* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -21,4 +21,13 @@
*/
extern uint16 pg_checksum_page(char *page, BlockNumber blkno);
+extern uint16 pg_checksum_slru_page(char *page);
+
+extern uint16 pg_getchecksum_slru_page(char *page);
+
+extern void pg_setchecksum_slru_page(char *page);
+
+/* Size of checksum in bytes default 2 bytes (uint16) */
+#define CHKSUMSZ 2
+
#endif /* CHECKSUM_H */
diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h
index bffd061de8..edc9d7e1dc 100644
--- a/src/include/storage/checksum_impl.h
+++ b/src/include/storage/checksum_impl.h
@@ -101,6 +101,7 @@
*/
#include "storage/bufpage.h"
+#include "storage/checksum.h"
/* number of checksums to calculate in parallel */
#define N_SUMS 32
@@ -205,3 +206,48 @@ pg_checksum_page(char *page, BlockNumber blkno)
*/
return (checksum % 65535) + 1;
}
+
+
+#define SLRU_CHECKSUM_UINT16_OFFSET (BLCKSZ / sizeof(uint16) - 1)
+/*
+ * Compute the checksum for a Postgres SLRU page. The page must be aligned on a
+ * 4-byte boundary.
+ *
+ * The checksum save itself to the last 2 bytes (CHKSUMSZ = 2 bytes) of the page
+ */
+uint16
+pg_checksum_slru_page(char *page)
+{
+ uint16 *page_casted = (uint16*) page;
+ uint16 save_checksum;
+ uint32 checksum;
+
+ save_checksum = page_casted[SLRU_CHECKSUM_UINT16_OFFSET];
+ page_casted[SLRU_CHECKSUM_UINT16_OFFSET] = 0;
+
+ checksum = (pg_checksum_block(page, BLCKSZ) % 65535) + 1;
+
+ page_casted[SLRU_CHECKSUM_UINT16_OFFSET] = save_checksum;
+
+ return checksum;
+}
+
+/*
+ * Get the checksum for a Postgres SLRU page.
+ */
+uint16
+pg_getchecksum_slru_page(char *page)
+{
+ uint16 *page_casted = (uint16*) page;
+ return page_casted[SLRU_CHECKSUM_UINT16_OFFSET];
+}
+
+/*
+ * Compute and install the checksum for a Postgres SLRU page.
+ */
+void
+pg_setchecksum_slru_page(char *page)
+{
+ uint16 *page_casted = (uint16*) page;
+ page_casted[SLRU_CHECKSUM_UINT16_OFFSET] = pg_checksum_slru_page(page);
+}
--
2.14.3 (Apple Git-98)
Hi, Ivan!
19 марта 2018 г., в 15:32, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
I was looking into this patch mainly because I was reviewing other checksums patch in different thread. But the purpose of this patch seems viable for me.
After looking into the patch I'd like to propose some editorialization:
I propose adding this patch to commitfest, if you do not object. We will have to add support for online SLRU checksum computation for this pages in spite of "Online enabling of checksums" patch(chances are that it will be committed earlier), so there is some work to do anyway. But I think it is good to keep this patch on the radars of potential reviewers.
Best regards, Andrey Borodin.
Hi, Tomas!
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.
Can you elaborate a bit on how to implement this test. I've searched for some automated pg_upgrade tests but didn't found one.
Should it be one-time test script or something "make check-world"-able?
Best regards, Andrey Borodin.
On Wed, Jul 18, 2018 at 5:41 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.Can you elaborate a bit on how to implement this test. I've searched for some automated pg_upgrade tests but didn't found one.
Should it be one-time test script or something "make check-world"-able?
Hi Andrey,
Like this (also reached by check-world):
$ cd src/bin/pg_upgrade
$ make check
It looks like the interesting bits are in test.sh.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jul 18, 2018 at 5:54 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Jul 18, 2018 at 5:41 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I think we'd want pg_upgrade tests showing an example of each SLRU
growing past one segment, and then being upgraded, and then being
accessed in various different pages and segment files, so that we can
see that we're able to shift the data to the right place successfully.
For example I think I'd want to see that a single aborted transaction
surrounded by many committed transactions shows up in the right place
after an upgrade.Can you elaborate a bit on how to implement this test. I've searched for some automated pg_upgrade tests but didn't found one.
Should it be one-time test script or something "make check-world"-able?
Hmm. This proposal doesn't seem to deal with torn writes. If someone
modifies an 8KB SLRU page and it is partially written out (say,
because your disk has 4KB sectors, and the power cuts out after only
one sector is modified), then during recovery we'll try to read that
block back in and the checksum will be wrong.
The way PostgreSQL usually deals with this problem (and higher level
problems caused by torn writes) is by putting a full page image into
the WAL the first time each page is modified after each checkpoint.
(There are other approaches used by other databases, such as MySQL's
write-two-copies strategy with a barrier in between, since they can't
both be torn, and the problem goes away if your filesystem somehow
magically provides atomic 8KB blocks so you can turn full page writes
off.)
To reuse the existing machinery, in theory I think you'd call
XLogRegisterBuffer() in every place that modifies an SLRU page, and
PageSetLSN() after inserting the WAL. The problem is that these pages
are not in the regular buffer pool and don't have an LSN in the
standard place, so that won't work. I heard about a project to put
SLRUs into the regular buffer pool, but I don't know the status.
Without that I think you might need to invent equivalent machinery
that can register SLRU buffers with xloginsert.c.
To avoid writing full page images for every SLRU page, you'd probably
want to use something like REGBUF_WILL_INIT to skip FPW for pages
you're zero-initialising (eg in ZeroCLOGPage()). I haven't studied
the synchronisation problems lurking there.
--
Thomas Munro
http://www.enterprisedb.com
Hi!
1 авг. 2018 г., в 13:49, Thomas Munro <thomas.munro@enterprisedb.com> написал(а):
Hmm. This proposal doesn't seem to deal with torn writes.
That's true, but it's a bit orthogonal to problem solved with checksums.
Checksums provide way to avoid reading bad page, torn pages - is about preventing writing bad writes.
But adding LSNs, and whole regular PageHeader is quite easy in this patch. Do you think we should really go that way?
Putting SLRUs into usual shared buffers and WAL-logging looks like a good idea, but a lot of work.
Best regards, Andrey Borodin.
On Wed, Aug 01, 2018 at 02:06:44PM +0300, Andrey Borodin wrote:
But adding LSNs, and whole regular PageHeader is quite easy in this
patch. Do you think we should really go that way?Putting SLRUs into usual shared buffers and WAL-logging looks like a good idea, but a lot of work.
Back at PgCon in Ottawa this year, I pitched the idea to moving
components off of SLRU and on to the buffer cache. That idea would
involve adding regular page headers to SLRU blocks, and transitioning
various components like multixact, commit timestamp, clog, subtrans and
others, to being first class residents in the buffer cache. The goal is
to provide performance benefits along with resiliency.
I believe Heikki also mentioned wanting support for stamping LSNs to
ensure durability in clog. I agree with the idea of incorporating the
PageHeader at the beginning of the SLRU page to maintain checksums and
we can iteratively add support for tracking LSNs if necessary.
--
Shawn Debnath
Hi,
On 2018-08-01 10:58:03 -0700, Shawn Debnath wrote:
On Wed, Aug 01, 2018 at 02:06:44PM +0300, Andrey Borodin wrote:
But adding LSNs, and whole regular PageHeader is quite easy in this
patch. Do you think we should really go that way?Putting SLRUs into usual shared buffers and WAL-logging looks like a good idea, but a lot of work.
Back at PgCon in Ottawa this year, I pitched the idea to moving
components off of SLRU and on to the buffer cache.
I believe you also planned to work on that, do I remember that
correctly, or is that just wishful thinking?
That idea would
involve adding regular page headers to SLRU blocks, and transitioning
various components like multixact, commit timestamp, clog, subtrans and
others, to being first class residents in the buffer cache. The goal is
to provide performance benefits along with resiliency.I believe Heikki also mentioned wanting support for stamping LSNs to
ensure durability in clog. I agree with the idea of incorporating the
PageHeader at the beginning of the SLRU page to maintain checksums and
we can iteratively add support for tracking LSNs if necessary.
I'm entirely in agreement with that goal, for the reasons stated above.
Greetings,
Andres Freund
On Wed, Aug 01, 2018 at 11:00:46AM -0700, Andres Freund wrote:
I believe you also planned to work on that, do I remember that
correctly, or is that just wishful thinking?
Yep - I am actively working on this at the moment, planning on having a
proposal out to this list in a week or two.
--
Shawn Debnath
On 2018-08-01 11:14:18 -0700, Shawn Debnath wrote:
On Wed, Aug 01, 2018 at 11:00:46AM -0700, Andres Freund wrote:
I believe you also planned to work on that, do I remember that
correctly, or is that just wishful thinking?Yep - I am actively working on this at the moment, planning on having a
proposal out to this list in a week or two.
Cool!
On Thu, Aug 2, 2018 at 6:17 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-01 11:14:18 -0700, Shawn Debnath wrote:
On Wed, Aug 01, 2018 at 11:00:46AM -0700, Andres Freund wrote:
I believe you also planned to work on that, do I remember that
correctly, or is that just wishful thinking?Yep - I am actively working on this at the moment, planning on having a
proposal out to this list in a week or two.Cool!
+1
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Aug 1, 2018 at 11:06 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
1 авг. 2018 г., в 13:49, Thomas Munro <thomas.munro@enterprisedb.com> написал(а):
Hmm. This proposal doesn't seem to deal with torn writes.That's true, but it's a bit orthogonal to problem solved with checksums.
Checksums provide way to avoid reading bad page, torn pages - is about preventing writing bad writes.
It's a problem if you look at it like this: Without your patch, my
database can recover after power loss. With your patch, a torn SLRU
page can cause recovery to fail. Then my only option is to set
ignore_checksum_failure=on so that my cluster can start up. Without
significant effort I can't tell if the checksum verification failed
because data was arbitrarily corrupted (the reason for this feature to
exist), or because of a torn page (*expected behaviour* on
interruption of a storage system with atomic write size < BLCKSZ).
This may also apply also to online filesystem-level backups.
PostgreSQL only requires atomic writes of 512 bytes (see
PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
approximately 1980-2010, though as far as I know spinning disks made
this decade use 4KB sectors, and for SSDs there is more variation. I
suppose the theory for torn SLRU page safety today is that the
existing SLRU users all have fully independent values that don't cross
sector boundaries, so torn writes can't corrupt them.
--
Thomas Munro
http://www.enterprisedb.com
On 2018-Aug-02, Thomas Munro wrote:
PostgreSQL only requires atomic writes of 512 bytes (see
PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
approximately 1980-2010, though as far as I know spinning disks made
this decade use 4KB sectors, and for SSDs there is more variation. I
suppose the theory for torn SLRU page safety today is that the
existing SLRU users all have fully independent values that don't cross
sector boundaries, so torn writes can't corrupt them.
Hmm, I wonder if this is true for multixact/members. I think it's not
true for either 4kB sectors nor for 512 byte sectors.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-08-01 21:20:22 -0400, Alvaro Herrera wrote:
On 2018-Aug-02, Thomas Munro wrote:
PostgreSQL only requires atomic writes of 512 bytes (see
PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
approximately 1980-2010, though as far as I know spinning disks made
this decade use 4KB sectors, and for SSDs there is more variation. I
suppose the theory for torn SLRU page safety today is that the
existing SLRU users all have fully independent values that don't cross
sector boundaries, so torn writes can't corrupt them.Hmm, I wonder if this is true for multixact/members. I think it's not
true for either 4kB sectors nor for 512 byte sectors.
Hm, why not? Individual entries are 4bytes large and aligned, no? And as
we're solely appending (logicly if not physicly), that should be ok?
Greetings,
Andres Freund
On Thu, Aug 2, 2018 at 1:20 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2018-Aug-02, Thomas Munro wrote:
PostgreSQL only requires atomic writes of 512 bytes (see
PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
approximately 1980-2010, though as far as I know spinning disks made
this decade use 4KB sectors, and for SSDs there is more variation. I
suppose the theory for torn SLRU page safety today is that the
existing SLRU users all have fully independent values that don't cross
sector boundaries, so torn writes can't corrupt them.Hmm, I wonder if this is true for multixact/members. I think it's not
true for either 4kB sectors nor for 512 byte sectors.
Hmm, right, the set of members can span sectors. Let me try that
again. You can cross sector boundaries, but only if you don't require
any kind of multi-sector consistency during replay.
I think the important property for correct operation without FPWs is
that you can't read data from the page itself in order to redo writes
to the page. That rules out whole-page checksum verification, and
probably requires "physical" addressing. By physical addressing I
mean for example that the WAL record that writes member data must know
exactly where to put it on the page without, for example, consulting
the page header or item pointers to data that can move data around
("logical" intra-page addressing). We make the page consistent
incrementally, because each WAL record that writes new members into a
page is concerned with a specific physical part of the page identified
by offset and doesn't care about the rest, and no one should ever try
to read any part of it that hasn't already been made consistent. This
seems OK.
Another way to say it is that FPWs are physical logging of whole pages
(they say how to set every single bit), and WAL for multixacts is a
bit like physical logging of smaller regions of the page. Physical
logging doesn't suffer from torn pages, as long as readers are also
looking stuff up by physical addresses and never trying to read areas
of the page that haven't been written to yet. If you want page-level
checksums, though, the incremental approach won't work.
--
Thomas Munro
http://www.enterprisedb.com