Page Checksums

Started by David Fetter · about 14 years ago · 80 messages
#1David Fetter
david@fetter.org
1 attachment(s)

Folks,

What:

Please find attached a patch for 9.2-to-be which implements page
checksums. It changes the page format, so it's an initdb-forcing
change.

How:
In order to ensure that the checksum actually matches the hint
bits, this makes a copy of the page, calculates the checksum, then
sends the checksum and copy to the kernel, which handles sending
it the rest of the way to persistent storage.

Why:
My employer, VMware, thinks it's a good thing, and has dedicated
engineering resources to it. Lots of people's data is already in
cosmic ray territory, and many others' data will be soon. And
it's a TODO :)

If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are. As far as we've been able
to determine so far, it could expose on-disk corruption that went
undetected before, but we see this as catching a previously
unhandled failure rather than causing a new one.

Questions, comments and bug fixes are, of course, welcome.

Let the flames begin!

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

checksums_20111217_01.patch (text/plain; charset=us-ascii)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..a5c20b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1685,6 +1685,7 @@ SET ENABLE_SEQSCAN TO OFF;
         data corruption, after a system failure. The risks are similar to turning off
         <varname>fsync</varname>, though smaller, and it should be turned off
         only based on the same circumstances recommended for that parameter.
+        This parameter must be on when <varname>page_checksum</varname> is on.
        </para>
 
        <para>
@@ -1701,6 +1702,20 @@ SET ENABLE_SEQSCAN TO OFF;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-page-checksum" xreflabel="page_checksum">
+      <term><varname>page_checksum</varname> (<type>boolean</type>)</term>
+      <indexterm>
+       <primary><varname>page_checksum</varname> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         When this parameter is on, the
+         <productname>PostgreSQL</productname> server writes and
+         checks checksums for each page written to persistent storage.
+        </para>
+       </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)</term>
       <indexterm>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 963189d..0fa5f68 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -314,6 +314,9 @@ extern char *optarg;
 extern int	optind,
 			opterr;
 
+extern bool page_checksum;
+extern bool fullPageWrites;
+
 #ifdef HAVE_INT_OPTRESET
 extern int	optreset;			/* might not be declared by system headers */
 #endif
@@ -766,6 +769,29 @@ PostmasterMain(int argc, char *argv[])
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
 
 	/*
+	 * The idea here is that there will be checksum mismatches if there
+	 * are partial writes to pages during hardware crashes.  The user
+	 * should have full_page_writes enabled if page_checksum is
+	 * enabled so that these pages are automatically fixed, otherwise
+	 * PostgreSQL may get checksum errors after crashes on pages that
+	 * are in fact partially written and hence corrupted.  With
+	 * full_page_writes enabled, PostgreSQL will replace each page
+	 * without ever looking at the partially-written page and seeing
+	 * an incorrect checksum.  Hence, checksums will detect only real
+	 * disk corruptions, i.e. places where the disk reported a
+	 * successful write but the data was still corrupted at some
+	 * point.
+	 *
+	 * Alternatively, we may want to leave this check out.  This would
+	 * be for sophisticated users who have some other guarantee
+	 * (hardware and/or software) against ever producing a partial
+	 * write during crashes.
+	 */
+	if (page_checksum && !fullPageWrites)
+		ereport(ERROR,
+				(errmsg("full_page_writes must be enabled if page_checksum is enabled.")));
+
+	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
 	 */
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4f607cd..1756d62 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -17,6 +17,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/catalog.h"
 #include "commands/tablespace.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
@@ -79,6 +80,12 @@ static const int NSmgr = lengthof(smgrsw);
  */
 static HTAB *SMgrRelationHash = NULL;
 
+/* Page checksumming. */
+static uint64 tempbuf[BLCKSZ/sizeof(uint64)];
+extern bool page_checksum;
+
+#define INVALID_CKSUM 0x1b0af034
+
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
 
@@ -381,6 +388,59 @@ smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 }
 
 /*
+ * The initial value when computing the checksum for a data page.
+ */
+static inline uint64
+ChecksumInit(SMgrRelation reln, ForkNumber f, BlockNumber b)
+{
+	return b + f;
+}
+
+/*
+ * Compute a checksum of a buffer (with length len), using initial
+ * value cksum.  We use a relatively simple checksum calculation to
+ * avoid overhead, but could replace with some kind of CRC
+ * calculation.
+ */
+static inline uint32
+ComputeChecksum(uint64 *buffer, uint32 len, uint64 cksum)
+{
+	int i;
+
+	for (i = 0; i < len/sizeof(uint64); i += 4) {
+		cksum += (cksum << 5) + *buffer;
+		cksum += (cksum << 5) + *(buffer+1);
+		cksum += (cksum << 5) + *(buffer+2);
+		cksum += (cksum << 5) + *(buffer+3);
+		buffer += 4;
+	}
+	cksum = (cksum & 0xFFFFFFFF) + (cksum >> 32);
+	return cksum;
+}
+
+/*
+ * Copy buffer to dst and compute the checksum during the copy (so
+ * that the checksum is correct for the final contents fo dst).
+ */
+static inline uint32
+CopyAndComputeChecksum(uint64 *dst, volatile uint64 *buffer,
+					   uint32 len, uint64 cksum)
+{
+	int i;
+
+	for (i = 0; i < len/sizeof(uint64); i += 4) {
+		cksum += (cksum << 5) + (*dst = *buffer);
+		cksum += (cksum << 5) + (*(dst+1) = *(buffer+1));
+		cksum += (cksum << 5) + (*(dst+2) = *(buffer+2));
+		cksum += (cksum << 5) + (*(dst+3) = *(buffer+3));
+		dst += 4;
+		buffer += 4;
+	}
+	cksum = (cksum & 0xFFFFFFFF) + (cksum >> 32);
+	return cksum;
+}
+
+/*
  *	smgrextend() -- Add a new block to a file.
  *
  *		The semantics are nearly the same as smgrwrite(): write at the
@@ -393,8 +453,25 @@ void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   char *buffer, bool skipFsync)
 {
+	PageHeader p;
+	Assert(PageGetPageLayoutVersion(((PageHeader)buffer)) == PG_PAGE_LAYOUT_VERSION ||
+			PageIsNew(buffer));
+	if (page_checksum) {
+		p = (PageHeader)tempbuf;
+		((PageHeader)buffer)->cksum = 0;
+		/*
+		 * We copy and compute the checksum, and then write out the
+		 * data from the copy to avoid any problem with hint bits
+		 * changing after we compute the checksum.
+		 */
+		p->cksum = CopyAndComputeChecksum(tempbuf, (uint64 *)buffer, BLCKSZ,
+										  ChecksumInit(reln, forknum, blocknum));
+	} else {
+		p = (PageHeader)buffer;
+		p->cksum = INVALID_CKSUM;
+	}
 	(*(smgrsw[reln->smgr_which].smgr_extend)) (reln, forknum, blocknum,
-											   buffer, skipFsync);
+											   (char *)p, skipFsync);
 }
 
 /*
@@ -418,7 +495,29 @@ void
 smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 char *buffer)
 {
+	PageHeader p = (PageHeader) buffer;
 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
+	Assert(PageIsNew(p) || PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION);
+	if (page_checksum && p->cksum != INVALID_CKSUM) {
+		const uint32 diskCksum = p->cksum;
+		uint32 cksum;
+
+		p->cksum = 0;
+		cksum = ComputeChecksum((uint64 *)buffer, BLCKSZ,
+								ChecksumInit(reln, forknum, blocknum));
+		
+		if (cksum != diskCksum) {
+			ereport(PANIC, (0, errmsg("checksum mismatch: disk has %#x, should be %#x\n"
+									  "filename %s, BlockNum %u, block specifier %d/%d/%d/%d/%u",
+									  diskCksum, (uint32)cksum,
+									  relpath(reln->smgr_rnode, forknum),
+									  blocknum,
+									  reln->smgr_rnode.node.spcNode,
+									  reln->smgr_rnode.node.dbNode,
+									  reln->smgr_rnode.node.relNode,
+									  forknum, blocknum)));
+		}
+	}
 }
 
 /*
@@ -440,8 +539,25 @@ void
 smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		  char *buffer, bool skipFsync)
 {
+	PageHeader p;
+
+	if (page_checksum) {
+		p = (PageHeader)tempbuf;
+		((PageHeader)buffer)->cksum = 0;
+		/*
+		 * We copy and compute the checksum, then write out the data
+		 * from the copy so that we avoid any problem with hint bits
+		 * changing after we compute the checksum.
+		 */
+		p->cksum = CopyAndComputeChecksum(tempbuf, (uint64 *)buffer, BLCKSZ,
+										  ChecksumInit(reln, forknum, blocknum));
+	} else {
+		p = (PageHeader)buffer;
+		p->cksum = INVALID_CKSUM;
+	}
+	Assert(PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION);
 	(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
-											  buffer, skipFsync);
+											  (char *)p, skipFsync);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index da7b6d4..332b960 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -419,6 +419,7 @@ bool		default_with_oids = false;
 bool		SQL_inheritance = true;
 
 bool		Password_encryption = true;
+bool		page_checksum = true;
 
 int			log_min_error_statement = ERROR;
 int			log_min_messages = WARNING;
@@ -1438,6 +1439,14 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"page_checksum", PGC_POSTMASTER, CUSTOM_OPTIONS,
+			gettext_noop("Enable disk page checksums."),
+			NULL,
+		},
+		&page_checksum, true, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6a107b9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -167,6 +167,8 @@
 					#   fsync_writethrough
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
+#page_checksum = on			# checksum disk pages
+
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
 					# (change requires restart)
 #wal_writer_delay = 200ms		# 1-10000 milliseconds
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 14e177d..05ae537 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201112071
+#define CATALOG_VERSION_NO	201112141
 
 #endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 42d6b10..847f157 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -132,6 +132,7 @@ typedef struct PageHeaderData
 	LocationIndex pd_special;	/* offset to start of special space */
 	uint16		pd_pagesize_version;
 	TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
+	uint32		cksum;			/* page checksum */
 	ItemIdData	pd_linp[1];		/* beginning of line pointer array */
 } PageHeaderData;
 
@@ -165,8 +166,9 @@ typedef PageHeaderData *PageHeader;
  * Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
  *		added the pd_flags field (by stealing some bits from pd_tli),
  *		as well as adding the pd_prune_xid field (which enlarges the header).
+ *	Release 9.2 uses 5; we added checksums to heap, index and fsm files.
  */
-#define PG_PAGE_LAYOUT_VERSION		4
+#define PG_PAGE_LAYOUT_VERSION		5
 
 
 /* ----------------------------------------------------------------
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: David Fetter (#1)
Re: Page Checksums

On 17.12.2011 23:33, David Fetter wrote:

What:

Please find attached a patch for 9.2-to-be which implements page
checksums. It changes the page format, so it's an initdb-forcing
change.

How:
In order to ensure that the checksum actually matches the hint
bits, this makes a copy of the page, calculates the checksum, then
sends the checksum and copy to the kernel, which handles sending
it the rest of the way to persistent storage.
...
If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are.

Hint bits, torn pages -> failed CRC. See earlier discussion:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3David Fetter
david@fetter.org
In reply to: Heikki Linnakangas (#2)
Re: Page Checksums

On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:

On 17.12.2011 23:33, David Fetter wrote:

What:

Please find attached a patch for 9.2-to-be which implements page
checksums. It changes the page format, so it's an initdb-forcing
change.

How:
In order to ensure that the checksum actually matches the hint
bits, this makes a copy of the page, calculates the checksum, then
sends the checksum and copy to the kernel, which handles sending
it the rest of the way to persistent storage.
...
If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are.

Hint bits, torn pages -> failed CRC. See earlier discussion:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page. Instead, a copy of the page
has already hit storage (in the WAL) before the torn write occurs.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: David Fetter (#3)
Re: Page Checksums

On 18.12.2011 10:54, David Fetter wrote:

On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:

On 17.12.2011 23:33, David Fetter wrote:

If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are.

Hint bits, torn pages -> failed CRC. See earlier discussion:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page.

Doesn't help. Hint bit updates are not WAL-logged.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5David Fetter
david@fetter.org
In reply to: Heikki Linnakangas (#4)
Re: Page Checksums

On Sun, Dec 18, 2011 at 12:19:32PM +0200, Heikki Linnakangas wrote:

On 18.12.2011 10:54, David Fetter wrote:

On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:

On 17.12.2011 23:33, David Fetter wrote:

If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are.

Hint bits, torn pages -> failed CRC. See earlier discussion:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page.

Doesn't help. Hint bit updates are not WAL-logged.

What new failure modes are you envisioning for this case? Any way to
simulate them, even if it's by injecting faults into the source code?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: David Fetter (#5)
Re: Page Checksums

On 18.12.2011 20:44, David Fetter wrote:

On Sun, Dec 18, 2011 at 12:19:32PM +0200, Heikki Linnakangas wrote:

On 18.12.2011 10:54, David Fetter wrote:

On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:

On 17.12.2011 23:33, David Fetter wrote:

If this introduces new failure modes, please detail, and preferably
demonstrate, just what those new modes are.

Hint bits, torn pages -> failed CRC. See earlier discussion:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page.

Doesn't help. Hint bit updates are not WAL-logged.

What new failure modes are you envisioning for this case?

Umm, the one explained in the email I linked to... Let me try once more.
For the sake of keeping the example short, imagine that the PostgreSQL
block size is 8 bytes, and the OS block size is 4 bytes. The CRC is 1
byte, and is stored in the first byte of each page.

In the beginning, a page is in the buffer cache, and it looks like this:

AA 12 34 56 78 9A BC DE

AA is the checksum. Now a hint bit on the last byte is set, so that the
page in the shared buffer cache looks like this:

AA 12 34 56 78 9A BC DF

Now PostgreSQL wants to evict the page from the buffer cache, so it
recalculates the CRC. The page in the buffer cache now looks like this:

BB 12 34 56 78 9A BC DF

Now, PostgreSQL writes the page to the OS cache, with the write() system
call. It sits in the OS cache for a few seconds, and then the OS decides
to flush the first 4 bytes, ie. the first OS block, to disk. On disk,
you now have this:

BB 12 34 56 78 9A BC DE

If the server now crashes, before the OS has flushed the second half of
the PostgreSQL page to disk, you have a classic torn page. The updated
CRC made it to disk, but the hint bit did not. The CRC on disk is not
valid for the rest of the contents of that page on disk.

Without CRCs, that's not a problem because the data is valid whether or
not the hint bit makes it to the disk. It's just a hint, after all. But
when you have a CRC on the page, the CRC is only valid if both the CRC
update *and* the hint bit update makes it to disk, or neither.

So you've just turned an innocent torn page, which PostgreSQL tolerates
just fine, into a block with bad CRC.

Any way to
simulate them, even if it's by injecting faults into the source code?

Hmm, it's hard to persuade the OS to suffer a torn page on purpose. What
you could do is split the write() call in mdwrite() into two. First
write the 1st half of the page, then the second. Then you can put a
breakpoint in between the writes, and kill the system before the 2nd
half is written.
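
To make that concrete, here is a minimal, standalone sketch of that kind
of fault injection (illustrative only -- this is not the real mdwrite()
code; the TORN_PAGE_FAULT environment variable and write_block_torn()
helper are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Write one block in two halves.  If the TORN_PAGE_FAULT environment
 * variable is set, abort after the first half has been written,
 * leaving a torn page on disk.
 */
static void
write_block_torn(int fd, off_t offset, const char *buffer)
{
	if (pwrite(fd, buffer, BLCKSZ / 2, offset) != BLCKSZ / 2)
	{
		perror("pwrite (first half)");
		exit(1);
	}

	if (getenv("TORN_PAGE_FAULT") != NULL)
	{
		fsync(fd);				/* make sure the first half is durable */
		abort();				/* simulate the crash between the halves */
	}

	if (pwrite(fd, buffer + BLCKSZ / 2, BLCKSZ / 2,
			   offset + BLCKSZ / 2) != BLCKSZ / 2)
	{
		perror("pwrite (second half)");
		exit(1);
	}
}

int
main(void)
{
	char		page[BLCKSZ];
	int			fd = open("testblock.dat", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	memset(page, 'x', sizeof(page));
	write_block_torn(fd, 0, page);
	close(fd);
	return 0;
}

Running it once with TORN_PAGE_FAULT set leaves only the first half of
the block on disk, which is exactly the state a whole-page checksum
would flag on the next read.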

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#6)
Re: Page Checksums

On sön, 2011-12-18 at 21:34 +0200, Heikki Linnakangas wrote:

On 18.12.2011 20:44, David Fetter wrote:

Any way to
simulate them, even if it's by injecting faults into the source code?

Hmm, it's hard to persuade the OS to suffer a torn page on purpose. What
you could do is split the write() call in mdwrite() into two. First
write the 1st half of the page, then the second. Then you can put a
breakpoint in between the writes, and kill the system before the 2nd
half is written.

Perhaps the Library-level Fault Injector (http://lfi.sf.net) could be
used to set up a test for this. (Not that I think you need one, but if
David wants to see it happen himself ...)

#8Jesper Krogh
jesper@krogh.cc
In reply to: Heikki Linnakangas (#4)
Re: Page Checksums

On 2011-12-18 11:19, Heikki Linnakangas wrote:

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page.

Doesn't help. Hint bit updates are not WAL-logged.

I don't know if it would be seen as a "half-baked feature" or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just "skipping" the hint bits entirely.

It would still provide checksumming for the majority of the data sitting
underneath the system, and would still be extremely useful in my eyes.

Jesper
--
Jesper

#9Greg Stark
stark@mit.edu
In reply to: Jesper Krogh (#8)
Re: Page Checksums

On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh <jesper@krogh.cc> wrote:

I don't know if it would be seen as a "half-baked feature" or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just "skipping" the hint bits entirely.

That was one approach discussed. The problem is that the hint bits are
currently in each heap tuple header which means the checksum code
would have to know a fair bit about the structure of the page format.
Also the closer people looked the more hint bits kept turning up
because the coding pattern had been copied to other places (the page
header has one, and index pointers have a hint bit indicating that the
target tuple is deleted, etc). And to make matters worse skipping
individual bits in varying places quickly becomes a big consumer of
cpu time since it means injecting logic into each iteration of the
checksum loop to mask out the bits.
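
As a rough illustration of that per-iteration cost, a checksum loop that
skipped hint bits scattered through the page might look like this sketch
(the hint_mask array is hypothetical, and building it would itself
require walking the page's line pointers and tuple headers):

#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192

/*
 * Checksum the page, ignoring any bits that are set in hint_mask.
 * hint_mask must mark every hint bit on the page, which requires
 * understanding the page layout before checksumming.
 */
static uint32_t
checksum_skipping_hints(const uint64_t *page,
						const uint64_t *hint_mask,
						uint64_t cksum)
{
	size_t		i;
	size_t		nwords = BLCKSZ / sizeof(uint64_t);

	for (i = 0; i < nwords; i++)
	{
		/* extra load and AND on every iteration, just to drop hint bits */
		cksum += (cksum << 5) + (page[i] & ~hint_mask[i]);
	}
	return (uint32_t) ((cksum & 0xFFFFFFFF) + (cksum >> 32));
}

Compared with the straight loop in the posted patch, every word costs an
extra load and mask, which is the overhead being described.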

So the general feeling was that we should move all the hint bits to a
dedicated part of the buffer so that they could all be skipped in a
simple way that doesn't depend on understanding the whole structure of
the page. That's not conceptually hard, it's just a fair amount of
work. I think that's where it was left off.

There is another way to look at this problem. Perhaps it's worth
having a checksum *even if* there are ways for the checksum to be
spuriously wrong. Obviously having an invalid checksum can't be a
fatal error then, but it might still be useful information. Right now
people don't really know if their system can experience torn pages or
not and having some way of detecting them could be useful. And if you
have other unexplained symptoms then having checksum errors might be
enough evidence that the investigation should start with the hardware
and get the sysadmin looking at hardware logs and running memtest
sooner.

--
greg

#10Josh Berkus
josh@agliodbs.com
In reply to: Greg Stark (#9)
Re: Page Checksums

On 12/18/11 5:55 PM, Greg Stark wrote:

There is another way to look at this problem. Perhaps it's worth
having a checksum *even if* there are ways for the checksum to be
spuriously wrong. Obviously having an invalid checksum can't be a
fatal error then, but it might still be useful information. Right now
people don't really know if their system can experience torn pages or
not and having some way of detecting them could be useful. And if you
have other unexplained symptoms then having checksum errors might be
enough evidence that the investigation should start with the hardware
and get the sysadmin looking at hardware logs and running memtest
sooner.

Frankly, if I had torn pages, even if it was just hint bits missing, I
would want that to be logged. That's expected if you crash, but if you
start seeing bad CRC warnings when you haven't had a crash? That means
you have a HW problem.

As long as the CRC checks are by default warnings, then I don't see a
problem with this; it's certainly better than what we have now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#11Aidan Van Dyk
aidan@highrise.ca
In reply to: Josh Berkus (#10)
Re: Page Checksums

On Sun, Dec 18, 2011 at 11:21 PM, Josh Berkus <josh@agliodbs.com> wrote:

On 12/18/11 5:55 PM, Greg Stark wrote:

There is another way to look at this problem. Perhaps it's worth
having a checksum *even if* there are ways for the checksum to be
spuriously wrong. Obviously having an invalid checksum can't be a
fatal error then, but it might still be useful information. Right now
people don't really know if their system can experience torn pages or
not and having some way of detecting them could be useful. And if you
have other unexplained symptoms then having checksum errors might be
enough evidence that the investigation should start with the hardware
and get the sysadmin looking at hardware logs and running memtest
sooner.

Frankly, if I had torn pages, even if it was just hint bits missing, I
would want that to be logged.  That's expected if you crash, but if you
start seeing bad CRC warnings when you haven't had a crash?  That means
you have a HW problem.

As long as the CRC checks are by default warnings, then I don't see a
problem with this; it's certainly better than what we have now.

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

So you could have a crash, then a recovery, and a couple clean
shutdown-restart combinations before you happen to read the "needed"
page that was torn in the crash $X [ days | weeks | months ] ago.
It's specifically because PostgreSQL was *DESIGNED* to make torn pages
a non-event (because WAL/FPW fixes anything that's dangerous), that
the whole CRC issue is so complicated...

I'll throw out a few random thoughts (some repeated) that people who
really want the CRC can fight over:

1) Find a way to not bother writing out hint-bit-only-dirty pages....
I know people like Kevin keep recommending a vacuum freeze after a
big load to avoid later problems anyways and I think that's probably
common in big OLAP shops, and OLTP people are likely to have real
changes on the page anyways. Does anybody want to try and measure
what type of performance trade-offs we'd really have on a variety of
"normal" (ya, I know, what's normal) workloads? If the page has a
real change, it's got a WAL FPW, so we avoid the problem....

2) If the writer/checksummer knows it's a hint-bit-only-dirty page,
can it stuff a "cookie" checksum in it and not bother verifying?
Loses a bit of the CRC guarantee, especially around "crashes", which
is when we expect a torn page, but avoids the whole "scary! scary!
Your database is corrupt!" false positives in the situation PostgreSQL
was specifically designed to make not scary.

#) Anybody investigated putting the CRC in a relation fork, but not
right in the data block? If the CRC contains a timestamp, and is WAL
logged before the write, at least on reading a block with a wrong
checksum, if a warning is emitted, the timestamp could be looked at by
whoever is reading the warning and know that the block was written
shortly before the crash $X $PERIODS ago....

The whole "CRC is only a warning" because we "expect to get them if we
ever crashed" means that the time when we most want them, we have to
assume they are bogus... And to make matters worse, we don't even
know when the perioud of "they may be bugus" ends, unless we have a
way to methodically force PG through ever buffer in the database after
the crash... And then that makes them very hard to consider
useful...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#10)
Re: Page Checksums

On Mon, Dec 19, 2011 at 4:21 AM, Josh Berkus <josh@agliodbs.com> wrote:

On 12/18/11 5:55 PM, Greg Stark wrote:

There is another way to look at this problem. Perhaps it's worth
having a checksum *even if* there are ways for the checksum to be
spuriously wrong. Obviously having an invalid checksum can't be a
fatal error then, but it might still be useful information. Right now
people don't really know if their system can experience torn pages or
not and having some way of detecting them could be useful. And if you
have other unexplained symptoms then having checksum errors might be
enough evidence that the investigation should start with the hardware
and get the sysadmin looking at hardware logs and running memtest
sooner.

Frankly, if I had torn pages, even if it was just hint bits missing, I
would want that to be logged.  That's expected if you crash, but if you
start seeing bad CRC warnings when you haven't had a crash?  That means
you have a HW problem.

As long as the CRC checks are by default warnings, then I don't see a
problem with this; it's certainly better than what we have now.

It is an important problem, and also a big one, which is why it still exists.

Throwing WARNINGs for normal events would not help anybody; thousands
of false positives would just make Postgres appear to be less robust
than it really is. That would be a credibility disaster. VMWare
already have their own distro, so if they like this patch they can use
it.

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#13Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#12)
Re: Page Checksums

On Monday, December 19, 2011 12:10:11 PM Simon Riggs wrote:

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

Totally with you that it's not 9.2 material. But I think if somebody actually
wants to implement that, that person would need to start discussing and
implementing rather soon if it should be ready for 9.3. Just because it's not
geared towards the next release doesn't mean it's off-topic.

Andres

#14Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#12)
Re: Page Checksums

On Mon, Dec 19, 2011 at 6:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Throwing WARNINGs for normal events would not help anybody; thousands
of false positives would just make Postgres appear to be less robust
than it really is. That would be a credibility disaster. VMWare
already have their own distro, so if they like this patch they can use
it.

Agreed on all counts.

It seems to me that it would be possible to plug this hole by keeping
track of which pages in shared_buffers have had unlogged changes to
them since the last FPI. When you go to evict such a page, you write
some kind of WAL record for it - either an FPI, or maybe a partial
page image containing just the parts that might have been changed
(like all the tuple headers, or whatever). This would be expensive,
of course.
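
A very rough sketch of that bookkeeping, with entirely made-up names
(BufferDescSketch, hint_dirty_since_fpi, and the logging helper are
illustrative, not existing PostgreSQL structures):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Hypothetical, simplified buffer descriptor for illustration only. */
typedef struct BufferDescSketch
{
	uint32_t	block_no;
	bool		dirty;					/* page has any unwritten change */
	bool		hint_dirty_since_fpi;	/* unlogged change since last FPI */
	char		page[BLCKSZ];
} BufferDescSketch;

/* Stand-in for emitting a WAL record; a real system would append to WAL. */
static void
log_partial_page_image(const BufferDescSketch *buf)
{
	printf("WAL: partial page image for block %u\n", (unsigned) buf->block_no);
}

/* Stand-in for handing the page to the storage manager. */
static void
write_page(const BufferDescSketch *buf)
{
	printf("write block %u\n", (unsigned) buf->block_no);
}

/*
 * Eviction path: if the page carries changes that were never WAL-logged
 * (for example hint bits set since the last full-page image), log enough
 * of the page first so that a torn write can be repaired during recovery,
 * and only then write it out with its checksum.
 */
static void
evict_buffer(BufferDescSketch *buf)
{
	if (buf->dirty && buf->hint_dirty_since_fpi)
	{
		log_partial_page_image(buf);
		buf->hint_dirty_since_fpi = false;
	}
	write_page(buf);
	buf->dirty = false;
}

int
main(void)
{
	BufferDescSketch buf = {.block_no = 42, .dirty = true,
							.hint_dirty_since_fpi = true};

	evict_buffer(&buf);
	return 0;
}

The cost mentioned above is the extra WAL traffic generated on every
eviction of a page whose only changes were never logged.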

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I'm not sure that I understand the dividing line you are drawing here.
However, with respect to the implementation of this particular
feature, it would be nice if we could arrange things so that space
cost of the feature need only be paid by people who are using it. I
think it would be regrettable if everyone had to give up 4 bytes per
page because some people want checksums. Maybe I'll feel differently
if it turns out that the overhead of turning on checksumming is
modest, but that's not what I'm expecting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Stephen Frost
sfrost@snowman.net
In reply to: Aidan Van Dyk (#11)
Re: Page Checksums

* Aidan Van Dyk (aidan@highrise.ca) wrote:

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

Fair enough, but, could we distinguish these two cases? In other words,
would it be possible to detect if a page was torn due to a 'traditional'
crash and not complain in that case, but complain if there's a CRC
failure and it *doesn't* look like a torn page?

Perhaps that's a stretch, but if we can figure out that a page is torn
already, then perhaps it's not so far fetched..

Thanks,

Stephen
(who is no expert on WAL/torn pages/etc)

#16Stephen Frost
sfrost@snowman.net
In reply to: Aidan Van Dyk (#11)
Re: Page Checksums

* Aidan Van Dyk (aidan@highrise.ca) wrote:

#) Anybody investigated putting the CRC in a relation fork, but not
right in the data block? If the CRC contains a timestamp, and is WAL
logged before the write, at least on reading a block with a wrong
checksum, if a warning is emitted, the timestamp could be looked at by
whoever is reading the warning and know that the block was written
shortly before the crash $X $PERIODS ago....

I do like the idea of putting the CRC info in a relation fork, if it can
be made to work decently, as we might be able to then support it on a
per-relation basis, and maybe even avoid the on-disk format change..

Of course, I'm sure there's all kinds of problems with that approach,
but it might be worth some thinking about.

Thanks,

Stephen

#17Alvaro Herrera
alvherre@commandprompt.com
In reply to: Stephen Frost (#16)
Re: Page Checksums

Excerpts from Stephen Frost's message of lun dic 19 11:18:21 -0300 2011:

* Aidan Van Dyk (aidan@highrise.ca) wrote:

#) Anybody investigated putting the CRC in a relation fork, but not
right in the data block? If the CRC contains a timestamp, and is WAL
logged before the write, at least on reading a block with a wrong
checksum, if a warning is emitted, the timestamp could be looked at by
whoever is reading the warning and know that the block was written
shortly before the crash $X $PERIODS ago....

I do like the idea of putting the CRC info in a relation fork, if it can
be made to work decently, as we might be able to then support it on a
per-relation basis, and maybe even avoid the on-disk format change..

Of course, I'm sure there's all kinds of problems with that approach,
but it might be worth some thinking about.

I think the main objection to that idea was that if you lose a single
page of CRCs you have hundreds of data pages which no longer have good
CRCs.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#18Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#15)
Re: Page Checksums

On Mon, Dec 19, 2011 at 9:14 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Aidan Van Dyk (aidan@highrise.ca) wrote:

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

Fair enough, but, could we distinguish these two cases?  In other words,
would it be possible to detect if a page was torn due to a 'traditional'
crash and not complain in that case, but complain if there's a CRC
failure and it *doesn't* look like a torn page?

No.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19David Fetter
david@fetter.org
In reply to: Robert Haas (#18)
Re: Page Checksums

On Mon, Dec 19, 2011 at 09:34:51AM -0500, Robert Haas wrote:

On Mon, Dec 19, 2011 at 9:14 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Aidan Van Dyk (aidan@highrise.ca) wrote:

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

Fair enough, but, could we distinguish these two cases?  In other words,
would it be possible to detect if a page was torn due to a 'traditional'
crash and not complain in that case, but complain if there's a CRC
failure and it *doesn't* look like a torn page?

No.

Would you be so kind as to elucidate this a bit?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#20Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#17)
Re: Page Checksums

On Monday, December 19, 2011 03:33:22 PM Alvaro Herrera wrote:

Excerpts from Stephen Frost's message of lun dic 19 11:18:21 -0300 2011:

* Aidan Van Dyk (aidan@highrise.ca) wrote:

#) Anybody investigated putting the CRC in a relation fork, but not
right in the data block? If the CRC contains a timestamp, and is WAL
logged before the write, at least on reading a block with a wrong
checksum, if a warning is emitted, the timestamp could be looked at by
whoever is reading the warning and know that the block was written
shortly before the crash $X $PERIODS ago....

I do like the idea of putting the CRC info in a relation fork, if it can
be made to work decently, as we might be able to then support it on a
per-relation basis, and maybe even avoid the on-disk format change..

Of course, I'm sure there's all kinds of problems with that approach,
but it might be worth some thinking about.

I think the main objection to that idea was that if you lose a single
page of CRCs you have hundreds of data pages which no longer have good
CRCs.

Which I find a pretty non-argument because there is lots of SPOF data in a
cluster (WAL, control record) anyway...
If recent data starts to fail you have to restore from backup anyway.

Andres

#21Stephen Frost
sfrost@snowman.net
In reply to: David Fetter (#19)
Re: Page Checksums

* David Fetter (david@fetter.org) wrote:

On Mon, Dec 19, 2011 at 09:34:51AM -0500, Robert Haas wrote:

On Mon, Dec 19, 2011 at 9:14 AM, Stephen Frost <sfrost@snowman.net> wrote:

Fair enough, but, could we distinguish these two cases?  In other words,
would it be possible to detect if a page was torn due to a 'traditional'
crash and not complain in that case, but complain if there's a CRC
failure and it *doesn't* look like a torn page?

No.

Would you be so kind as to elucidate this a bit?

I'm guessing, based on some discussion on IRC, that it's because we
don't really 'detect' torn pages today, when it's due to a hint-bit-only
update. All the trouble around hint-bit updates, and whether or not
they're written out, makes me wish we could just avoid writing
hint-bit-only updates to disk somehow, or log them when we do. Both of
those have their own drawbacks, of course.

Thanks,

Stephen

#22Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#20)
Re: Page Checksums

* Andres Freund (andres@anarazel.de) wrote:

On Monday, December 19, 2011 03:33:22 PM Alvaro Herrera wrote:

I do like the idea of putting the CRC info in a relation fork, if it can
be made to work decently, as we might be able to then support it on a
per-relation basis, and maybe even avoid the on-disk format change..

I think the main objection to that idea was that if you lose a single
page of CRCs you have hundreds of data pages which no longer have good
CRCs.

Which I find a pretty non-argument because there is lots of SPOF data in a
cluster (WAL, control record) anyway...
If recent data starts to fail you have to restore from backup anyway.

I agree with Andres on this one. Also, if we CRC the pages of the
CRC fork itself, hopefully we'd be able to detect when a bad block impacted the
CRC fork and differentiate that from a whole slew of bad blocks in the
heap.

There might be an issue there with locking, since going through the
page-level lock on a CRC page effectively covers a lot more pages in
the heap and therefore reduces scalability. Don't we have a similar
issue with the visibility map, though?

Thanks,

Stephen

#23Greg Smith
greg@2ndQuadrant.com
In reply to: Robert Haas (#14)
Re: Page Checksums

On 12/19/2011 07:50 AM, Robert Haas wrote:

On Mon, Dec 19, 2011 at 6:10 AM, Simon Riggs<simon@2ndquadrant.com> wrote:

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I'm not sure that I understand the dividing line you are drawing here.

There are three likely steps to reaching checksums:

1) Build a checksum mechanism into the database. This is the
straightforward part that multiple people have now done.

2) Rework hint bits to make the torn page problem go away. Checksums go
elsewhere? More WAL logging to eliminate the bad situations? Eliminate
some types of hint bit writes? It seems every alternative has
trade-offs that will require serious performance testing to really validate.

3) Finally tackle in-place upgrades that include a page format change.
One basic mechanism was already outlined: a page converter that knows
how to handle two page formats, some metadata to track which pages have
been converted, a daemon to do background conversions. Simon has some
new ideas here too ("online upgrade" involves two clusters kept in sync
on different versions, slightly different concept than the current
"in-place upgrade"). My recollection is that the in-place page upgrade
work was pushed out of the critical path before due to lack of immediate
need. It wasn't necessary until a) a working catalog upgrade tool was
validated and b) a bite-size feature change to test it on appeared. We
have (a) now in pg_upgrade, and CRCs could be (b)--if the hint bit
issues are sorted first.

What Simon was saying is that he's got some interest in (3), but wants
no part of (2).

I don't know how much time each of these will take. I would expect that
(2) and (3) have similar scopes though--many days, possibly a few
months, of work--which means they both dwarf (1). The part that's been
done is the visible tip of a mostly underwater iceberg.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#24Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#23)
Re: Page Checksums

Greg Smith <greg@2ndQuadrant.com> wrote:

2) Rework hint bits to make the torn page problem go away.
Checksums go elsewhere? More WAL logging to eliminate the bad
situations? Eliminate some types of hint bit writes? It seems
every alternative has trade-offs that will require serious
performance testing to really validate.

I'm wondering whether we're not making a mountain out of a mole-hill
here. In real life, on one single crash, how many torn pages with
hint-bit-only updates do we expect on average? What's the maximum
possible? In the event of a crash recovery, can we force all tables
to be seen as needing autovacuum? Would there be a way to limit
this to some subset which *might* have torn pages somehow?

It seems to me that on a typical production system you would
probably have zero or one such page per OS crash, with zero being
far more likely than one. If we can get that one fixed (if it
exists) before enough time has elapsed for everyone to forget the OS
crash, the idea that we would be scaring the users and negatively
affecting the perception of reliability seems far-fetched. The fact
that they can *have* page checksums in PostgreSQL should do a lot to
*enhance* the PostgreSQL reputation for reliability in some circles,
especially those getting pounded with FUD from competing products.
If a site has so many OS or hardware failures that they lose track
-- well, they really should be alarmed.

Of course, the fact that you may hit such a torn page in a situation
where all data is good means that it shouldn't be more than a
warning.

This seems as though it eliminates most of the work people have been
suggesting as necessary, and makes the submitted patch fairly close
to what we want.

-Kevin

#25Robert Haas
robertmhaas@gmail.com
In reply to: David Fetter (#19)
Re: Page Checksums

On Mon, Dec 19, 2011 at 12:07 PM, David Fetter <david@fetter.org> wrote:

On Mon, Dec 19, 2011 at 09:34:51AM -0500, Robert Haas wrote:

On Mon, Dec 19, 2011 at 9:14 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Aidan Van Dyk (aidan@highrise.ca) wrote:

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

Fair enough, but, could we distinguish these two cases?  In other words,
would it be possible to detect if a page was torn due to a 'traditional'
crash and not complain in that case, but complain if there's a CRC
failure and it *doesn't* look like a torn page?

No.

Would you be so kind as to elucidate this a bit?

Well, basically, Stephen's proposal was pure hand-waving. :-)

I don't know of any magic trick that would allow us to know whether a
CRC failure "looks like a torn page". The only information we're
going to get is the knowledge of whether the CRC matches or not. If
it doesn't, it's fundamentally impossible for us to know why. We know
the page contents are not as expected - that's it!

It's been proposed before that we could examine the page, consider all
the unset hint bits that could be set, and try all combinations of
setting and clearing them to see whether any of them produce a valid
CRC. But, as Tom has pointed out previously, that has a really quite
large chance of making a page that's *actually* been corrupted look
OK. If you have 30 or so unset hint bits, odds are very good that
some combination will produce the 32-bit CRC you're expecting.
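
For illustration, the brute-force idea looks roughly like the sketch
below (toy_checksum() stands in for whatever page checksum is actually
used; the byte offsets and bit masks of candidate hint bits are supplied
by the caller):

#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Toy 32-bit checksum standing in for the real page checksum. */
static uint32_t
toy_checksum(const unsigned char *buf, size_t len)
{
	uint64_t	cksum = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		cksum += (cksum << 5) + buf[i];
	return (uint32_t) ((cksum & 0xFFFFFFFF) + (cksum >> 32));
}

/*
 * Try every combination of flipping the candidate hint bits (byte
 * offset / bit mask pairs) and report whether any combination matches
 * the expected checksum.  With n candidate bits this is 2^n attempts,
 * and a genuinely corrupted page still has a nontrivial chance that
 * some combination matches by accident, which is the danger described
 * above.
 */
static bool
some_combination_matches(const unsigned char *page,
						 const size_t *offsets, const unsigned char *masks,
						 int nbits, uint32_t expected)
{
	unsigned char scratch[BLCKSZ];
	uint64_t	combo;
	int			i;

	for (combo = 0; combo < ((uint64_t) 1 << nbits); combo++)
	{
		memcpy(scratch, page, BLCKSZ);
		for (i = 0; i < nbits; i++)
		{
			if (combo & ((uint64_t) 1 << i))
				scratch[offsets[i]] ^= masks[i];
		}
		if (toy_checksum(scratch, BLCKSZ) == expected)
			return true;
	}
	return false;
}

With 30 candidate bits that is around a billion attempts against a
32-bit checksum, so accidental matches on genuinely corrupted pages are
a real possibility.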

To put this another way, we currently WAL-log just about everything.
We get away with NOT WAL-logging some things when we don't care about
whether they make it to disk. Hint bits, killed index tuple pointers,
etc. cause no harm if they don't get written out, even if some other
portion of the same page does get written out. But as soon as you CRC
the whole page, now absolutely every single bit on that page becomes
critical data which CANNOT be lost. IOW, it now requires the same
sort of protection that we already need for our other critical updates
- i.e. WAL logging. Or you could introduce some completely new
mechanism that serves the same purpose, like MySQL's double-write
buffer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#24)
Re: Page Checksums

On Mon, Dec 19, 2011 at 2:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

It seems to me that on a typical production system you would
probably have zero or one such page per OS crash, with zero being
far more likely than one.  If we can get that one fixed (if it
exists) before enough time has elapsed for everyone to forget the OS
crash, the idea that we would be scaring the users and negatively
affecting the perception of reliability seems far-fetched.

The problem is that you can't "fix" them. If you come to a page with
a bad CRC, you only have two choices: take it seriously, or don't. If
you take it seriously, then you're complaining about something that
may be completely benign. If you don't take it seriously, then you're
ignoring something that may be a sign of data corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#26)
Re: Page Checksums

Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 19, 2011 at 2:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

It seems to me that on a typical production system you would
probably have zero or one such page per OS crash, with zero being
far more likely than one. If we can get that one fixed (if it
exists) before enough time has elapsed for everyone to forget the
OS crash, the idea that we would be scaring the users and
negatively affecting the perception of reliability seems
far-fetched.

The problem is that you can't "fix" them. If you come to a page
with a bad CRC, you only have two choices: take it seriously, or
don't. If you take it seriously, then you're complaining about
something that may be completely benign. If you don't take it
seriously, then you're ignoring something that may be a sign of
data corruption.

I was thinking that we would warn when such was found, set hint bits
as needed, and rewrite with the new CRC. In the unlikely event that
it was a torn hint-bit-only page update, it would be a warning about
something which is a benign side-effect of the OS or hardware crash.
The argument was that it could happen months later, and people
might not remember the crash. My response to that is: don't let it
wait that long. By forcing a vacuum of all possibly-affected tables
(or all tables if the there's no way to rule any of them out), you
keep it within recent memory.

Of course, it would also make sense to document that such an error
after an OS or hardware crash might be benign or may indicate data
corruption or data loss, and give advice on what to do. There is
obviously no way for PostgreSQL to automagically "fix" real
corruption flagged by a CRC failure, under any circumstances.
There's also *always" a possibility that CRC error is a false
positive -- if only the bytes in the CRC were damaged. We're
talking quantitative changes here, not qualitative.

I'm arguing that the extreme measures suggested to achieve the
slight quantitative improvements are likely to cause more problems
than they solve. A better use of resources to improve the false
positive numbers would be to be more aggressive about setting hint
bits -- perhaps when a page is written with any tuples with
transaction IDs before the global xmin, the hint bits should be set
and the CRC calculated before write, for example. (But that would
be a different patch.)

-Kevin

#28Greg Smith
greg@2ndQuadrant.com
In reply to: Kevin Grittner (#27)
Re: Page Checksums

On 12/19/2011 02:44 PM, Kevin Grittner wrote:

I was thinking that we would warn when such was found, set hint bits
as needed, and rewrite with the new CRC. In the unlikely event that
it was a torn hint-bit-only page update, it would be a warning about
something which is a benign side-effect of the OS or hardware crash.
The argument was that it could happen months later, and people
might not remember the crash. My response to that is: don't let it
wait that long. By forcing a vacuum of all possibly-affected tables
(or all tables if the there's no way to rule any of them out), you
keep it within recent memory.

Cleanup that requires a potentially unbounded-in-size VACUUM to sort out
doesn't sound like a great path to wander down. Ultimately any CRC
implementation is going to want a "scrubbing" feature like those found
in RAID arrays, one that wanders through all database pages
looking for literal bitrot. And pushing in priority requests for things
to check to the top of its queue may end up being a useful feature
there. But if you need all that infrastructure just to get the feature
launched, that's a bit hard to stomach.

Also, as someone who follows Murphy's Law as my chosen religion, I would
expect this situation could be exactly how flaky hardware would first
manifest itself: server crash and a bad CRC on the last thing written
out. And in that case, the last thing you want to do is assume things
are fine, then kick off a VACUUM that might overwrite more good data
with bad. The sort of bizarre, "that should never happen" cases are the
ones I'd most like to see more protection against, rather than excusing
them and going on anyway.

There's also *always" a possibility that CRC error is a false
positive -- if only the bytes in the CRC were damaged. We're
talking quantitative changes here, not qualitative.

The main way I expect to validate this sort of thing is with an as yet
unwritten function to grab information about a data block from a standby
server for this purpose, something like this:

Master: Computed CRC A, Stored CRC B; error raised because A!=B
Standby: Computed CRC C, Stored CRC D

If C==D && A==C, the corruption is probably overwritten bits of the CRC B.
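
Expressed as code, that decision rule is tiny (a sketch only; the
function name is made up):

#include <stdint.h>
#include <stdbool.h>

/*
 * Given the computed/stored CRCs from the master (A/B) and from a
 * standby's copy of the same block (C/D), guess whether the master's
 * failure is just a damaged stored CRC rather than damaged data.
 */
static bool
failure_probably_in_stored_crc(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
	return (c == d) && (a == c) && (a != b);
}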

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#25)
Re: Page Checksums

On 19.12.2011 21:27, Robert Haas wrote:

To put this another way, we currently WAL-log just about everything.
We get away with NOT WAL-logging some things when we don't care about
whether they make it to disk. Hint bits, killed index tuple pointers,
etc. cause no harm if they don't get written out, even if some other
portion of the same page does get written out. But as soon as you CRC
the whole page, now absolutely every single bit on that page becomes
critical data which CANNOT be lost. IOW, it now requires the same
sort of protection that we already need for our other critical updates
- i.e. WAL logging. Or you could introduce some completely new
mechanism that serves the same purpose, like MySQL's double-write
buffer.

Double-writes would be a useful option also to reduce the size of WAL
that needs to be shipped in replication.

Or you could just use a filesystem that does CRCs...
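
For reference, the double-write idea mentioned above works roughly like
this sketch (illustrative only; file handling, naming, and recovery are
all simplified away, and a real implementation would batch writes and
recycle the double-write area):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Double-write sketch: the page is first written and fsynced to a
 * fixed "double-write" area, and only then written to its real
 * location.  After a crash, any torn page at the real location can be
 * repaired from the double-write copy.
 */
static int
double_write_block(int dw_fd, int data_fd, off_t offset, const char *page)
{
	/* 1. durable copy into the double-write area */
	if (pwrite(dw_fd, page, BLCKSZ, 0) != BLCKSZ)
		return -1;
	if (fsync(dw_fd) != 0)
		return -1;

	/* 2. now the real write; a tear here is recoverable from step 1 */
	if (pwrite(data_fd, page, BLCKSZ, offset) != BLCKSZ)
		return -1;
	return 0;
}

On crash recovery, any block whose checksum fails at its real location
can be restored from the double-write area, which is what removes the
need for full-page images in WAL for torn-page protection.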

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#30Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#28)
Re: Page Checksums

Greg Smith <greg@2ndQuadrant.com> wrote:

But if you need all that infrastructure just to get the feature
launched, that's a bit hard to stomach.

Triggering a vacuum or some hypothetical "scrubbing" feature?

Also, as someone who follows Murphy's Law as my chosen religion,

If you don't think I pay attention to Murphy's Law, I should recap
our backup procedures -- which involve three separate forms of
backup, each to multiple servers in different buildings, real-time,
plus idle-time comparison of the databases of origin to all replicas
with reporting of any discrepancies. And off-line "snapshot"
backups on disk at a records center controlled by a different
department. That's in addition to RAID redundancy and hardware
health and performance monitoring. Some people think I border on
the paranoid on this issue.

I would expect this situation could be exactly how flaky hardware
would first manifest itself: server crash and a bad CRC on the
last thing written out. And in that case, the last thing you want
to do is assume things are fine, then kick off a VACUUM that might
overwrite more good data with bad.

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue. And certainly there are way too many people who don't ensure
that they have a good backup before firing up PostgreSQL after a
failure, so I can see not making autovacuum more aggressive than
usual, and perhaps even disabling it until there is some sort of
confirmation (I have no idea how) that a backup has been made. That
said, a database VACUUM would be one of my first steps after
ensuring that I had a copy of the data directory tree, personally.
I guess I could even live with that as recommended procedure rather
than something triggered through autovacuum and not feel that the
rest of my posts on this are too far off track.

The main way I expect to validate this sort of thing is with an as
yet unwritten function to grab information about a data block from
a standby server for this purpose, something like this:

Master: Computed CRC A, Stored CRC B; error raised because A!=B
Standby: Computed CRC C, Stored CRC D

If C==D && A==C, the corruption is probably overwritten bits of
the CRC B.

Are you arguing we need *that* infrastructure to get the feature
launched?

-Kevin
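
Purely for illustration, Kevin's cross-check could be written as a decision
function like the one below; the first non-trivial branch is the rule he
states, the remaining branches are extrapolations of the same idea, and how
the standby's values get fetched is left to the hypothetical infrastructure
under discussion.

#include <stdint.h>

/*
 * A = checksum computed on the master, B = checksum stored on the master,
 * C = checksum computed on the standby, D = checksum stored on the standby.
 * Only the first mismatch branch comes from the thread; the others are
 * extrapolations for illustration.
 */
static const char *
classify_crc_failure(uint32_t A, uint32_t B, uint32_t C, uint32_t D)
{
    if (A == B)
        return "no mismatch on the master; nothing to diagnose";

    if (C == D && A == C)
        return "standby is clean and agrees with the master's data: the "
               "corruption is probably overwritten bits of the stored CRC B";

    if (C == D && B == D)
        return "standby is clean and agrees with the master's stored CRC: "
               "the master's page data itself is the likely victim";

    if (C != D)
        return "standby fails its own check too: the damage may predate "
               "replication, or both copies are bad";

    return "inconclusive; manual comparison of the two pages needed";
}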

#31Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#27)
Re: Page Checksums

On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

I was thinking that we would warn when such was found, set hint bits
as needed, and rewrite with the new CRC.  In the unlikely event that
it was a torn hint-bit-only page update, it would be a warning about
something which is a benign side-effect of the OS or hardware crash.

But that's terrible. Surely you don't want to tell people:

WARNING: Your database is corrupted, or maybe not. But don't worry,
I modified the data block so that you won't get this warning again.

OK, I guess I'm not sure that you don't want to tell people that. But
*I* don't!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Christopher Browne
cbbrowne@gmail.com
In reply to: Robert Haas (#31)
Re: Page Checksums

On Tue, Dec 20, 2011 at 8:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

I was thinking that we would warn when such was found, set hint bits
as needed, and rewrite with the new CRC.  In the unlikely event that
it was a torn hint-bit-only page update, it would be a warning about
something which is a benign side-effect of the OS or hardware crash.

But that's terrible.  Surely you don't want to tell people:

WARNING:  Your database is corrupted, or maybe not.  But don't worry,
I modified the data block so that you won't get this warning again.

OK, I guess I'm not sure that you don't want to tell people that.  But
*I* don't!

This seems to be a frequent problem with this whole "doing CRCs on pages" thing.

It's not evident which problems will be "real" ones. And in such
cases, is the answer to turf the database and recover from backup,
because of a single busted page? For a big database, I'm not sure
that's less scary than the possibility of one page having a
corruption.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#33Alvaro Herrera
alvherre@commandprompt.com
In reply to: Christopher Browne (#32)
Re: Page Checksums

Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56 -0300 2011:

It's not evident which problems will be "real" ones. And in such
cases, is the answer to turf the database and recover from backup,
because of a single busted page? For a big database, I'm not sure
that's less scary than the possibility of one page having a
corruption.

I don't think the problem is having one page of corruption. The problem
is *not knowing* that random pages are corrupted, and living in the fear
that they might be.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#34Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#31)
Re: Page Checksums

Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

I was thinking that we would warn when such was found, set hint
bits as needed, and rewrite with the new CRC. In the unlikely
event that it was a torn hint-bit-only page update, it would be a
warning about something which is a benign side-effect of the OS
or hardware crash.

But that's terrible. Surely you don't want to tell people:

WARNING: Your database is corrupted, or maybe not. But don't
worry, I modified the data block so that you won't get this
warning again.

OK, I guess I'm not sure that you don't want to tell people that.
But *I* don't!

Well, I would certainly change that to comply with standard message
style guidelines. ;-)

But the alternatives I've heard so far bother me more. It sounds
like the most-often suggested alternative is:

ERROR (or stronger?): page checksum failed in relation 999 page 9
DETAIL: This may not actually affect the validity of any tuples,
since it could be a flipped bit in the checksum itself or dead
space, but we're shutting you down just in case.
HINT: You won't be able to read anything on this page, even if it
appears to be well-formed, without stopping your database and using
some arcane tool you've never heard of before to examine and
hand-modify the page. Any query which accesses this table may fail
in the same way.

The warning-level message would be followed by something more severe
if the page or a needed tuple is mangled badly enough that it cannot
be used. I guess the biggest risk here is that there is real damage
to data which doesn't generate a stronger response, and the users
are ignoring warning messages. I'm not sure what to do about that,
but the above error doesn't seem like the right solution.

Assuming we do something about the "torn page on hint-bit only
write" issue, by moving the hint bits to somewhere else or logging
their writes, what would you suggest is the right thing to do when a
page is read with a checksum which doesn't match page contents?

-Kevin

#35Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Alvaro Herrera (#33)
Re: Page Checksums

Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56 -0300 2011:

It's not evident which problems will be "real" ones. And in such
cases, is the answer to turf the database and recover from
backup, because of a single busted page? For a big database, I'm
not sure that's less scary than the possibility of one page
having a corruption.

I don't think the problem is having one page of corruption. The
problem is *not knowing* that random pages are corrupted, and
living in the fear that they might be.

What would you want the server to do when a page with a mismatching
checksum is read?

-Kevin

#36Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#35)
Re: Page Checksums

On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:

Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56 -0300 2011:

It's not evident which problems will be "real" ones. And in such
cases, is the answer to turf the database and recover from
backup, because of a single busted page? For a big database, I'm
not sure that's less scary than the possibility of one page
having a corruption.

I don't think the problem is having one page of corruption. The
problem is *not knowing* that random pages are corrupted, and
living in the fear that they might be.

What would you want the server to do when a page with a mismatching
checksum is read?

Follow the behaviour of zero_damaged_pages.

Andres

#37Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#12)
Re: Page Checksums

On Mon, Dec 19, 2011 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I've had another look at this just to make sure.

Doing this for 9.2 will change the page format, causing every user to
do an unload/reload, with no provided mechanism to do that, whether or
not they use this feature.

If we do that, the hints are still all in the wrong places, meaning that
setting any hint will require changing the CRC.

Currently, setting hints can be done while holding a share lock on the
buffer. Preventing that would require us to change the way buffer
manager works to make it take an exclusive lock while writing out,
since a hint would change the CRC and so allowing hints to be set
while we write out would cause invalid CRCs. So we would need to hold
exclusive lock on buffers while we calculate CRCs.

Overall, this will cause a much bigger performance hit than we planned
for. But then we have SSI as an option, so why not this?

So, do we have enough people in the house that are willing to back
this idea, even with a severe performance hit? Are we willing to
change the page format now, with plans to change it again in the
future? Are we willing to change the page format for a feature many
people will need to disable anyway? Do we have people willing to spend
time measuring the performance in enough cases to allow educated
debate?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#38Aidan Van Dyk
aidan@highrise.ca
In reply to: Kevin Grittner (#35)
Re: Page Checksums

On Tue, Dec 20, 2011 at 12:38 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

I don't think the problem is having one page of corruption.  The
problem is *not knowing* that random pages are corrupted, and
living in the fear that they might be.

What would you want the server to do when a page with a mismatching
checksum is read?

But that's exactly the problem. I don't know what I want the server
to do, because I don't know if the page with the checksum mismatch is
one of the 10GB of pages in the page cache that were dirty and poses 0
risk (i.e. hint-bit only changes made it dirty), a page that was
really messed up on the kernel panic that last happened causing this
whole mess, or an even older page that really is giving bitrot...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#36)
Re: Page Checksums

Andres Freund <andres@anarazel.de> writes:

On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:

What would you want the server to do when a page with a mismatching
checksum is read?

Follow the behaviour of zero_damaged_pages.

Surely not. Nobody runs with zero_damaged_pages turned on in
production; or at least, nobody with any semblance of a clue.

regards, tom lane

#40Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#39)
Re: Page Checksums

On Tuesday, December 20, 2011 07:08:56 PM Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:

What would you want the server to do when a page with a mismatching
checksum is read?

Follow the behaviour of zero_damaged_pages.

Surely not. Nobody runs with zero_damaged_pages turned on in
production; or at least, nobody with any semblance of a clue.

That's my point. There is no automated solution for page errors. So it should
ERROR (not PANIC) out in normal operation and be "fixable" via
zero_damaged_pages.
I personally wouldn't even have a problem making zero_damaged_pages only
applicable in single-backend mode.

Andres
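
Spelled out, "ERROR in normal operation, fixable via zero_damaged_pages" would
look roughly like the standalone sketch below; the function, message texts,
and GUC plumbing are simplified stand-ins for illustration, not the actual
bufmgr code.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

/* Stand-in for the real GUC; off by default, as today. */
static bool zero_damaged_pages = false;

static void
react_to_checksum_mismatch(char *page, const char *relname, unsigned blkno)
{
    if (zero_damaged_pages)
    {
        /*
         * The administrator has explicitly accepted data loss: warn, zero
         * the page, and let the scan continue.
         */
        fprintf(stderr,
                "WARNING: invalid checksum in block %u of \"%s\"; zeroing page\n",
                blkno, relname);
        memset(page, 0, BLCKSZ);
    }
    else
    {
        /*
         * Normal operation: refuse to use the page.  ERROR aborts the
         * query that hit it, but the server keeps running (no PANIC).
         */
        fprintf(stderr,
                "ERROR: invalid checksum in block %u of \"%s\"\n",
                blkno, relname);
        exit(EXIT_FAILURE);   /* stand-in for aborting the current query */
    }
}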

#41Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#37)
Re: Page Checksums

On Tuesday, December 20, 2011 06:44:48 PM Simon Riggs wrote:

Currently, setting hints can be done while holding a share lock on the
buffer. Preventing that would require us to change the way buffer
manager works to make it take an exclusive lock while writing out,
since a hint would change the CRC and so allowing hints to be set
while we write out would cause invalid CRCs. So we would need to hold
exclusive lock on buffers while we calculate CRCs.

While hint bits are a problem, that specific problem is actually handled by
copying the buffer into a separate buffer and calculating the CRC on that copy.
Given that we already rely on the fact that the flags can be read consistently
from the individual backends, that's fine.

Andres
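
The copy-then-checksum write path Andres describes can be sketched like this;
the checksum routine and the field offset are placeholders for whatever the
real page layout and algorithm end up being.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ          8192
#define CHECKSUM_OFFSET 8     /* illustrative position of the checksum field */

/* Placeholder checksum; a real implementation would use CRC32 or similar. */
static uint32_t
checksum_bytes(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (sum << 1 | sum >> 31) ^ data[i];
    return sum;
}

/*
 * Hint bits may still be flipped in the shared buffer under a share lock
 * while this runs, but they cannot change in our private copy, so the CRC
 * we store always matches the bytes handed to the kernel.
 */
static void
prepare_page_for_write(const unsigned char *shared_page, unsigned char *copy)
{
    uint32_t crc;

    /* 1. Snapshot the shared buffer into backend-local memory. */
    memcpy(copy, shared_page, BLCKSZ);

    /* 2. Checksum the copy with the checksum field itself zeroed out. */
    memset(copy + CHECKSUM_OFFSET, 0, sizeof(crc));
    crc = checksum_bytes(copy, BLCKSZ);

    /* 3. Store the CRC in the copy; the copy is what goes to disk. */
    memcpy(copy + CHECKSUM_OFFSET, &crc, sizeof(crc));
}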

#42Jesper Krogh
jesper@krogh.cc
In reply to: Simon Riggs (#37)
Re: Page Checksums

On 2011-12-20 18:44, Simon Riggs wrote:

On Mon, Dec 19, 2011 at 11:10 AM, Simon Riggs<simon@2ndquadrant.com> wrote:

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I've had another look at this just to make sure.

Doing this for 9.2 will change the page format, causing every user to
do an unload/reload, with no provided mechanism to do that, whether or
not they use this feature.

How about only calculating the checksum and setting it in the bgwriter,
just before flushing the buffer off to disk?

Perhaps even let autovacuum do the same if it flushes pages to disk as
part of the process.

If someone comes along and sets a hint bit, changes data, etc., its only
job is to clear the checksum to a value meaning "we don't have a checksum
for this page".

Unless the bgwriter becomes bottlenecked by doing it, the impact on
"foreground" work should be fairly limited.

Jesper .. just throwing in random thoughts ..
--
Jesper
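
Reduced to its essentials, Jesper's scheme needs only a sentinel value and a
couple of hooks; every name below is made up for the sketch, and compute() is
assumed to skip the checksum field itself.

#include <stdbool.h>
#include <stdint.h>

#define NO_CHECKSUM ((uint16_t) 0)   /* illustrative "no checksum stored" marker */

typedef struct MockPage
{
    uint16_t checksum;               /* made-up field for this sketch */
    unsigned char data[8190];
} MockPage;

/* Checksum callback; assumed to skip the checksum field itself. */
typedef uint16_t (*checksum_fn)(const MockPage *);

/* Called by anything that dirties the page in memory, e.g. a hint-bit setter. */
static void
page_modified(MockPage *page)
{
    page->checksum = NO_CHECKSUM;
}

/* Called by the bgwriter (or autovacuum) just before the page goes to disk. */
static void
page_before_write(MockPage *page, checksum_fn compute)
{
    page->checksum = compute(page);
}

/* Called when the page is read back in. */
static bool
page_checksum_ok(const MockPage *page, checksum_fn compute)
{
    if (page->checksum == NO_CHECKSUM)
        return true;                 /* nothing to verify for this page */
    return page->checksum == compute(page);
}

The gap such a scheme accepts is visible in page_checksum_ok(): a page that
reaches disk carrying the sentinel is simply unprotected until its next
checksummed write, which is the price of keeping the work out of foreground
backends.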

#43Jesper Krogh
jesper@krogh.cc
In reply to: Greg Stark (#9)
Re: Page Checksums

On 2011-12-19 02:55, Greg Stark wrote:

On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh<jesper@krogh.cc> wrote:

I don't know if it would be seen as a "half baked feature".. or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just "skipping" the hint bit entirely.

That was one approach discussed. The problem is that the hint bits are
currently in each heap tuple header which means the checksum code
would have to know a fair bit about the structure of the page format.
Also the closer people looked the more hint bits kept turning up
because the coding pattern had been copied to other places (the page
header has one, and index pointers have a hint bit indicating that the
target tuple is deleted, etc). And to make matters worse skipping
individual bits in varying places quickly becomes a big consumer of
cpu time since it means injecting logic into each iteration of the
checksum loop to mask out the bits.

I do know it is a valid and really relevant point (the CPU time spent),
but here in late 2011 it is really a damn irritating limitation, since if
there is any resource I have plenty of available in the production
environment, it is CPU time, just not on the "single core currently
serving the client".

Jesper
--
Jesper
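
To make the CPU-cost argument concrete, compare a plain byte-wise checksum
loop with one that must consult a per-byte mask so hint bits don't
participate; both loops and the mask layout are illustrative only.

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Plain loop: one xor/rotate per byte, no branches, easy to vectorize. */
static uint32_t
checksum_plain(const unsigned char *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < BLCKSZ; i++)
        sum = (sum << 1 | sum >> 31) ^ page[i];
    return sum;
}

/*
 * Masked loop: a mask (or equivalent logic) has to be built from knowledge
 * of the page layout -- tuple headers, the page header, index hint bits --
 * and applied on every iteration.  The extra load and AND per byte, plus
 * the work to construct the mask in the first place, is the overhead Greg
 * is describing.
 */
static uint32_t
checksum_skip_hint_bits(const unsigned char *page,
                        const unsigned char *hint_bit_mask)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < BLCKSZ; i++)
        sum = (sum << 1 | sum >> 31) ^ (page[i] & ~hint_bit_mask[i]);
    return sum;
}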

#44Greg Smith
greg@2ndQuadrant.com
In reply to: Kevin Grittner (#30)
Re: Page Checksums

On 12/19/2011 06:14 PM, Kevin Grittner wrote:

But if you need all that infrastructure just to get the feature
launched, that's a bit hard to stomach.

Triggering a vacuum or some hypothetical "scrubbing" feature?

What you were suggesting doesn't require triggering just a vacuum
though--it requires triggering some number of vacuums, for all impacted
relations. You said yourself that "all tables if there's no way to
rule any of them out" was a possibility. I'm just pointing out that
scheduling that level of work is a logistics headache, and it would be
reasonable for people to expect some help with that were it to become a
necessary thing falling out of the implementation.

Some people think I border on the paranoid on this issue.

Those people are also out to get you, just like the hardware.

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue.

A CRC failure suggests to me a significantly higher possibility of
hardware likely to lead to more corruption than a normal crash does though.

The main way I expect to validate this sort of thing is with an as
yet unwritten function to grab information about a data block from
a standby server for this purpose, something like this:

Master: Computed CRC A, Stored CRC B; error raised because A!=B
Standby: Computed CRC C, Stored CRC D

If C==D && A==C, the corruption is probably overwritten bits of
the CRC B.

Are you arguing we need *that* infrastructure to get the feature
launched?

No; just pointing out the things I'd eventually expect people to want,
because they help answer questions about what to do when CRC failures
occur. The most reasonable answer to "what should I do about suspected
corruption on a page?" in most of the production situations I worry
about is "see if it's recoverable from the standby". I see this as
being similar to how RAID-1 works: if you find garbage on one drive,
and you can get a clean copy of the block from the other one, use that
to recover the missing data. If you don't have that capability, you're
stuck with no clear path forward when a CRC failure happens, as you
noted downthread.

This obviously gets troublesome if you've recently written a page out,
so there's some concern about whether you are checking against the
correct version of the page or not, based on where the standby's replay
is at. I see that as being a case that's also possible to recover from
though, because then the page you're trying to validate on the master is
likely sitting in the recent WAL stream. This is already the sort of
thing companies doing database recovery work (of which we are one) deal
with, and I doubt any proposal will cover every possible situation. In
some cases there may be no better answer than "show all the known
versions and ask the user to sort it out". The method I suggested would
sometimes kick out an automatic fix.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#45Leonardo Francalanci
m_lists@yahoo.it
In reply to: Kevin Grittner (#35)
Re: Page Checksums

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as
the one available in ZFS and btrfs?

#46Stephen Frost
sfrost@snowman.net
In reply to: Leonardo Francalanci (#45)
Re: Page Checksums

* Leonardo Francalanci (m_lists@yahoo.it) wrote:

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?

Depends on how much you trust the filesystem. :)

Stephen

#47Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#44)
Re: Page Checksums

Greg Smith <greg@2ndQuadrant.com> wrote:

Some people think I border on the paranoid on this issue.

Those people are also out to get you, just like the hardware.

Hah! I *knew* it!

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM
might destroy recoverable data when hardware starts to fail, I
can't argue.

A CRC failure suggests to me a significantly higher possibility
of hardware likely to lead to more corruption than a normal crash
does though.

Yeah, the discussion has me coming around to the point of view
advocated by Andres: that it should be treated the same as corrupt
pages detected through other means. But that can only be done if
you eliminate false positives from hint-bit-only updates. Without
some way to handle that, I guess that means the idea is dead.

Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time. A CRC or
checksum is a lot weaker than that. I can see where it would be
very valuable where more rigorous methods aren't in use; but it
would really be just extra overhead with little or no benefit for
most of our database clusters.

-Kevin

#48Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#47)
Re: Page Checksums

On Wednesday, December 21, 2011 04:21:53 PM Kevin Grittner wrote:

Greg Smith <greg@2ndQuadrant.com> wrote:

Some people think I border on the paranoid on this issue.

Those people are also out to get you, just like the hardware.

Hah! I *knew* it!

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM
might destroy recoverable data when hardware starts to fail, I
can't argue.

A CRC failure suggests to me a significantly higher possibility
of hardware likely to lead to more corruption than a normal crash
does though.

Yeah, the discussion has me coming around to the point of view
advocated by Andres: that it should be treated the same as corrupt
pages detected through other means. But that can only be done if
you eliminate false positives from hint-bit-only updates. Without
some way to handle that, I guess that means the idea is dead.

Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time. A CRC or
checksum is a lot weaker than that. I can see where it would be
very valuable where more rigorous methods aren't in use; but it
would really be just extra overhead with little or no benefit for
most of our database clusters.

Comparing between databases will by far not catch failures in all data,
because you surely will not use all indexes. With index-only scans the
likelihood of unnoticed heap corruption also increases.
E.g. I have seen disk-level corruption silently corrupting a unique index so
that it didn't cover all data anymore, which led to rather big problems.
Not everyone can do regular dump+restore tests to protect against such
scenarios...

Andres

#49Leonardo Francalanci
m_lists@yahoo.it
In reply to: Stephen Frost (#46)
Re: Page Checksums

On 21/12/2011 16.19, Stephen Frost wrote:

* Leonardo Francalanci (m_lists@yahoo.it) wrote:

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?

Depends on how much you trust the filesystem. :)

Ehm I hope that was a joke...

I think what I meant was: isn't this going to be useless in a couple of
years (if, say, btrfs will be available)? Or it actually gives something
that FS will never be able to give?

#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#47)
Re: Page Checksums

On 21.12.2011 17:21, Kevin Grittner wrote:

Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time.

4 bytes out of an 8k block is just under 0.05%. I don't think anyone is
going to notice the extra disk space consumed by this. There's all those
other issues like the hint bits that make this a non-starter, but disk
space overhead is not one of them.

IMHO we should just advise that you should use a filesystem with CRCs if
you want that extra level of safety. It's the hardware's and operating
system's job to ensure that data doesn't get corrupt after we hand it
over to the OS with write()/fsync().

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#51Robert Haas
robertmhaas@gmail.com
In reply to: Christopher Browne (#32)
Re: Page Checksums

On Tue, Dec 20, 2011 at 12:12 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

This seems to be a frequent problem with this whole "doing CRCs on pages" thing.

It's not evident which problems will be "real" ones.

That depends on the implementation. If we have a flaky, broken
implementation such as the one proposed, then, yes, it will be
unclear. But if we properly guard against a torn page invalidating
the CRC, then it won't be unclear at all: any CRC mismatch means
something bad happened.

Of course, that may be fairly expensive in terms of performance. But
the only way I can see to get around that problem is to rewrite our
heap AM or our MVCC implementation in some fashion that gets rid of
hint bits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#52Stephen Frost
sfrost@snowman.net
In reply to: Leonardo Francalanci (#49)
Re: Page Checksums

* Leonardo Francalanci (m_lists@yahoo.it) wrote:

Depends on how much you trust the filesystem. :)

Ehm I hope that was a joke...

It certainly wasn't..

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?

Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...

Thanks,

Stephen

#53Leonardo Francalanci
m_lists@yahoo.it
In reply to: Stephen Frost (#52)
Re: Page Checksums

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?

Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...

It sounds to me like a huge job to fix some issues "not unheard of"...

My point is: if we are trying to fix misbehaving drives/controllers
(something that is more common than one might think), that's already
done by ZFS on Solaris and FreeBSD, and will be done in btrfs for Linux.

I understand not trusting drives/controllers; but not trusting a
filesystem...

What am I missing? (I'm far from being an expert... I just don't
understand...)

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#50)
Re: Page Checksums

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

4 bytes out of an 8k block is just under 0.05%. I don't think anyone is
going to notice the extra disk space consumed by this. There's all those
other issues like the hint bits that make this a non-starter, but disk
space overhead is not one of them.

The bigger problem is that adding a CRC necessarily changes the page
format and therefore breaks pg_upgrade. As Greg and Simon already
pointed out upthread, there's essentially zero chance of this getting
applied before we have a solution that allows pg_upgrade to cope with
page format changes. A CRC feature is not compelling enough to justify
a non-upgradable release cycle.

regards, tom lane

#55Greg Smith
greg@2ndQuadrant.com
In reply to: Stephen Frost (#52)
Re: Page Checksums

On 12/21/2011 10:49 AM, Stephen Frost wrote:

* Leonardo Francalanci (m_lists@yahoo.it) wrote:

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?

Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...

There was a spike in data recovery business here after people started
migrating to ext4. New filesystems are no fun to roll out; some bugs
will only get shaken out when brave early adopters deploy them.

And there's even more radical changes in btrfs, since it wasn't starting
with a fairly robust filesystem as a base. And putting my tin foil hat
on, I don't feel real happy about assuming *the* solution for this issue
in PostgreSQL is the possibility of a filesystem coming one day when
that work is being steered by engineers who work at Oracle.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#56Martijn van Oosterhout
kleptog@svana.org
In reply to: Leonardo Francalanci (#45)
Re: Page Checksums

On Wed, Dec 21, 2011 at 09:32:28AM +0100, Leonardo Francalanci wrote:

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?

Hmm, filesystems are not magical. If they implement this then they will
have the same issues with torn pages as Postgres would. Which I
imagine they solve by doing a transactional update by writing the new
page to a new location, with checksum and updating a pointer. They
can't even put the checksum on the same page, like we could. How that
interacts with seqscans I have no idea.

Certainly I think we could look to them for implementation ideas, but I
don't imagine they've got something that can't be specialised for
better performance.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#55)
Re: Page Checksums

On Wed, Dec 21, 2011 at 7:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:

And there's even more radical changes in btrfs, since it wasn't starting
with a fairly robust filesystem as a base.  And putting my tin foil hat on,
I don't feel real happy about assuming *the* solution for this issue in
PostgreSQL is the possibility of a filesystem coming one day when that work
is being steered by engineers who work at Oracle.

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#58Leonardo Francalanci
m_lists@yahoo.it
In reply to: Simon Riggs (#57)
Re: Page Checksums

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).

My point is that it looks like this is going to take 1-2 years in
PostgreSQL, so it looks like a huge job... but at the same time I
understand we can't "hope other filesystems will catch up"!

I guess this feature will be tunable (off/on)?

#59Greg Stark
stark@mit.edu
In reply to: Kevin Grittner (#24)
Re: Page Checksums

On Mon, Dec 19, 2011 at 7:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

It seems to me that on a typical production system you would
probably have zero or one such page per OS crash

Incidentally I don't think this is right. There are really two kinds
of torn pages:

1) The kernel vm has many dirty 4k pages and decides to flush one 4k
page of a Postgres 8k buffer but not the other one. It doesn't sound
very logical for it to do this but it has the same kind of tradeoffs
to make that Postgres does and there could easily be cases where the
extra book-keeping required to avoid it isn't deemed worthwhile. The
two memory pages might not even land on the same part of the disk
anyways so flushing one and not the other might be reasonable.

In this case there could be an unbounded number of such torn pages and
they can stay torn on disk for a long period of time so the torn pages
may not have been actively being written when the crash occurred. On
Linux these torn pages will always be on memory page boundaries -- ie
4k blocks on x86.

2) The i/o system was in the process of writing out blocks and the
system lost power or crashed as they were being written out. In this
case there will probably only be 0 or 1 torn pages -- perhaps as many
as the scsi queue depth if there's some weird i/o scheduling going on.
In this case the torn page could be on a hardware block boundary --
often 512 byte boundaries (or if the drives don't guarantee otherwise
it could corrupt a disk block).

--
greg

#60Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#14)
Re: Page Checksums

On Mon, 2011-12-19 at 07:50 -0500, Robert Haas wrote:

I
think it would be regrettable if everyone had to give up 4 bytes per
page because some people want checksums.

I can understand that some people might not want the CPU expense of
calculating CRCs; or the upgrade expense to convert to new pages; but do
you think 4 bytes out of 8192 is a real concern?

(Aside: it would be MAXALIGNed anyway, so probably more like 8 bytes.)

I was thinking we'd go in the other direction: expanding the header
would take so much effort, why not expand it a little more to give some
breathing room for the future?

Regards,
Jeff Davis

#61Jeff Davis
pgsql@j-davis.com
In reply to: Greg Stark (#9)
Re: Page Checksums

On Mon, 2011-12-19 at 01:55 +0000, Greg Stark wrote:

On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh <jesper@krogh.cc> wrote:

I don't know if it would be seen as a "half baked feature".. or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just "skipping" the hint bit entirely.

That was one approach discussed. The problem is that the hint bits are
currently in each heap tuple header which means the checksum code
would have to know a fair bit about the structure of the page format.

Which is actually a bigger problem, because it might not be the backend
that's reading the page. It might be your backup script taking a new
base backup.

The kind of person to care about CRCs would also want the base backup
tool to verify them during the copy so that you don't overwrite your
previous (good) backup with a bad one. The more complicated we make the
verification process, the less workable that becomes.

I vote for a simple way to calculate the checksum -- fixed offsets of
each page (of course, it would need to know the page size), and a
standard checksum algorithm.

Regards,
Jeff Davis
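
The property Jeff is asking for can be sketched as the kind of loop a
base-backup tool could run over raw files with no backend involved; the
offset, the algorithm, and the function names are illustrative assumptions,
not a proposal for the actual on-disk format.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ          8192
#define CHECKSUM_OFFSET 8     /* illustrative fixed offset within the page */

/* Placeholder for whatever standard algorithm gets chosen. */
static uint16_t
compute_checksum(const unsigned char *page)
{
    uint32_t sum = 0;
    for (int i = 0; i < BLCKSZ; i++)
        sum = (sum << 1 | sum >> 31) ^ page[i];
    return (uint16_t) (sum ^ (sum >> 16));
}

/*
 * Verify every page of a raw data file.  Because the checksum lives at a
 * fixed offset and the algorithm needs no knowledge of tuple headers, a
 * backup script can run this while copying the file.
 */
static int
verify_file(FILE *fp, const char *path)
{
    unsigned char page[BLCKSZ];
    unsigned blkno = 0;
    int bad = 0;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        uint16_t stored;

        memcpy(&stored, page + CHECKSUM_OFFSET, sizeof(stored));
        memset(page + CHECKSUM_OFFSET, 0, sizeof(stored));  /* don't checksum the field itself */

        if (compute_checksum(page) != stored)
        {
            fprintf(stderr, "%s: checksum mismatch in block %u\n", path, blkno);
            bad++;
        }
        blkno++;
    }
    return bad;
}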

#62Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#29)
Re: Page Checksums

On Mon, 2011-12-19 at 22:18 +0200, Heikki Linnakangas wrote:

Or you could just use a filesystem that does CRCs...

That just moves the problem. Correct me if I'm wrong, but I don't think
there's anything special that the filesystem can do that we can't.

The filesystems that support CRCs are more like ZFS than ext3. They do
all writes to a new location, thus fragmenting the files. That may be a
good trade-off for some people, but it's not free.

Regards,
Jeff Davis

#63Jeff Davis
pgsql@j-davis.com
In reply to: Greg Stark (#59)
Re: Page Checksums

On Sun, 2011-12-25 at 22:18 +0000, Greg Stark wrote:

2) The i/o system was in the process of writing out blocks and the
system lost power or crashed as they were being written out. In this
case there will probably only be 0 or 1 torn pages -- perhaps as many
as the scsi queue depth if there's some weird i/o scheduling going on.

That would also depend on how many disks you have and what configuration
they're in, right?

Regards,
Jeff Davis

#64Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#60)
Re: Page Checksums

On Tue, Dec 27, 2011 at 1:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2011-12-19 at 07:50 -0500, Robert Haas wrote:

I
think it would be regrettable if everyone had to give up 4 bytes per
page because some people want checksums.

I can understand that some people might not want the CPU expense of
calculating CRCs; or the upgrade expense to convert to new pages; but do
you think 4 bytes out of 8192 is a real concern?

(Aside: it would be MAXALIGNed anyway, so probably more like 8 bytes.)

Yeah, I do. Our on-disk footprint is already significantly greater
than that of some other systems, and IMHO we should be looking for a
way to shrink our overhead in that area, not make it bigger.
Admittedly, most of the fat is probably in the tuple header rather
than the page header, but at any rate I don't consider burning up 1%
of our available storage space to be a negligible overhead. I'm not
sure I believe it should need to be MAXALIGN'd, since it is followed
by item pointers which IIRC only need 2-byte alignment, but then again
Heikki also recently proposed adding 4 bytes per page to allow each
page to track its XID generation, to help mitigate the need for
anti-wraparound vacuuming.

I think Simon's approach of stealing the 16-bit page version field is
reasonably clever in this regard, although I also understand why Tom
objects to it, and I certainly agree with him that we need to be
careful not to back ourselves into a corner. What I'm not too clear
about is whether a 16-bit checksum meets the needs of people who want
checksums. If we assume that flaky hardware is going to corrupt pages
steadily over time, then it seems like it might be adequate, because
in the unlikely event that the first corrupted page happens to still
pass its checksum test, well, another will come along and we'll
probably spot the problem then, likely well before any significant
fraction of the data gets eaten. But I'm not sure whether that's the
right mental model. I, and I think some others, initially assumed
we'd want a 32-bit checksum, but I'm not sure I can justify that
beyond "well, I think that's what people usually do". It could be
that even if we add new page header space for the checksum (as opposed
to stuffing it into the page version field) we still want to add only
2 bytes. Not sure...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
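
For anyone without the page header layout in their head, the two options can
be pictured with a simplified mock-up; neither struct below is the real
PageHeaderData, and the field names and widths are purely illustrative of the
trade-off being discussed.

#include <stdint.h>

/*
 * Option (a): re-use the existing 16-bit page version field as the
 * checksum.  The header size, and hence the page format, is unchanged,
 * but only 16 bits are available.
 */
typedef struct MockPageHeaderOptionA
{
    uint64_t lsn;                  /* WAL position of last change */
    uint16_t checksum_or_version;  /* 16 bits: version field doubling as checksum */
    uint16_t flags;
    uint16_t lower;                /* start of free space */
    uint16_t upper;                /* end of free space */
    uint16_t special;              /* start of special space */
    /* ... item pointers follow ... */
} MockPageHeaderOptionA;

/*
 * Option (b): add a new field.  This changes the page format (and so
 * breaks pg_upgrade) but leaves room for a wider checksum.
 */
typedef struct MockPageHeaderOptionB
{
    uint64_t lsn;
    uint16_t version;
    uint16_t flags;
    uint16_t lower;
    uint16_t upper;
    uint16_t special;
    uint32_t checksum;             /* new field: 32 bits available */
    /* ... item pointers follow ... */
} MockPageHeaderOptionB;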

#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#64)
Re: Page Checksums

On Wed, Dec 28, 2011 at 9:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

 What I'm not too clear
about is whether a 16-bit checksum meets the needs of people who want
checksums.

We need this now, hence the gymnastics to get it into this release.

16-bits of checksum is way better than zero bits of checksum, probably
about a million times better (numbers taken from papers quoted earlier
on effectiveness of checksums).

The strategy I am suggesting is 16-bits now, 32/64 later.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#66Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#64)
Re: Page Checksums

On 28.12.2011 11:00, Robert Haas wrote:

Admittedly, most of the fat is probably in the tuple header rather
than the page header, but at any rate I don't consider burning up 1%
of our available storage space to be a negligible overhead.

8 / 8192 = 0.1%.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#67Jim Nasby
jim@nasby.net
In reply to: Simon Riggs (#65)
Re: Page Checksums

On Dec 28, 2011, at 3:31 AM, Simon Riggs wrote:

On Wed, Dec 28, 2011 at 9:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

What I'm not too clear
about is whether a 16-bit checksum meets the needs of people who want
checksums.

We need this now, hence the gymnastics to get it into this release.

16-bits of checksum is way better than zero bits of checksum, probably
about a million times better (numbers taken from papers quoted earlier
on effectiveness of checksums).

The strategy I am suggesting is 16-bits now, 32/64 later.

What about allowing for an initdb option? That means that if you want binary compatibility so you can pg_upgrade then you're stuck with 16 bit checksums. If you can tolerate replicating all your data then you can get more robust checksumming.

In either case, it seems that we're quickly approaching the point where we need to start putting resources into binary page upgrading...
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#68Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#29)
Re: Page Checksums

On Mon, Dec 19, 2011 at 8:18 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Double-writes would be a useful option also to reduce the size of WAL that
needs to be shipped in replication.

Or you could just use a filesystem that does CRCs...

Double writes would reduce the size of WAL and we discussed many times
we want that.

Using a filesystem that does CRCs is basically saying "let the
filesystem cope". If that is an option, why not just turn full page
writes off and let the filesystem cope?

Do we really need double writes or even checksums in Postgres? What
use case are we covering that isn't covered by using the right
filesystem for the job? Or is that the problem? Are we implementing a
feature we needed 5 years ago but don't need now? Yes, other databases
have some of these features, but do we need them? Do we still need
them now?

Tell me we really need some or all of this and I will do my best to
make it happen.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#69Jim Nasby
jim@nasby.net
In reply to: Simon Riggs (#68)
Re: Page Checksums

On Jan 8, 2012, at 5:25 PM, Simon Riggs wrote:

On Mon, Dec 19, 2011 at 8:18 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Double-writes would be a useful option also to reduce the size of WAL that
needs to be shipped in replication.

Or you could just use a filesystem that does CRCs...

Double writes would reduce the size of WAL and we discussed many times
we want that.

Using a filesystem that does CRCs is basically saying "let the
filesystem cope". If that is an option, why not just turn full page
writes off and let the filesystem cope?

I don't think that just because a filesystem does CRCs you can't have a torn write.

Filesystem CRCs very likely will not happen to data that's in the cache. For some users, that's a huge amount of data to leave un-protected.

Filesystem bugs do happen... though presumably most of those would be caught by the filesystem's CRC check... but you never know!
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#70Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jim Nasby (#69)
Re: Page Checksums

On 10.01.2012 02:12, Jim Nasby wrote:

Filesystem CRCs very likely will not happen to data that's in the cache. For some users, that's a huge amount of data to leave un-protected.

You can repeat that argument ad infinitum. Even if the CRC covers all
the pages in the OS buffer cache, it still doesn't cover the pages in
the shared_buffers, CPU caches, in-transit from one memory bank to
another etc. You have to draw the line somewhere, and it seems
reasonable to draw it where the data moves between long-term storage,
ie. disk, and RAM.

Filesystem bugs do happen... though presumably most of those would be caught by the filesystem's CRC check... but you never know!

Yeah. At some point we have to just have faith in the underlying system.
It's reasonable to provide protection or make recovery easier from bugs
or hardware faults that happen fairly often in the real world, but a
can't-trust-no-one attitude is not very helpful.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#71Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#70)
Re: Page Checksums

On Tue, Jan 10, 2012 at 8:04 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 10.01.2012 02:12, Jim Nasby wrote:

Filesystem CRCs very likely will not happen to data that's in the cache.
For some users, that's a huge amount of data to leave un-protected.

You can repeat that argument ad infinitum. Even if the CRC covers all the
pages in the OS buffer cache, it still doesn't cover the pages in the
shared_buffers, CPU caches, in-transit from one memory bank to another etc.
You have to draw the line somewhere, and it seems reasonable to draw it
where the data moves between long-term storage, ie. disk, and RAM.

We protect each change with a CRC when we write WAL, so doing the same
thing doesn't sound entirely unreasonable, especially if your database
fits in RAM and we aren't likely to be doing I/O anytime soon. The
long term storage argument may no longer apply in a world with very
large memory.

The question is, when exactly would we check the checksum? When we
lock the block, when we pin it? We certainly can't do it on every
access to the block since we don't even track where that happens in
the code.

I think we could add an option to check the checksum immediately after
we pin a block for the first time but it would be very expensive and
sounds like we're re-inventing hardware or OS features again. Work on
50% performance drain, as an estimate.

That is a level of protection no other DBMS offers, so that is either
an advantage or a warning. Jim, if you want this, please do the
research and work out what the probability of losing shared buffer
data in your ECC RAM really is so we are doing it for quantifiable
reasons (via old Google memory academic paper) and to verify that the
cost/benefit means you would actually use it if we built it. Research
into requirements is at least as important and time consuming as
research on possible designs.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
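
Checking the checksum immediately after the first pin of a block, as Simon
describes, would hook into the pin path roughly as below; the descriptor, the
flag, and the helper functions are all invented for this sketch.

#include <stdbool.h>

/* Invented, minimal buffer descriptor for the sketch. */
typedef struct MockBufferDesc
{
    int   pin_count;
    bool  checksum_verified;   /* cleared whenever the page is read or rewritten */
    char *page;
} MockBufferDesc;

/* Placeholder: the real check would recompute the page checksum. */
static bool
page_checksum_matches(const char *page)
{
    (void) page;
    return true;
}

/* Placeholder for raising an ERROR. */
static void
report_checksum_failure(MockBufferDesc *buf)
{
    (void) buf;
}

/*
 * Verify only when the pin count goes from 0 to 1 and the buffer hasn't
 * been verified since it last changed.  Even so, every "first pin" of an
 * 8 KB block pays for a full recomputation, which is where Simon's rough
 * 50% estimate comes from.
 */
static void
pin_buffer(MockBufferDesc *buf)
{
    buf->pin_count++;

    if (buf->pin_count == 1 && !buf->checksum_verified)
    {
        if (!page_checksum_matches(buf->page))
            report_checksum_failure(buf);
        buf->checksum_verified = true;
    }
}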

#72Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Simon Riggs (#71)
Re: Page Checksums

On 10/01/12 09:07, Simon Riggs wrote:

You can repeat that argument ad infinitum. Even if the CRC covers all the
pages in the OS buffer cache, it still doesn't cover the pages in the
shared_buffers, CPU caches, in-transit from one memory bank to another etc.
You have to draw the line somewhere, and it seems reasonable to draw it
where the data moves between long-term storage, ie. disk, and RAM.

We protect each change with a CRC when we write WAL, so doing the same
thing doesn't sound entirely unreasonable, especially if your database
fits in RAM and we aren't likely to be doing I/O anytime soon. The
long term storage argument may no longer apply in a world with very
large memory.

I'm not so sure about that. The experience we have is that storage
and memory don't grow as fast as demand. Maybe we are in a minority
but at Jane Street memory size < database size is sadly true for most
of the important databases.

Concretely, the two most important databases are

715 GB

and

473 GB

in size (the second used to be much closer to the first one in size but
we recently archived a lot of data).

In both databases there is a small set of tables that use the majority of
the disk space. Those tables are also the most used tables. Typically
the size of one of those tables is between 1x and 3x the size of memory. And the
cumulative size of all indices on the table is normally roughly the same
size as the table.

Cheers,

Bene

#73Jim Nasby
jim@nasby.net
In reply to: Simon Riggs (#71)
Re: Page Checksums

On Jan 10, 2012, at 3:07 AM, Simon Riggs wrote:

I think we could add an option to check the checksum immediately after
we pin a block for the first time but it would be very expensive and
sounds like we're re-inventing hardware or OS features again. Work on
50% performance drain, as an estimate.

That is a level of protection no other DBMS offers, so that is either
an advantage or a warning. Jim, if you want this, please do the
research and work out what the probability of losing shared buffer
data in your ECC RAM really is so we are doing it for quantifiable
reasons (via old Google memory academic paper) and to verify that the
cost/benefit means you would actually use it if we built it. Research
into requirements is at least as important and time consuming as
research on possible designs.

Maybe I'm just dense, but it wasn't clear to me how you could use the information in the google paper to extrapolate data corruption probability.

I can say this: we have seen corruption from bad memory, and our Postgres buffer pool (8G) is FAR smaller than available memory on all of our servers (192G or 512G). So at least in our case, CRCs that protect the filesystem cache would protect the vast majority of our memory (96% or 98.5%).
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#74Robert Treat
rob@xzilla.net
In reply to: Jim Nasby (#73)
Re: Page Checksums

On Sat, Jan 21, 2012 at 6:12 PM, Jim Nasby <jim@nasby.net> wrote:

On Jan 10, 2012, at 3:07 AM, Simon Riggs wrote:

I think we could add an option to check the checksum immediately after
we pin a block for the first time but it would be very expensive and
sounds like we're re-inventing hardware or OS features again. Work on
50% performance drain, as an estimate.

That is a level of protection no other DBMS offers, so that is either
an advantage or a warning. Jim, if you want this, please do the
research and work out what the probability of losing shared buffer
data in your ECC RAM really is so we are doing it for quantifiable
reasons (via old Google memory academic paper) and to verify that the
cost/benefit means you would actually use it if we built it. Research
into requirements is at least as important and time consuming as
research on possible designs.

Maybe I'm just dense, but it wasn't clear to me how you could use the information in the google paper to extrapolate data corruption probability.

I can say this: we have seen corruption from bad memory, and our Postgres buffer pool (8G) is FAR smaller than
available memory on all of our servers (192G or 512G). So at least in our case, CRCs that protect the filesystem
cache would protect the vast majority of our memory (96% or 98.5%).

Would it be unfair to assert that people who want checksums but aren't
willing to pay the cost of running a filesystem that provides
checksums aren't going to be willing to make the cost/benefit trade
off that will be asked for? Yes, it is unfair of course, but it's
interesting how small the camp of those using checksummed filesystems
is.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com

#75Florian Weimer
fweimer@bfk.de
In reply to: Robert Treat (#74)
Re: Page Checksums

* Robert Treat:

Would it be unfair to assert that people who want checksums but aren't
willing to pay the cost of running a filesystem that provides
checksums aren't going to be willing to make the cost/benefit trade
off that will be asked for? Yes, it is unfair of course, but it's
interesting how small the camp of those using checksummed filesystems
is.

Don't checksumming file systems currently come bundled with other
features you might not want (such as certain vendors)?

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#76Noname
jesper@krogh.cc
In reply to: Florian Weimer (#75)
Re: Page Checksums

* Robert Treat:

Would it be unfair to assert that people who want checksums but aren't
willing to pay the cost of running a filesystem that provides
checksums aren't going to be willing to make the cost/benefit trade
off that will be asked for? Yes, it is unfair of course, but it's
interesting how small the camp of those using checksummed filesystems
is.

Don't checksumming file systems currently come bundled with other
features you might not want (such as certain vendors)?

I would chip in and say that I would prefer sticking to well-known proved
filesystems like xfs/ext4 and let the application do the checksumming.

I don't foresee fully production-ready checksumming filesystems readily
available in the standard Linux distributions in the near future.

And yes, I would for sure turn such functionality on if it were present.

--
Jesper

#77Florian Weimer
fweimer@bfk.de
In reply to: Noname (#76)
Re: Page Checksums

I would chip in and say that I would prefer sticking to well-known proved
filesystems like xfs/ext4 and let the application do the checksumming.

Yes, that's a different way of putting my concern. If you want a proven
file system with checksumming (and an fsck), options are really quite
limited.

And yes, I would for sure turn such functionality on if it were present.

Same here. I already use page-level checksum with Berkeley DB.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#78Robert Treat
rob@xzilla.net
In reply to: Noname (#76)
Re: Page Checksums

On Tue, Jan 24, 2012 at 3:02 AM,  <jesper@krogh.cc> wrote:

* Robert Treat:

Would it be unfair to assert that people who want checksums but aren't
willing to pay the cost of running a filesystem that provides
checksums aren't going to be willing to make the cost/benefit trade
off that will be asked for? Yes, it is unfair of course, but it's
interesting how small the camp of those using checksummed filesystems
is.

Don't checksumming file systems currently come bundled with other
features you might not want (such as certain vendors)?

I would chip in and say that I would prefer sticking to well-known proved
filesystems like xfs/ext4 and let the application do the checksumming.

*shrug* You could use Illumos or BSD and you'd get generally vendor-free
systems using ZFS, which I'd say offers more well-known and proven
checksumming than anything cooking in Linux land, or than the
yet-to-be-written checksumming in Postgres.

I don't foresee fully production-ready checksumming filesystems readily
available in the standard Linux distributions in the near future.

And yes, I would for sure turn such functionality on if it were present.

That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com

#79Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Treat (#78)
Re: Page Checksums

On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat <rob@xzilla.net> wrote:

And yes, I would for sure turn such functionality on if it were present.

That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.

When we talk about a 50% hit, are we discussing (1) checksums that are
checked on each I/O, or (2) checksums that are checked each time we
re-pin a shared buffer? The 50% hit was my estimate of (2) and has
not yet been measured, so it shouldn't be used unqualified when
discussing checksums. The same is also true of "I would use it"
comments, since we're not sure whether you're voting for (1) or (2).

As to whether people will actually use (1), I have no clue. But I do
know that many people request that feature, including people who
run heavy-duty Postgres production systems and who also know about
filesystems. Do people need (2)? It's easy enough to add as an option,
once we have (1) and there is real interest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#80Jim Nasby
jim@nasby.net
In reply to: Simon Riggs (#79)
Re: Page Checksums

On Jan 24, 2012, at 9:15 AM, Simon Riggs wrote:

On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat <rob@xzilla.net> wrote:

And yes, I would for sure turn such functionality on if it were present.

That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.

When we talk about a 50% hit, are we discussing (1) checksums that are
checked on each I/O, or (2) checksums that are checked each time we
re-pin a shared buffer? The 50% hit was my estimate of (2) and has
not yet been measured, so it shouldn't be used unqualified when
discussing checksums. The same is also true of "I would use it"
comments, since we're not sure whether you're voting for (1) or (2).

As to whether people will actually use (1), I have no clue. But I do
know that many people request that feature, including people who
run heavy-duty Postgres production systems and who also know about
filesystems. Do people need (2)? It's easy enough to add as an option,
once we have (1) and there is real interest.

Some people will be able to take a 50% hit and will happily turn on checksumming every time a page is pinned. But I suspect a lot of folks can't afford that kind of hit but would really like to have their filesystem cache protected (we're certainly in the latter camp).

As for checksumming filesystems, I didn't see any answers about whether the filesystem *cache* was also protected by the filesystem checksum. Even if it is, the choice of checksumming filesystems is certainly limited... ZFS is the only one that seems to have real traction, but that forces you off of Linux, which is a problem for many shops.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net