16-bit page checksums for 9.2
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.
page_checksums = on | off (default)
There are no required block changes; checksums are optional and some
blocks may have a checksum, others not. This means that the patch will
allow pg_upgrade.
That capability also limits us to 16-bit checksums. Fletcher's 16 is
used in this patch and seems rather quick, though that is easily
replaceable/tuneable if desired, perhaps even as a parameter enum.
This patch is a step on the way to 32-bit checksums in a future
redesign of the page layout, though that is not a required future
change, nor does this prevent that.
Checksum is set whenever the buffer is flushed to disk, and checked
when the page is read in from disk. It is not set at other times, and
for much of the time may not be accurate. This follows earlier
discussions from 2010-12-22, and is discussed in detail in patch
comments.
Note it works with buffer manager pages, which includes shared and
local data buffers, but not SLRU pages (yet? an easy addition but
needs other discussion around contention).
Note that all this does is detect bit errors on the page, it doesn't
identify where the error is, how bad and definitely not what caused it
or when it happened.
The main body of the patch involves changes to bufpage.c/.h so this
differs completely from the VMware patch, for technical reasons. Also
included are facilities to LockBufferForHints() with usage in various
AMs, to avoid the case where hints are set during calculation of the
checksum.
In my view this is a fully working, committable patch but I'm not in a
hurry to do so given the holiday season.
Hopefully its a gift not a turkey, and therefore a challenge for some
to prove that wrong. Enjoy either way,
Merry Christmas,
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16.v1.patchtext/x-patch; charset=US-ASCII; name=checksum16.v1.patchDownload
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 282,288 **** hashgettuple(PG_FUNCTION_ARGS)
--- 282,290 ----
/*
* Yes, so mark it by setting the LP_DEAD state in the item flags.
*/
+ LockBufferForHints(buf);
ItemIdMarkDead(PageGetItemId(page, offnum));
+ UnlockBufferForHints(buf);
/*
* Since this can be redone later if needed, it's treated the same
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 259,266 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 259,268 ----
if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
PageIsFull(page))
{
+ LockBufferForHints(buffer);
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
+ UnlockBufferForHints(buffer);
SetBufferCommitInfoNeedsSave(buffer);
}
}
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 403,415 **** _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
--- 403,427 ----
* everyone, so we may as well mark the index entry
* killed.
*/
+ if (nbuf != InvalidBuffer)
+ LockBufferForHints(nbuf);
+ else
+ LockBufferForHints(buf);
+
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
/* be sure to mark the proper buffer dirty... */
if (nbuf != InvalidBuffer)
+ {
+ UnlockBufferForHints(nbuf);
SetBufferCommitInfoNeedsSave(nbuf);
+ }
else
+ {
+ UnlockBufferForHints(buf);
SetBufferCommitInfoNeedsSave(buf);
+ }
}
}
}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 1040,1046 **** restart:
--- 1040,1048 ----
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
+ LockBufferForHints(buf);
opaque->btpo_cycleid = 0;
+ UnlockBufferForHints(buf);
SetBufferCommitInfoNeedsSave(buf);
}
}
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***************
*** 1092,1101 **** read_info(SeqTable elm, Relation rel, Buffer *buf)
--- 1092,1103 ----
*/
if (HeapTupleHeaderGetXmax(tuple.t_data) != InvalidTransactionId)
{
+ LockBufferForHints(*buf);
HeapTupleHeaderSetXmax(tuple.t_data, InvalidTransactionId);
tuple.t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
tuple.t_data->t_infomask |= HEAP_XMAX_INVALID;
SetBufferCommitInfoNeedsSave(*buf);
+ UnlockBufferForHints(*buf);
}
seq = (Form_pg_sequence) GETSTRUCT(&tuple);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 440,446 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageHeaderIsValid((PageHeader) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
--- 440,446 ----
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
***************
*** 1860,1865 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1860,1866 ----
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
***************
*** 1907,1912 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1908,1928 ----
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ *
+ * Note that the buffer must not be written to while we set verification,
+ * otherwise it would invalidate the page on-disk. So we must ensure
+ * writes are prevented, see comments for LockBufferForHints().
+ */
+ bufBlock = buf->buf_id < 0 ? LocalBufHdrGetBlock(buf) : BufHdrGetBlock(buf);
+ PageSetVerificationInfo((Page) bufBlock);
+
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
***************
*** 1921,1926 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1937,1960 ----
*/
TerminateBufferIO(buf, true, 0);
+ #ifdef USE_ASSERT_CHECKING
+ /*
+ * Confirm that all AMs have protected their hints with LockBufferForHints()
+ */
+ if (page_checksums)
+ {
+ bool just_dirtied;
+
+ LockBufHdr(buf);
+ just_dirtied = ((buf->flags & BM_JUST_DIRTIED) == BM_JUST_DIRTIED);
+ UnlockBufHdr(buf);
+
+ Assert(!just_dirtied);
+ }
+ #endif
+
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
***************
*** 2628,2633 **** WaitIO(volatile BufferDesc *buf)
--- 2662,2742 ----
}
/*
+ * Lock buffer to allow hints to be set, for use by all AMs.
+ *
+ * Many writes to PostgreSQL data pages are referred to as hints because
+ * those changes are not critical and we have elected to avoid WAL-logging
+ * the changes. As an optimisation, PostgreSQL allows hints to be set on
+ * pages locked in shared mode. When we are adding checksums prior to flush
+ * the page musn't change, so we hold the io_in_progress lock exclusively
+ * for the buffer prior to the smgrwrite. Thus setting hints must wait
+ * when we are flushing the buffer. The alternative would be to lock the
+ * whole buffer in exclusive mode when we wanted to set hints, which would
+ * cause more contention that simply holding the io_in_progress lock.
+ *
+ * It might seem possible to handle this optimistically and flush the
+ * buffer and then check to see if the buffer was just dirtied. However,
+ * that might result in the on-disk block having page checksums that don't
+ * match its contents and thus we would need to train ourselves to ignore
+ * "some" page errors. If you feel that is a path to take, then turn off
+ * checksums and be happy.
+ */
+ void
+ LockBufferForHints(Buffer buffer)
+ {
+ volatile BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+
+ /*
+ * If we aren't writing a checksum for buffers then the answer is true.
+ */
+ if (!page_checksums)
+ return;
+
+ /*
+ * If its a local buffer then it can't be IO in progress, since we'd be the
+ * one writing it and not checking to see if it was in progress.
+ */
+ if (BufferIsLocal(buffer))
+ return;
+
+ buf = &BufferDescriptors[buffer - 1];
+
+ LWLockAcquire(buf->io_in_progress_lock, LW_SHARED);
+ }
+
+ /*
+ * UnLock buffer to prevent hints being set.
+ *
+ * Don't do a CHECK_INTERRUPTS between BufferLockForHints() and BufferUnlockForHints()
+ */
+ void
+ UnlockBufferForHints(Buffer buffer)
+ {
+ volatile BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+
+ /*
+ * If we aren't writing a checksum for buffers, ignore.
+ */
+ if (!page_checksums)
+ return;
+
+ /*
+ * If its a local buffer then it can't be IO in progress, since we'd be the
+ * one writing it and not checking to see if it was in progress.
+ */
+ if (BufferIsLocal(buffer))
+ return;
+
+ buf = &BufferDescriptors[buffer - 1];
+
+ LWLockRelease(buf->io_in_progress_lock);
+ }
+
+ /*
* StartBufferIO: begin I/O on this buffer
* (Assumptions)
* My process is executing no IO
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 16,21 ****
--- 16,25 ----
#include "access/htup.h"
+ bool page_checksums = false;
+
+ static bool PageVerificationInfoOK(Page page);
+ static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
***************
*** 25,30 ****
--- 29,38 ----
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we writethe block to disk later.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
***************
*** 67,86 **** PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
! PageHeaderIsValid(PageHeader page)
{
char *pagebytes;
int i;
/* Check normal case */
! if (PageGetPageSize(page) == BLCKSZ &&
! PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
! (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! page->pd_lower >= SizeOfPageHeaderData &&
! page->pd_lower <= page->pd_upper &&
! page->pd_upper <= page->pd_special &&
! page->pd_special <= BLCKSZ &&
! page->pd_special == MAXALIGN(page->pd_special))
return true;
/* Check all-zeroes case */
--- 75,94 ----
* will clean up such a page and make it usable.
*/
bool
! PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
! if (PageVerificationInfoOK(page) &&
! (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! p->pd_lower >= SizeOfPageHeaderData &&
! p->pd_lower <= p->pd_upper &&
! p->pd_upper <= p->pd_special &&
! p->pd_special <= BLCKSZ &&
! p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
***************
*** 827,829 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 835,979 ----
pfree(itemidbase);
}
+
+ /*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+ static bool
+ PageVerificationInfoOK(Page page)
+ {
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ if (PageCalcChecksum16(page) == p->pd_verify.pd_checksum16)
+ return true;
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+
+ return false;
+ }
+
+ /*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an old checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ */
+ void
+ PageSetVerificationInfo(Page page)
+ {
+ PageHeader p = (PageHeader) page;
+
+ if (page_checksums)
+ {
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page);
+ }
+ else if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+ }
+
+ /*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+ static uint16
+ PageCalcChecksum16(Page page)
+ {
+ #define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+ #ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ *
+ * XXX present implementation is raw, untuned calculation, please tweak
+ */
+ PageHeader p = (PageHeader) page;
+ uint16 sum1 = 0;
+ uint16 sum2 = 0;
+ int i;
+
+ #define COMP_F16(from, to) \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = (sum1 + page[i]) % 255; \
+ sum2 = (sum1 + sum2) % 255; \
+ } while (0)
+
+ COMP_F16(0,
+ offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+
+ COMP_F16(offsetof(PageHeaderData, pd_prune_xid),
+ p->pd_upper);
+
+ COMP_F16(p->pd_upper,
+ (int) BLCKSZ);
+
+ return ((sum2 << 8) | sum1);
+ #endif
+ }
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 830,835 **** static struct config_bool ConfigureNamesBool[] =
--- 830,849 ----
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksums before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 150,164 ****
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG
#------------------------------------------------------------------------------
! # - Settings -
! #wal_level = minimal # minimal, archive, or hot_standby
! # (change requires restart)
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # synchronization level; on, off, or local
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
--- 150,170 ----
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
! # - Reliability -
! #page_checksums = off # calculate checksum before database I/O
! #full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+ # - Write Ahead Log -
+
+ #wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
***************
*** 166,172 ****
# fsync
# fsync_writethrough
# open_sync
- #full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
--- 172,177 ----
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 119,125 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 119,127 ----
return; /* not flushed yet, so don't set hint */
}
+ LockBufferForHints(buffer);
tuple->t_infomask |= infomask;
+ UnlockBufferForHints(buffer);
SetBufferCommitInfoNeedsSave(buffer);
}
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 209,214 **** extern bool ConditionalLockBuffer(Buffer buffer);
--- 209,216 ----
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
+ extern void LockBufferForHints(Buffer buffer);
+ extern void UnlockBufferForHints(Buffer buffer);
extern void AbortBufferIO(void);
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 18,23 ****
--- 18,25 ----
#include "storage/item.h"
#include "storage/off.h"
+ extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
***************
*** 130,136 **** typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
! uint16 pd_pagesize_version;
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
--- 132,144 ----
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
!
! union
! {
! uint16 pd_pagesize_version;
! uint16 pd_checksum16;
! } pd_verify; /* page verification data */
!
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
***************
*** 155,161 **** typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
--- 163,178 ----
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
!
! #define PD_CHECKSUM1 0x0008 /* First checksum bit */
! #define PD_CHECKSUM2 0x8000 /* Second checksum bit */
! #define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
!
! #define PageHasChecksumFlag1(page) \
! ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
! #define PageHasChecksumFlag2(page) \
! ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 165,170 **** typedef PageHeaderData *PageHeader;
--- 182,189 ----
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
***************
*** 231,249 **** typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
! * this can only be called on a formatted page (unlike
! * BufferGetPageSize, which can be called on an unformatted page).
! * however, it can be called on a page that is not stored in a buffer.
*/
! #define PageGetPageSize(page) \
! ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
*/
#define PageGetPageLayoutVersion(page) \
! (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
--- 250,271 ----
* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
! (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
***************
*** 251,262 **** typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
)
/* ----------------
--- 273,286 ----
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
***************
*** 368,374 **** do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageHeaderIsValid(PageHeader page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
--- 392,398 ----
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
***************
*** 381,385 **** extern Size PageGetExactFreeSpace(Page page);
--- 405,410 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.
I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?
regards, tom lane
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.
Andres
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint.
Anyone choosing page_checksums = on has already made a performance
reducing decision in favour of reliability. So they understand and
accept the impact. There is no locking when the parameter is off.
A safe alternative is to use LockBuffer, which has a much greater
performance impact.
I did think about optimistically checking after the write, but if we
crash at that point we will then see a block that has an invalid
checksum. It's faster but you may get a checksum failure if you crash
- but then one important aspect of this is to spot problems in case of
a crash, so that seems unacceptable.
I also wonder whether it might not result in
deadlocks.
If you can see how, please say. I can't see any ways for that myself.
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?
No, I don't. I hope and expect the next page layout change to
reintroduce such a field.
But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.
pg_filedump will need to be changed more than normal, but the version
isn't used anywhere else in the server code.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.
We would copy every time we write, yet lock only every time we set hint bits.
If that option is favoured, I'll write another version after Christmas.
ISTM we can't write and copy at the same time because the cheksum is
not a trailer field.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
Not an expert here, but after reading through the patch quickly, I
don't see anything that changes the torn-page problem though, right?Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only
dirty, right?
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.
So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.
Checksums work even when there is no crash, so if your disk goes bad
and corrupts data then you'll know about it as soon as it happens.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Import Notes
Reply to msg id not found: CAC_2qU_wNTVAA8aJ0upPK3BxmfQQXt+3ECAzK6JEq+vczHc+SA@mail.gmail.com
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.Why don't you use the same tricks as the former patch and copy the
buffer, compute the checksum on that, and then write out that copy (you
can even do both at the same time). I have a hard time believing that
the additional copy is more expensive than the locking.We would copy every time we write, yet lock only every time we set hint
bits.
Isn't setting hint bits also a rather frequent operation? At least in a well-
cached workload where most writeout happens due to checkpoints.
If that option is favoured, I'll write another version after Christmas.
Seems less complicated (wrt deadlocking et al) to me. But I havent read your
patch, so I will shut up now ;)
Andres
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.
This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.
The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.
The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.
--
greg
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.
It will detect a problem that *is* there, but one you are classifying
it as a non-problem because it is a correctable or acceptable bit
error. Given that acceptable bit errors on hints cover no more than 1%
of a block, the great likelihood is that the bit error is unacceptable
in any case, so false positives page errors are in fact very rare.
Any bit error is an indicator of problems on the external device, so
many would regard any bit error as unacceptable.
The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.
I completely agree we should do this, but we are unable to do it now,
so this patch is a stop-gap and provides a much requested feature
*now*.
In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error. Right now, all we have is
the ability to detect a bit error and as I point out above that is 99%
of the problem solves, at least.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark wrote:
The problem is that there is no WAL indicating the hint bit
change. And if the torn page includes the new checksum but not the
new hint bit or vice versa it will be a checksum mismatch.
With *just* this patch, true. An OS crash or hardware failure could
sometimes create an invalid page.
The strategy discussed in the past was moving all the hint bits to
a common area and skipping them in the checksum. No amount of
double writing or buffering or locking will avoid this problem.
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.
I completely agree we should do this, but we are unable to do it
now, so this patch is a stop-gap and provides a much requested
feature *now*.
Yes, for people who trust their environment to prevent torn pages, or
who are willing to tolerate one bad page per OS crash in return for
quick reporting of data corruption from unreliable file systems, this
is a good feature even without double-writes.
In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error.
A double-write patch would provide that, and it sounds like VMware
has a working patch for that which is being polished for submission.
It would need to wait until we have some consensus on the checksum
patch before it can be finalized. I'll try to review the patch from
this thread today, to do what I can to move that along.
-Kevin
Import Notes
Resolved by subject fallback
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.ISTM we can't write and copy at the same time because the cheksum is
not a trailer field.
Ofcourse you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes and run the
checksum and compare with the extracted checksum.
Not pretty, but I don't think it makes a difference in performence.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.It will detect a problem that *is* there, but one you are classifying
it as a non-problem because it is a correctable or acceptable bit
error.
I don't agree with this. We don't WAL-log hint bit changes precisely
because it's OK if they make it to disk and it's OK if they don't.
Given that, I don't see how we can say that writing out only half of a
page that has had hint bit changes is a problem. It's not.
(And if it is, then we ought to WAL-log all such changes regardless of
whether CRCs are in use.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.
I don't think double-writes require checksums on the data pages
themselves, just on the copies in the double-write buffers. In the
double-write buffer, you'll need some extra information per-page anyway,
like a relfilenode and block number that indicates which page it is in
the buffer.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.I don't think double-writes require checksums on the data pages themselves,
just on the copies in the double-write buffers. In the double-write buffer,
you'll need some extra information per-page anyway, like a relfilenode and
block number that indicates which page it is in the buffer.
How would you know when to look in the double write buffer?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Dec 25, 2011 at 1:01 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
This chicken-and-egg
problem requires the checksum to be implemented first.
v2 of checksum patch, using a conditional copy if checksumming is
enabled, so locking is removed.
Thanks to Andres for thwacking me with the cluestick, though I have
used a simple copy rather than a copy & calc.
Tested using make installcheck with parameter on/off, then restart and
vacuumdb to validate all pages.
Reviews, objections, user interface tweaks all welcome.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16.v2.patchtext/x-patch; charset=US-ASCII; name=checksum16.v2.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1701,1706 **** SET ENABLE_SEQSCAN TO OFF;
--- 1701,1748 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. The checksum uses
+ 16-bit checksums, using the fast Fletcher 16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 440,446 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageHeaderIsValid((PageHeader) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
--- 440,446 ----
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
***************
*** 1860,1865 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1860,1867 ----
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufCopy;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
***************
*** 1907,1916 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
! (char *) BufHdrGetBlock(buf),
false);
pgBufferUsage.shared_blks_written++;
--- 1909,1932 ----
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufBlock = BufHdrGetBlock(buf);
+ bufCopy = PageSetVerificationInfo((Page) bufBlock);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
! (char *) bufCopy,
false);
pgBufferUsage.shared_blks_written++;
***************
*** 1921,1926 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1937,1944 ----
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
*** a/src/backend/storage/buffer/localbuf.c
--- b/src/backend/storage/buffer/localbuf.c
***************
*** 200,205 **** LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
--- 200,207 ----
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 16,21 ****
--- 16,27 ----
#include "access/htup.h"
+ bool page_checksums = false;
+
+ static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+ static bool PageVerificationInfoOK(Page page);
+ static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
***************
*** 25,30 ****
--- 31,40 ----
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
***************
*** 67,86 **** PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
! PageHeaderIsValid(PageHeader page)
{
char *pagebytes;
int i;
/* Check normal case */
! if (PageGetPageSize(page) == BLCKSZ &&
! PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
! (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! page->pd_lower >= SizeOfPageHeaderData &&
! page->pd_lower <= page->pd_upper &&
! page->pd_upper <= page->pd_special &&
! page->pd_special <= BLCKSZ &&
! page->pd_special == MAXALIGN(page->pd_special))
return true;
/* Check all-zeroes case */
--- 77,96 ----
* will clean up such a page and make it usable.
*/
bool
! PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
! if (PageVerificationInfoOK(page) &&
! (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! p->pd_lower >= SizeOfPageHeaderData &&
! p->pd_lower <= p->pd_upper &&
! p->pd_upper <= p->pd_special &&
! p->pd_special <= BLCKSZ &&
! p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
***************
*** 827,829 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 837,1050 ----
pfree(itemidbase);
}
+
+ /*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+ static bool
+ PageVerificationInfoOK(Page page)
+ {
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ return true;
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+
+ elog(LOG, "page verification failed - page has one checksum flag set");
+ return false;
+ }
+
+ /*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+ char *
+ PageSetVerificationInfo(Page page)
+ {
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+ }
+
+ /*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+ static uint16
+ PageCalcChecksum16(Page page)
+ {
+ #define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+ #ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ *
+ * XXX present implementation is raw, untuned calculation, please tweak
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint16 sum1 = 0;
+ uint16 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+ #ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+ #endif
+
+ #define COMP_F16(from, to) \
+ do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = (sum1 + page[i]) % 255; \
+ sum2 = (sum1 + sum2) % 255; \
+ } \
+ } while (0); \
+
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+
+ #ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+ #endif
+
+ return ((sum2 << 8) | sum1);
+ #endif
+ }
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 830,835 **** static struct config_bool ConfigureNamesBool[] =
--- 830,849 ----
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksums before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 150,164 ****
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG
#------------------------------------------------------------------------------
! # - Settings -
! #wal_level = minimal # minimal, archive, or hot_standby
! # (change requires restart)
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # synchronization level; on, off, or local
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
--- 150,170 ----
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
! # - Reliability -
! #page_checksums = off # calculate checksum before database I/O
! #full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+ # - Write Ahead Log -
+
+ #wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
***************
*** 166,172 ****
# fsync
# fsync_writethrough
# open_sync
- #full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
--- 172,177 ----
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 18,23 ****
--- 18,25 ----
#include "storage/item.h"
#include "storage/off.h"
+ extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
***************
*** 93,99 **** typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
! * pd_pagesize_version - size in bytes and page layout version number.
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
--- 95,101 ----
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
! * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
***************
*** 106,112 **** typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
! * The page version number and page size are packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
--- 108,115 ----
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
! * For verification we store either a 16 bit checksum or a watermark of
! * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
***************
*** 130,136 **** typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
! uint16 pd_pagesize_version;
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
--- 133,145 ----
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
!
! union
! {
! uint16 pd_pagesize_version;
! uint16 pd_checksum16;
! } pd_verify; /* page verification data */
!
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
***************
*** 155,161 **** typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
--- 164,179 ----
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
!
! #define PD_CHECKSUM1 0x0008 /* First checksum bit */
! #define PD_CHECKSUM2 0x8000 /* Second checksum bit */
! #define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
!
! #define PageHasChecksumFlag1(page) \
! ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
! #define PageHasChecksumFlag2(page) \
! ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 165,170 **** typedef PageHeaderData *PageHeader;
--- 183,190 ----
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
***************
*** 231,249 **** typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
! * this can only be called on a formatted page (unlike
! * BufferGetPageSize, which can be called on an unformatted page).
! * however, it can be called on a page that is not stored in a buffer.
*/
! #define PageGetPageSize(page) \
! ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
*/
#define PageGetPageLayoutVersion(page) \
! (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
--- 251,272 ----
* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
! (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
***************
*** 251,262 **** typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
)
/* ----------------
--- 274,287 ----
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
***************
*** 368,374 **** do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageHeaderIsValid(PageHeader page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
--- 393,399 ----
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
***************
*** 381,385 **** extern Size PageGetExactFreeSpace(Page page);
--- 406,411 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
On 28.12.2011 01:39, Simon Riggs wrote:
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.I don't think double-writes require checksums on the data pages themselves,
just on the copies in the double-write buffers. In the double-write buffer,
you'll need some extra information per-page anyway, like a relfilenode and
block number that indicates which page it is in the buffer.How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write
buffer that has a valid checksum, you copy to the main storage. There's
no need to check validity of pages in the main storage.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write buffer
that has a valid checksum, you copy to the main storage. There's no need to
check validity of pages in the main storage.
OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to correct against... and the double
writes idea is orthogonal to the idea of checksums.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28.12.2011 11:22, Simon Riggs wrote:
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write buffer
that has a valid checksum, you copy to the main storage. There's no need to
check validity of pages in the main storage.OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to correct against... and the double
writes idea is orthogonal to the idea of checksums.
The reason we're talking about double write buffers in this thread is
that double write buffers can be used to solve the problem with hint
bits and checksums.
You're right, though, that it's academical whether double write buffers
can be used without checksums on data pages, if the whole point of the
exercise is to make it possible to have checksums on data pages..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 5:45 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 28.12.2011 11:22, Simon Riggs wrote:
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write
buffer
that has a valid checksum, you copy to the main storage. There's no need
to
check validity of pages in the main storage.OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to correct against... and the double
writes idea is orthogonal to the idea of checksums.The reason we're talking about double write buffers in this thread is that
double write buffers can be used to solve the problem with hint bits and
checksums.
Torn pages are not the only problem we need to detect.
You said "You scan the double write buffer...". When exactly would you do that?
Please explain how a double write buffer detects problems that do not
occur as the result of a crash.
We don't have much time, so please be clear and lucid.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote:
On 28.12.2011 01:39, Simon Riggs wrote:On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
wrote:On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid
torn pages, but it requires a checksum to work. This chicken-
and-egg problem requires the checksum to be implemented first.I don't think double-writes require checksums on the data pages
themselves, just on the copies in the double-write buffers. In
the double-write buffer, you'll need some extra information per-
page anyway, like a relfilenode and block number that indicates
which page it is in the buffer.
You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.
How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double
write buffer that has a valid checksum, you copy to the main
storage. There's no need to check validity of pages in the main
storage.
Right. I'll recap my understanding of double-write (from memory --
if there's a material error or omission, I hope someone will correct
me).
The write-ups I've seen on double-write techniques have all the
writes to the double-write buffer (a single, sequential file that
stays around). This is done as sequential writing to a file which is
overwritten pretty frequently, making the writes to a controller very
fast, and a BBU write-back cache unlikely to actually write to disk
very often. On good server-quality hardware, it should be blasting
RAM-to_RAM very efficiently. The file is fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to the normal page location, and that is fsync'd. Once that
is done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. This all happens at the
point when you would be writing the page to the database, after the
WAL-logging.
On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL. If you find a checksum error
in processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means
there can't be any torn pages there. If you get to the end and
fsync, you can be sure any torn pages from a previous attempt to
write to the database itself have been overwritten with the good copy
in the double-write buffer. Either way, you move on to WAL
processing.
You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL. If
checksums were written to the double-buffer metadata instead of
adding them to the page itself, this could be implemented alone. It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.
If we do add a checksum to the page header, that could be used for
testing for torn pages in the double-write buffer without needing a
redundant calculation for double-write. With no torn pages in the
actual database, checksum failures there would never be false
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
-Kevin
Import Notes
Resolved by subject fallback
Heikki Linnakangas wrote:
Simon Riggs wrote:
OK, then we are talking at cross purposes. Double write buffers,
in the way you explain them allow us to remove full page writes.
They clearly don't do anything to check page validity on read.
Torn pages are not the only fault we wish to correct against...
and the double writes idea is orthogonal to the idea of checksums.The reason we're talking about double write buffers in this thread
is that double write buffers can be used to solve the problem with
hint bits and checksums.
Exactly. Every time the issue of page checksums is raised, there are
objections because OS or hardware crashes could cause torn pages for
hint-bit-only writes which would be treated as serious errors
(potentially indicating hardware failure) when they are in fact
expected and benign. Some time before the thread dies, someone
generally points out that double-write technology would be a graceful
way to handle that, with the side benefit of smaller WAL files. All
available evidence suggests it would also allow a small performance
improvement, although I hesitate to emphasize that aspect of it; the
other benefits fully justify the effort without that.
I do feel there is value in a page checksum patch even without torn
page protection. The discussion on the list has convinced me that a
failed checksum should be treated as seriously as other page format
errors, rather than as a warning, even though (in the absence of torn
page protection) torn hint-bit-only page writes would be benign.
As an example of how this might be useful, consider our central
databases which contain all the detail replicated from the circuit
court databases in all the counties. These are mission-critical, so
we have redundant servers in separate buildings. At one point, one
of them experienced hardware problems and we started seeing invalid
pages. Since we can shift the load between these servers without
down time, we moved all applications to other servers, and
investigated. Now, it's possible that for some time before we got
errors on the bad pages, there could have been subtle corruption
which didn't generate errors but presented bad data on our web site.
A page checksum would help prevent that sort of problem, and a
post-crash false positive might waste a little time in investigation,
but that cost would be far outweighed by the benefit of better
accuracy guarantees.
Of course, it will be a big plus if we can roll this out in 9.2 in
conjunction with a double-write feature. Not only will double-write
probably be a bit faster than full_page_writes in the WAL log, but it
will allow protection against torn pages on hint-bit-only writes
without adding those writes to the WAL or doing any major
rearrangement of where they sit that would break pg_upgrade. It
would be nice not to have to put all sorts of caveats and
explanations into the docs about how a checksum error might be benign
due to hint bit writes.
-Kevin
Import Notes
Resolved by subject fallback
On Thu, Dec 29, 2011 at 11:08:43AM -0600, Kevin Grittner wrote:
Heikki Linnakangas wrote:
Simon Riggs wrote:OK, then we are talking at cross purposes. Double write buffers,
in the way you explain them allow us to remove full page writes.
They clearly don't do anything to check page validity on read.
Torn pages are not the only fault we wish to correct against...
and the double writes idea is orthogonal to the idea of checksums.The reason we're talking about double write buffers in this thread
is that double write buffers can be used to solve the problem with
hint bits and checksums.
Of course, it will be a big plus if we can roll this out in 9.2 in
conjunction with a double-write feature. Not only will double-write
probably be a bit faster than full_page_writes in the WAL log, but it
will allow protection against torn pages on hint-bit-only writes
without adding those writes to the WAL or doing any major
rearrangement of where they sit that would break pg_upgrade.
[Thanks for your recent thread summaries.]
A double-write buffer, like a WAL-logged full-page image, is a technique for
performing atomic writes wider than those automatically provided by components
further down the storage stack. The two strategies have different performance
characteristics, and we're told that a double-write buffer would better serve us
overall. However, its benefits would not be *greater* for hint-only writes than
for any other write. For that reason, I think we should consider these changes
independently. With page checksums enabled, remove the hazard of torn hint-only
writes by ensuring that a WAL FPI has flushed since the last checkpoint. When
necessary, emit an FPI-only record. Separately, optimize first-since-checkpoint
writes by replacing FPIs with double-write buffers. The double-write patch will
reduce the added WAL of the checksum/safe-hint-updates patch to zero. If the
double-writes patch founders, we'll just have more-costly, yet equally reliable,
page checksums.
nm
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:
1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!
Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.
--
Ants Aasma
2011/12/30 Ants Aasma <ants.aasma@eesti.ee>:
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.
I guess the assumption is that the write in (4) is either backed by
the WAL, or made safe by double writing. ISTM that such reasoning is
only correct if the change that is expressed by the WAL record can be
applied in the context of inconsistent (i.e., partially written)
pages, which I assume is not the case (excuse my ignorance regarding
such basic facts).
So I think you are right.
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Heikki Linnakangas wrote:
On 28.12.2011 01:39, Simon Riggs wrote:On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
wrote:On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid
torn pages, but it requires a checksum to work. This chicken-
and-egg problem requires the checksum to be implemented first.I don't think double-writes require checksums on the data pages
themselves, just on the copies in the double-write buffers. In
the double-write buffer, you'll need some extra information per-
page anyway, like a relfilenode and block number that indicates
which page it is in the buffer.You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.
Heikki's idea for double writes works well. It solves the problems of
torn pages in a way that would make FPW redundant.
However, I don't see that it provides protection across non-crash
write problems. We know we have these since many systems have run
without a crash for years and yet still experience corrupt data.
Double writes do not require page checksums but neither do they
replace page checksums.
So I think we need page checksums plus either FPWs or double writes.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Nicolas Barbier wrote:
2011/12/30 Ants Aasma :Kevin Grittner wrote:
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page
after a checkpoint), and for all unlogged writes (e.g.,
hint-bit-only writes). There would be no correctness problem for
always using double-write, but it would be unnecessary overhead
for other page writes, which I think we can avoid.Unless I'm missing something, double-writes are needed for all
writes, not only the first page after a checkpoint. Consider this
sequence of events:1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the
WAL.I guess the assumption is that the write in (4) is either backed by
the WAL, or made safe by double writing. ISTM that such reasoning
is only correct if the change that is expressed by the WAL record
can be applied in the context of inconsistent (i.e., partially
written) pages, which I assume is not the case (excuse my ignorance
regarding such basic facts).So I think you are right.
Hmm. It appears that I didn't think that through all the way. I see
two alternatives.
(1) We don't eliminate full_page_writes and we only need to use
double-writes for unlogged writes.
(2) We double-write all writes and on recovery we only apply WAL to
a page from pd_lsn onward. We would start from the same point and
follow the same rules except that when we read a page and find a
pd_lsn past the location of the record we are applying, we do nothing
because we are 100% sure everything to that point is safely written
and not torn. full_page_writes to WAL would not be needed.
-Kevin
Import Notes
Resolved by subject fallback
Simon Riggs wrote:
Kevin Grittner wrote:
if there is no checksum in the page itself, you can put one in the
double-write metadata.
However, I don't see that it provides protection across non-crash
write problems. We know we have these since many systems have run
without a crash for years and yet still experience corrupt data.
Agreed. I don't think anyone has tried to assert it solves the same
problems that checksums solve -- it is a high-performance way to
solve some of the problems that an in-page checksum *creates* without
breaking pg_upgrade.
Double writes do not require page checksums but neither do they
replace page checksums.
To nit-pick: double writes require a page checksum, but (as Heikki
pointed out) they don't require it to be stored in the page. If
there *is* one stored in the page, it probably makes sense to use it.
So I think we need page checksums plus either FPWs or double
writes.
Adding checksums by themselves creates a risk of false positive
corrupted page indications following an OS or hardware crash.
Additional FPWs or a new double-write mechanism are two of miriad
possible solutions to that. If it is going to be addressed for 9.2,
I believe they're the two most reasonable, especially from the POV of
pg_upgrade.
So, while they should be separate patches, the complement each other;
each makes the other perform better, and they should share some code.
-Kevin
Import Notes
Resolved by subject fallback
On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL. If
checksums were written to the double-buffer metadata instead of
adding them to the page itself, this could be implemented alone. It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.
Correct. So now lots of people seem to be jumping on the double-write
bandwagon and looking at some the things it promise: All writes are
durable
This solves 2 big issues:
- Remove torn-page problem
- Remove FPW from WAL
That up front looks pretty attractive. But we need to look at the
tradeoffs, and then decide (benchmark anyone).
Remember, postgresql is a double-write system right now. The 1st,
checkumed write is the FPW in WAL. It's fsynced. And the 2nd synced
write is when the file is synced during checkpoint.
So, postgresql currently has an optimization now that not every write
has *requirements* for atomic, instant durability. And so postgresql
get's to do lots of writes to the OS cache and *not* request them to
be instantly synced. And then at some point, when it's reay to clear
the 1st checksumed write, make sure everywrite is synced. And lots of
work went into PG recently to get even better at the collection of
writes/syncs that happen at checkpoint time to take even biger
advantage of the fact that its' better to write everything in a fil
efirst, then call a single sync.
So moving to this new double-write-area bandwagon, we move from a "WAL
FPW synced at the commit, collect as many other writes, then final
sync" type system to a system where *EVERY* write requires syncs of 2
separate 8K writes at buffer write-out time. So we avoid the FPW at
commit (yes, that's nice for latency), and we guarentee every buffer
written is consistent (that fixes our hit-bit-only dirty writes from
being torn). And we do that at a cost of every buffer write requiring
2 fsyncs, in a serial fashion. Come checkpoint, I'm wondering....
Again, all that to avoid a single "optimization" that postgresql currently has:
1) writes for hint-bit only buffers don't need to be durable
And the problem that optimization introduces:
1) Since they aren't guarenteed durable, we can't believe a checksum
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:
Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.
Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write as it is the first one after a checkpoint?
But it doesn't seem safe to me replace a page from the DW buffer and
then apply WAL to that replaced page which preceded the age of the
page in the buffer.
Cheers,
Jeff
Simon Riggs <simon@2ndQuadrant.com> wrote:
v2 of checksum patch, using a conditional copy if checksumming is
enabled, so locking is removed.Thanks to Andres for thwacking me with the cluestick, though I
have used a simple copy rather than a copy & calc.Tested using make installcheck with parameter on/off, then restart
and vacuumdb to validate all pages.Reviews, objections, user interface tweaks all welcome.
I'm happy with how this looks, except (as noted in a code comment)
that there seems to be room for optimization of the calculation
itself. Details below:
(1) I like the choice of Fletcher-16. It should be very good at
detecting problems while being a lot less expensive that an official
CRC calculation. The fact that it was developed at Lawrence
Livermore Labs and has been subjected to peer review in the IEEE
Transactions on Communications generates confidence in the
technique. According to what I've read, though, the technique is
conceptually based around the modulus of the sums. Doing the
modulus calculation for each addition is often done to prevent
overflow, but if we define sum1 and sum2 as uint I don't see how we
can get an overflow with 8k byes, so I suggest we declare both of
these local variables as uint and leave the modulus to the end. It
can't hurt to leave off 16000 division operations per checksum
generation.
(2) I'm not sure about doing this in three parts, to skip the
checksum itself and the hole in the middle of the page. Is this
because the hole might not have predictable data? Why would that
matter, as long as it is read back the same? Is it expected that
this will reduce the cost by checking fewer bytes? That we could
tolerate the read of an actual corrupted disk sector in the middle
of a page if it didn't contain any data? If we can set the checksum
to zero before starting, might it be faster to load four bytes at a
time, and iterate fewer times? (Like I said, I'm not sure about any
of this, but it seemed worth asking the questions.)
(3) Rather than having PageSetVerificationInfo() use memcpy,
followed by pass through the copied data to calculate the checksum,
might it be better to have a "copy and calculate" version of the
function (as in VMware's original patch) to save an extra pass over
the page image?
Other than these performance tweaks around the calculation phase, I
didn't spot any problems. I beat up on it a bit on a couple
machines without hitting any bugs or seeing any regression test
failures.
-Kevin
On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote:
(2) I'm not sure about doing this in three parts, to skip the
checksum itself and the hole in the middle of the page. Is this
because the hole might not have predictable data? Why would that
matter, as long as it is read back the same?
IMO not checksumming the free space would be a really bad idea. It's entirely possible to have your hardware crapping on your free space, and I'd still want to know that that was happening. Now, it would be interesting if the free space could be checksummed separately, since there's no reason to refuse to read the page if only the free space is screwed up... But given the choice, I'd rather get an error when the free space is "corrupted" and be forced to look into things rather than blissfully ignore corrupted free space only to be hit later with real data loss.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:
Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write as it is the first one after a checkpoint?
I think you nailed it.
But it doesn't seem safe to me replace a page from the DW buffer and
then apply WAL to that replaced page which preceded the age of the
page in the buffer.
That's what LSNs are for.
If we write the page to the checkpoint buffer just once per
checkpoint, recovery can restore the double-written versions of the
pages and then begin WAL replay, which will restore all the subsequent
changes made to the page. Recovery may also need to do additional
double-writes if it encounters pages that for which we wrote WAL but
never flushed the buffer, because a crash during recovery can also
create torn pages. When we reach a restartpoint, we fsync everything
down to disk and then nuke the double-write buffer. Similarly, in
normal running, we can nuke the double-write buffer at checkpoint
time, once the fsyncs are complete.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"Kevin Grittner" wrote:
if we define sum1 and sum2 as uint I don't see how we can get an
overflow with 8k byes
I feel the need to amend that opinion.
While sum1 only needs to hold a maximum of (BLCKSZ * 255), which
would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to
a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned
isn't even good for an 8 KB block size. A uint64 for sum2 can handle
BLCKSZ up to 256 MB. So we could define sum1 as uint (which is
presumably either uint32 or uint64) and sum2 as uint64 and for a
BLCKSZ up toe 16 MB we still don't need to find the modulo 255
numbers until we're done adding things up.
Does anyone think we need to support anything bigger than that?
Making them both uint64 would support a BLCKSZ up to 256 MB without
needing to do the division inside the loop. but it might be slightly
slower on 32-bit builds.
-Kevin
Import Notes
Resolved by subject fallback
On Tue, Jan 3, 2012 at 11:00 PM, Jim Nasby <jim@nasby.net> wrote:
On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote:
(2) I'm not sure about doing this in three parts, to skip the
checksum itself and the hole in the middle of the page. Is this
because the hole might not have predictable data? Why would that
matter, as long as it is read back the same?IMO not checksumming the free space would be a really bad idea. It's entirely possible to have your hardware crapping on your free space, and I'd still want to know that that was happening. Now, it would be interesting if the free space could be checksummed separately, since there's no reason to refuse to read the page if only the free space is screwed up... But given the choice, I'd rather get an error when the free space is "corrupted" and be forced to look into things rather than blissfully ignore corrupted free space only to be hit later with real data loss.
I see that argument. We don't have space for 2 checksums.
We can either
(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.
(2) report only errors that changed PostgreSQL data.
We already do (1) for WAL CRCs so doing the same thing for page
checksums makes sense and is much faster.
If enough people think we should do (2) then its a simple change to the patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 3, 2012 at 10:21 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
I'm happy with how this looks, except (as noted in a code comment)
that there seems to be room for optimization of the calculation
itself. Details below:
...
(3) Rather than having PageSetVerificationInfo() use memcpy,
followed by pass through the copied data to calculate the checksum,
might it be better to have a "copy and calculate" version of the
function (as in VMware's original patch) to save an extra pass over
the page image?
Other than these performance tweaks around the calculation phase, I
didn't spot any problems. I beat up on it a bit on a couple
machines without hitting any bugs or seeing any regression test
failures.
My focus was on getting something working first, then tuning. If we're
agreed that we have everything apart from the tuning then we can
proceed with tests to see which works better.
The copy and calculate approach might get in the way of hardware
prefetch since in my understanding the memory fetch time exceeds the
calculation time. As discussed elsewhere using that code or not would
not stop that work being credited.
David, please can you rework the VMware calc patch to produce an
additional 16-bit checksum mechanism in a way compatible with the
16bit patch, so we can test the two versions of the calculation? We
can make the GUC an enum so that the page checksum is selectable (for
testing).
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote:
(1) I like the choice of Fletcher-16. It should be very good at
detecting problems while being a lot less expensive that an official
CRC calculation.
I wonder if CRC32c wouldn't be a good alternative given more and more cpus
(its in SSE 4.2) support calculating it in silicon.
Andres
On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote:
(1) I like the choice of Fletcher-16. It should be very good at
detecting problems while being a lot less expensive that an official
CRC calculation.I wonder if CRC32c wouldn't be a good alternative given more and more cpus
(its in SSE 4.2) support calculating it in silicon.
We're trying to get something that fits in 16bits for this release.
I'm guessing CRC32c doesn't?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 3:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:
Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write as it is the first one after a checkpoint?I think you nailed it.
No, I should have explicitly stated that no checkpoint happens in
between. I think the confusion here is because I assumed Kevin
described a fixed size d-w buffer in this message:
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
... The file is fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to the normal page location, and that is fsync'd. Once that
is done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. ...
If the double-write buffer survives until the next checkpoint,
double-writing only the first write should work just fine. The
advantage over current full-page writes is that the write is not into
the WAL stream and is done (hopefully) by the bgwriter/checkpointer in
the background.
--
Ants Aasma
2012/1/4 Simon Riggs <simon@2ndquadrant.com>:
On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
I wonder if CRC32c wouldn't be a good alternative given more and more cpus
(its in SSE 4.2) support calculating it in silicon.We're trying to get something that fits in 16bits for this release.
I'm guessing CRC32c doesn't?
What happens to the problem-detecting performance of a 16 bit part of
a CRC32c vs. a real 16 bit checksum? If it is still as good, it might
make sense to use the former, assuming that there is a way to easily
trigger the silicon support and enough CPUs support it.
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
Simon Riggs wrote:
We can either
(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.
What do you mean? Each byte which goes into the checksum, and the
position of that byte influences the outcome once you've got a
non-zero value in sum1. The number of leading NIL bytes would not
affect the outcome unless you seed the calculation with something
non-zero, but including the page header in the calculation seems to
cover that OK.
(2) report only errors that changed PostgreSQL data.
We already do (1) for WAL CRCs so doing the same thing for page
checksums makes sense and is much faster.If enough people think we should do (2) then its a simple change to
the patch.
To me, (1) makes more sense, but it seems to me you're currently
doing (2) because you check in three parts, skipping the free space
in the middle of the page.
-Kevin
Import Notes
Resolved by subject fallback
Excerpts from Kevin Grittner's message of mié ene 04 04:12:43 -0300 2012:
"Kevin Grittner" wrote:
if we define sum1 and sum2 as uint I don't see how we can get an
overflow with 8k byesI feel the need to amend that opinion.
While sum1 only needs to hold a maximum of (BLCKSZ * 255), which
would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to
a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned
isn't even good for an 8 KB block size. A uint64 for sum2 can handle
BLCKSZ up to 256 MB. So we could define sum1 as uint (which is
presumably either uint32 or uint64) and sum2 as uint64 and for a
BLCKSZ up toe 16 MB we still don't need to find the modulo 255
numbers until we're done adding things up.Does anyone think we need to support anything bigger than that?
Making them both uint64 would support a BLCKSZ up to 256 MB without
needing to do the division inside the loop. but it might be slightly
slower on 32-bit builds.
We don't support BLCKSZ higher than 32k anyway ... saith pg_config.h:
/* Size of a disk block --- this also limits the size of a tuple. You can set
it bigger if you need bigger tuples (although TOAST should reduce the need
to have large tuples, since fields can be spread across multiple tuples).
BLCKSZ must be a power of 2. The maximum possible value of BLCKSZ is
currently 2^15 (32768). This is determined by the 15-bit widths of the
lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).
Changing BLCKSZ requires an initdb. */
#define BLCKSZ 8192
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas wrote:
Jeff Janes wrote:
But it doesn't seem safe to me replace a page from the DW buffer
and then apply WAL to that replaced page which preceded the age of
the page in the buffer.That's what LSNs are for.
Agreed.
If we write the page to the checkpoint buffer just once per
checkpoint, recovery can restore the double-written versions of the
pages and then begin WAL replay, which will restore all the
subsequent changes made to the page. Recovery may also need to do
additional double-writes if it encounters pages that for which we
wrote WAL but never flushed the buffer, because a crash during
recovery can also create torn pages.
That's a good point. I think WAL application does need to use
double-write. As usual, it doesn't affect *when* a page must be
written, but *how*.
When we reach a restartpoint, we fsync everything down to disk and
then nuke the double-write buffer.
I think we add to the double-write buffer as we write pages from the
buffer to disk. I don't think it makes sense to do potentially
repeated writes of the same page with different contents to the
double-write buffer as we go; nor is it a good idea to leave the page
unsynced and let the double-write buffer grow for a long time.
Similarly, in normal running, we can nuke the double-write buffer
at checkpoint time, once the fsyncs are complete.
Well, we should nuke it for re-use as soon as all pages in the buffer
are written and fsynced. I'm not at all sure that the best
performance is hit by waiting for checkpoint for that versus doing it
at page eviction time.
The whole reason that double-write techniques don't double the write
time is that it is relatively small and the multiple writes to the
same disk sectors get absorbed by the BBU write-back without actually
hitting the disk all the time. Letting the double-write buffer grow
to a large size seems likely to me to be a performance killer. The
whole double-write, including fsyncs to buffer and the actual page
location should just be considered part of the page write process, I
think.
-Kevin
Import Notes
Resolved by subject fallback
Simon, all,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.
Do we actually know when/where it's supposed to be all zeros, and hence
could we check for that explicitly? If we know what it's supposed to
be, in order to be consistent with other information, I could certainly
see value in actually checking that.
I don't think that's valuable enough to go breaking abstraction layers
or bending over backwards to do it though. If we don't have the
knowledge, at the right level, that the data should all be zeros then
including those pieces in the CRC certainly makes sense to me.
Just my 2c.
Thanks,
Stephen
Simon Riggs wrote:
My focus was on getting something working first, then tuning. If
we're agreed that we have everything apart from the tuning then we
can proceed with tests to see which works better.
Sure. I just think you are there already except for what I got into.
FWIW, moving the modulus application out of the loop is a very
trivial change and has no affect on the results; it's strictly a
performance issue.
-Kevin
Import Notes
Resolved by subject fallback
Alvaro Herrera wrote:
We don't support BLCKSZ higher than 32k anyway
Thanks for pointing that out. Then I think we should declare sum1 to
be uint and sum2 to be uint64. We can take out the "% 255" out from
where it sits in the v2 patch, and just add something like this after
the sums are generated:
sum1 %= 255;
sum2 %= 255;
Or, of course, we can just do it on the line where we combine the two
sums for the final result, if that's not too hard to read.
-Kevin
Import Notes
Resolved by subject fallback
On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
Simon, all,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.Do we actually know when/where it's supposed to be all zeros, and hence
could we check for that explicitly? If we know what it's supposed to
be, in order to be consistent with other information, I could certainly
see value in actually checking that.
Yes, we can. Excellent suggestion, will implement.
That means we can keep the CRC calc fast as well as check the whole of
the page inbound.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 8:31 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
When we reach a restartpoint, we fsync everything down to disk and
then nuke the double-write buffer.I think we add to the double-write buffer as we write pages from the
buffer to disk. I don't think it makes sense to do potentially
repeated writes of the same page with different contents to the
double-write buffer as we go; nor is it a good idea to leave the page
unsynced and let the double-write buffer grow for a long time.
You may be right. Currently, though, we only fsync() at
end-of-checkpoint. So we'd have to think about what to fsync, and how
often, to keep the double-write buffer to a manageable size. I can't
help thinking that any extra fsyncs are pretty expensive, though,
especially if you have to fsync() every file that's been
double-written before clearing the buffer. Possibly we could have 2^N
separate buffers based on an N-bit hash of the relfilenode and segment
number, so that we could just fsync 1/(2^N)-th of the open files at a
time. But even that sounds expensive: writing back lots of dirty data
isn't cheap. One of the systems I've been doing performance testing
on can sometimes take >15 seconds to write a shutdown checkpoint, and
I'm sure that other people have similar (and worse) problems.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
we only fsync() at end-of-checkpoint. So we'd have to think about
what to fsync, and how often, to keep the double-write buffer to a
manageable size.
I think this is the big tuning challenge with this technology.
I can't help thinking that any extra fsyncs are pretty expensive,
though, especially if you have to fsync() every file that's been
double-written before clearing the buffer. Possibly we could have
2^N separate buffers based on an N-bit hash of the relfilenode and
segment number, so that we could just fsync 1/(2^N)-th of the open
files at a time.
I'm not sure I'm following -- we would just be fsyncing those files
we actually wrote pages into, right? Not all segments for the table
involved?
But even that sounds expensive: writing back lots of dirty data
isn't cheap. One of the systems I've been doing performance
testing on can sometimes take >15 seconds to write a shutdown
checkpoint,
Consider the relation-file fsyncs for double-write as a form of
checkpoint spreading, and maybe it won't seem so bad. It should
make that shutdown checkpoint less painful. Now, I have been
thinking that on a write-heavy system you had better have a BBU
write-back cache, but that's my recommendation, anyway.
and I'm sure that other people have similar (and worse) problems.
Well, I have no doubt that this feature should be optional. Those
who prefer can continue to do full-page writes to the WAL, instead.
Or take the "running with scissors" approach.
-Kevin
On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
we only fsync() at end-of-checkpoint. So we'd have to think about
what to fsync, and how often, to keep the double-write buffer to a
manageable size.I think this is the big tuning challenge with this technology.
One of them, anyway. I think it may also be tricky to make sure that
a backend that needs to write a dirty buffer doesn't end up having to
wait for a double-write to be fsync'd.
I can't help thinking that any extra fsyncs are pretty expensive,
though, especially if you have to fsync() every file that's been
double-written before clearing the buffer. Possibly we could have
2^N separate buffers based on an N-bit hash of the relfilenode and
segment number, so that we could just fsync 1/(2^N)-th of the open
files at a time.I'm not sure I'm following -- we would just be fsyncing those files
we actually wrote pages into, right? Not all segments for the table
involved?
Yes.
But even that sounds expensive: writing back lots of dirty data
isn't cheap. One of the systems I've been doing performance
testing on can sometimes take >15 seconds to write a shutdown
checkpoint,Consider the relation-file fsyncs for double-write as a form of
checkpoint spreading, and maybe it won't seem so bad. It should
make that shutdown checkpoint less painful. Now, I have been
thinking that on a write-heavy system you had better have a BBU
write-back cache, but that's my recommendation, anyway.
I think this point has possibly been beaten to death, but at the risk
of belaboring the point I'll bring it up again: the frequency with
which we fsync() is basically a trade-off between latency and
throughput. If you fsync a lot, then each one will be small, so you
shouldn't experience much latency, but throughput might suck. If you
don't fsync very much, then you maximize the chances for
write-combining (because inserting an fsync between two writes to the
same block forces that block to be physically written twice rather
than just once) thus improving throughput, but when you do get around
to calling fsync() there may be a lot of data to write all at once,
and you may get a gigantic latency spike.
As far as I can tell, one fsync per checkpoint is the theoretical
minimum, and that's what we do now. So our current system is
optimized for throughput. The decision to put full-page images into
WAL rather than a separate buffer is essentially turning the dial in
the same direction, because, in effect, the double-write fsync
piggybacks on the WAL fsync which we must do anyway. So both the
decision to use a double-write buffer AT ALL and the decision to fsync
more frequently to keep that buffer to a manageable size are going to
result in turning that dial in the opposite direction. It seems to me
inevitable that, even with the best possible implementation,
throughput will get worse. With a good implementation but not a bad
one, latency should improve.
Now, this is not necessarily a reason to reject the idea. I believe
that several people have proposed that our current implementation is
*overly* optimized for throughput *at the expense of* latency, and
that we might want to provide some options that, in one way or
another, fsync more frequently, so that checkpoint spikes aren't as
bad. But when it comes time to benchmark, we might need to think
somewhat carefully about what we're testing...
Another thought here is that double-writes may not be the best
solution, and are almost certainly not the easiest-to-implement
solution. We could instead do something like this: when an unlogged
change is made to a buffer (e.g. a hint bit is set), we set a flag on
the buffer header. When we evict such a buffer, we emit a WAL record
that just overwrites the whole buffer with a new FPI. There are some
pretty obvious usage patterns where this is likely to be painful (e.g.
load a big table without setting hint bits, and then seq-scan it).
But there are also many use cases where the working set fits inside
shared buffers and data pages don't get written very often, apart from
checkpoint time, and those cases might work just fine. Also, the
cases that are problems for this implementation are likely to also be
problems for a double-write based implementation, for exactly the same
reasons: if you discover at buffer eviction time that you need to
fsync something (whether it's WAL or DW), it's going to hurt.
Checksums aren't free even when using double-writes: if you don't have
checksums, pages that have only hint bit-changes don't need to be
double-written. If double writes aren't going to give us anything
"for free", maybe that's not the right place to be focusing our
efforts...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
I think it may also be tricky to make sure that a backend that
needs to write a dirty buffer doesn't end up having to wait for a
double-write to be fsync'd.
This and other parts of your post seem to ignore the BBU write-back
cache. Multiple fsyncs of a single page can still be collapsed at
that level to a single actual disk write. In fact, I rather doubt
this technology will look very good on machines without write-back
caching. I'm not as sure as you are that this is a net loss in
throughput, either. When the fsync storm clogs the RAID controller,
even reads stall, so something which more evenly pushes writes to
disk might avoid these non-productive pauses. I think that could
improve throughput enough to balance or exceed the other effects.
Maybe. I agree we need to be careful to craft a good set of
benchmarks here.
Checksums aren't free even when using double-writes: if you don't
have checksums, pages that have only hint bit-changes don't need
to be double-written.
Agreed. Checksums aren't expected to be free under any
circumstances. I'm expecting DW to be slightly faster than FPW in
general, with or without in-page checksums.
If double writes aren't going to give us anything "for free",
maybe that's not the right place to be focusing our
efforts...
I'm not sure why it's not enough that they improve performance over
the alternative. Making some other feature with obvious overhead
"free" seems an odd requirement to hang on this. (Maybe I'm
misunderstanding you on that point?)
-Kevin
On Wed, Jan 4, 2012 at 3:51 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
If double writes aren't going to give us anything "for free",
maybe that's not the right place to be focusing our
efforts...I'm not sure why it's not enough that they improve performance over
the alternative. Making some other feature with obvious overhead
"free" seems an odd requirement to hang on this. (Maybe I'm
misunderstanding you on that point?)
Well, this thread is nominally about checksums, but here we are
talking about double writes, so I thought we were connecting those
features in some way?
Certainly, life is easier if we can develop them completely separately
- but checksums really ought to come with some sort of solution to the
problem of a torn-page with hint bit changes, IMO, and I thought
that's why were thinking so hard about DW just now.
Maybe I'm confused.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 4, 2012 at 3:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
Simon, all,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.Do we actually know when/where it's supposed to be all zeros, and hence
could we check for that explicitly? If we know what it's supposed to
be, in order to be consistent with other information, I could certainly
see value in actually checking that.Yes, we can. Excellent suggestion, will implement.
No, we can't.
I discover that non-all-zeroes holes are fairly common, just not very frequent.
That may or may not be a problem, but not something to be dealt with
here and now.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Simon Riggs wrote:
My focus was on getting something working first, then tuning. If
we're agreed that we have everything apart from the tuning then we
can proceed with tests to see which works better.Sure. I just think you are there already except for what I got into.
FWIW, moving the modulus application out of the loop is a very
trivial change and has no affect on the results; it's strictly a
performance issue.
New version attached, with your suggested changes included. Hole check
code is there as well, but ifdef'd out since it isn't a valid check in
all cases.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16.v3.patchtext/x-patch; charset=US-ASCII; name=checksum16.v3.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..9b367a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,48 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. The checksum uses
+ 16-bit checksums, using the fast Fletcher 16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 91cc001..a43b7be 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -440,7 +440,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -1860,6 +1860,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufCopy;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1907,10 +1909,24 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufBlock = BufHdrGetBlock(buf);
+ bufCopy = PageSetVerificationInfo((Page) bufBlock);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ (char *) bufCopy,
false);
pgBufferUsage.shared_blks_written++;
@@ -1921,6 +1937,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..c49c2e1 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,20 +77,20 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower >= SizeOfPageHeaderData &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
@@ -93,7 +103,6 @@ PageHeaderIsValid(PageHeader page)
return true;
}
-
/*
* PageAddItem
*
@@ -827,3 +836,239 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+ else
+ elog(LOG, "page verification failed - page has one checksum flag set");
+
+ return false;
+}
+
+/*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfo(Page page)
+{
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ *
+ * XXX present implementation is raw, untuned calculation, please tweak
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..e868280 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksums before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6f81023 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..dab9189 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,16 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit */
+#define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
+
+#define PageHasChecksumFlag1(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
+#define PageHasChecksumFlag2(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,6 +183,8 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
@@ -231,19 +251,22 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,12 +274,14 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
@@ -368,7 +393,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +406,6 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I discover that non-all-zeroes holes are fairly common, just not very frequent.
Curious, might be interesting to find out why.
That may or may not be a problem, but not something to be dealt with
here and now.
But I agree that it's not the job of this patch/effort. It sounds like
we have clear indication, however, that those areas, as they are not
necessairly all zeros, should be included in the checksum.
Thanks,
Stephen
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I discover that non-all-zeroes holes are fairly common, just not very frequent.
Curious, might be interesting to find out why.
That may or may not be a problem, but not something to be dealt with
here and now.But I agree that it's not the job of this patch/effort. It sounds like
we have clear indication, however, that those areas, as they are not
necessairly all zeros, should be included in the checksum.
Disagree. Full page writes ignore the hole, so its appropriate to do
so here also.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote:
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I discover that non-all-zeroes holes are fairly common, just not very
frequent.Curious, might be interesting to find out why.
That may or may not be a problem, but not something to be dealt with
here and now.But I agree that it's not the job of this patch/effort. It sounds like
we have clear indication, however, that those areas, as they are not
necessairly all zeros, should be included in the checksum.Disagree. Full page writes ignore the hole, so its appropriate to do
so here also.
Well, ignoriging them in fpw has clear space benefits. Ignoring them while
checksumming doesn't have that much of a benefit.
Andres
On Thu, Jan 5, 2012 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Sure. I just think you are there already except for what I got into.
New version attached, with your suggested changes included. Hole check
code is there as well, but ifdef'd out since it isn't a valid check in
all cases.
Following private discussions, Kevin showed me the patch at v3 was
inadequate because it didn't cater for torn pages as a result of hint
bit writes.
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).
Dan Scales also observed some cases where private writes aren't
checksummed, e.g. index build. Those suggestions are still outstanding
from me, but the v4 is worthy of more serious review.
I'll now turn my attention more fully onto the DW patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v4.patchtext/x-patch; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v4.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..9b367a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,48 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. The checksum uses
+ 16-bit checksums, using the fast Fletcher 16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..c9538d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -708,6 +708,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool updrqst;
bool doPageWrites;
bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ bool IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
/* cross-check on whether we should be here or not */
if (!XLogInsertAllowed())
@@ -975,6 +976,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -3670,6 +3683,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * but it seems worth doing this for general sanity and hygiene.
+ */
+ PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a017101..618c8f9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40
typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecPtr lsn;
+ XLogRecData rdata[2];
+ int watermark = SMGR_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+ /*
+ * Set the page LSN if we wrote a backup block.
+ */
+ if (!XLByteEQ(InvalidXLogRecPtr, lsn))
+ {
+ Page page = BufferGetPage(buffer);
+ PageSetLSN(page, lsn);
+ elog(LOG, "inserted backup block for hint bit");
+ }
+}
+
void
smgr_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in smgr records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
if (info == XLOG_SMGR_CREATE)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
@@ -499,6 +556,9 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
SMgrRelation reln;
Relation rel;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
/*
@@ -524,6 +584,28 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == SMGR_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there can be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
@@ -550,6 +632,8 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->blkno);
pfree(path);
}
+ else if (info == XLOG_SMGR_HINT)
+ appendStringInfo(buf, "buffer hint");
else
appendStringInfo(buf, "UNKNOWN");
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 91cc001..d63056d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -1860,6 +1861,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufCopy;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1907,10 +1910,24 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufBlock = BufHdrGetBlock(buf);
+ bufCopy = PageSetVerificationInfo((Page) bufBlock);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ (char *) bufCopy,
false);
pgBufferUsage.shared_blks_written++;
@@ -1921,6 +1938,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2341,6 +2360,31 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ /*
+ * Write a full page into WAL iff this is the first change on the
+ * block since the last checkpoint. That will never be the case
+ * if the block is already dirty because we either made a change
+ * or set a hint already. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might
+ * occur between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will set us right.
+ *
+ * XXX rename PGPROC variable later; keep it same now for clarity
+ */
+ MyPgXact->inCommit = true;
+ smgr_buffer_hint(buffer);
+
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
@@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
}
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ MyPgXact->inCommit = false;
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..c3e9928 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,20 +77,20 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower >= SizeOfPageHeaderData &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
@@ -93,7 +103,6 @@ PageHeaderIsValid(PageHeader page)
return true;
}
-
/*
* PageAddItem
*
@@ -827,3 +836,239 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function. So we zero the checksum when restoring backup blocks.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+ else
+ elog(LOG, "page verification failed - page has one checksum flag set");
+
+ return false;
+}
+
+/*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfo(Page page)
+{
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ *
+ * XXX present implementation is raw, untuned calculation, please tweak
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..e868280 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksums before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6f81023 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..eb32856 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -114,6 +114,8 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
#define XLogPageHeaderSize(hdr) \
(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
+#define XLOG_SMGR_HINT 0x40
+
/*
* We break each logical log file (xlogid value) into segment files of the
* size indicated by XLOG_SEG_SIZE. One possible segment at the end of each
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index d5103a8..48a728c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -36,6 +36,7 @@ extern void PostPrepare_smgr(void);
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+extern void smgr_buffer_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..dab9189 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,16 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit */
+#define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
+
+#define PageHasChecksumFlag1(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
+#define PageHasChecksumFlag2(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,6 +183,8 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
@@ -231,19 +251,22 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,12 +274,14 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
@@ -368,7 +393,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +406,6 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).
Well, but it will hurt in those cases where hint bits already hurt hugely.
Which is for example when accessing huge amounts of data the first time after
loading it. Which is not exactly uncommon...
Andres
On 06.01.2012 20:26, Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).
How's that going to work during recovery? Like in hot standby.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
On 06.01.2012 20:26, Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually
introduce a new possibility to transport hint bits to the standby (think
fpw's). A new transport will obviously increase traffic but ...
Andres
On Fri, Jan 6, 2012 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).Well, but it will hurt in those cases where hint bits already hurt hugely.
Which is for example when accessing huge amounts of data the first time after
loading it. Which is not exactly uncommon...
No, but I'd rather see us commit a checksum patch that sometimes
carries a major performance penalty and then work to reduce that
penalty later than never commit anything at all. I think it's
unrealistic to assume we're going to get this perfectly right in one
try. I am not sure whether this particular patch is close enough for
government work or still needs more hacking, and it may well be the
latter, but there is no use holding our breath and waiting for
absolute perfection.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
On 06.01.2012 20:26, Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually
introduce a new possibility to transport hint bits to the standby (think
fpw's). A new transport will obviously increase traffic but ...
The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.
Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Friday, January 06, 2012 08:53:38 PM Robert Haas wrote:
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
On 06.01.2012 20:26, Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually
introduce a new possibility to transport hint bits to the standby (think
fpw's). A new transport will obviously increase traffic but ...The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.
Err. Stupid me, thanks.
Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.
Its funny. I have the feeling we all are missing a very obvious brilliant
solution to this...
Andres
On Fri, Jan 6, 2012 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
On 06.01.2012 20:26, Simon Riggs wrote:
The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually
introduce a new possibility to transport hint bits to the standby (think
fpw's). A new transport will obviously increase traffic but ...The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.
Since we ignore hints during HS anyway, not setting them seems OK if
checksums defined. Or we can recommend that you don't use checksums on
a standby. Whichever fits.
Writing pages during recovery doesn't need WAL. If we crash, we replay
using the already generated WAL.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
No, but I'd rather see us commit a checksum patch that sometimes
carries a major performance penalty and then work to reduce that
penalty later than never commit anything at all. I think it's
unrealistic to assume we're going to get this perfectly right in one
try. I am not sure whether this particular patch is close enough for
government work or still needs more hacking, and it may well be the
latter, but there is no use holding our breath and waiting for
absolute perfection.
Well said. My view completely.
I do want jam tomorrow, I just want bread and butter today as well.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 3:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Since we ignore hints during HS anyway,
No, we don't. We need to ignore visibility map bits, but we need not
and do not ignore HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, etc.
not setting them seems OK if
checksums defined. Or we can recommend that you don't use checksums on
a standby. Whichever fits.Writing pages during recovery doesn't need WAL. If we crash, we replay
using the already generated WAL.
Which is all fine, except when you start making changes that are not
WAL-logged. Then, you have the same torn page problem that exists
when you it in normal running.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.Err. Stupid me, thanks.
Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.Its funny. I have the feeling we all are missing a very obvious brilliant
solution to this...
Like getting rid of hint bits?
merlin
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.Err. Stupid me, thanks.
Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.Its funny. I have the feeling we all are missing a very obvious brilliant
solution to this...Like getting rid of hint bits?
Or even just not bothering to consider them as making buffers dirty,
so the only writes are already protected by the double-write (WAL, or
if they get some DW outside of WAL).
I think I've said it before, but I'm guessing OLTP style database
rarely have pages written that are dirty that aren't covered by real
changes (so have the FPW anyways) and OLAP type generally freeze after
loads to avoid the hint-bit-write penalty too...
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
Its funny. I have the feeling we all are missing a very obvious brilliant
solution to this...Like getting rid of hint bits?
No.
First, that solution has been proposed before, and it crashed and
burned before. You may as well propose getting rid of VACUUM. In
each case, the supposed problem is in fact a cure for a much more
severe problem that would quickly make you long for the days when you
had the first one. I think you (or someone else?) had a fairly
well-developed patch for blunting the impact of hint bit setting, and
that we might be able to do if you (or someone) wants to finish it.
But getting rid of them altogether isn't likely, not because
PostgreSQL developers are an intractable herd of mule-headed idiots
(although I have been known to be all of those things on occasion) but
because the performance characteristics demonstrably suck, as mostly
recently demonstrated by commit
4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b, which significantly improved
performance with synchronous_commit=off by shaving about an average of
one-tenth of one second off the time before hint bits could be set on
any given tuple.
Second, even if you did it, it wouldn't fix this problem, because hint
bits aren't the only changes we make without WAL logging. In
particular, when an index scan runs across a tuple that turns out to
be dead, we kill the index tuple so that future scans don't need to
revisit it. I haven't actually done any testing to measure how
significant this is, but I wouldn't bet on it not mattering.
I think it would be wonderful to come up with a design that either
blunted the effect of hint bits or eliminated them altogether, but the
idea that there's an obvious way forward there that we've simply
refused to pursue is, AFAICT, flat wrong. Let's not get sidetracked
into talking about things that aren't going to change in time for 9.2
(and probably not 9.3, either, given that no one seems to be putting
any time into it) and aren't even feasible solutions to the problem at
hand (checksums) anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 6:48 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
I think I've said it before, but I'm guessing OLTP style database
rarely have pages written that are dirty that aren't covered by real
changes (so have the FPW anyways) and OLAP type generally freeze after
loads to avoid the hint-bit-write penalty too...
But ya, again, I've never measured ;-)
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Fri, Jan 6, 2012 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Writing pages during recovery doesn't need WAL. If we crash, we replay
using the already generated WAL.Which is all fine, except when you start making changes that are not
WAL-logged. Then, you have the same torn page problem that exists
when you it in normal running.
Yes, of course. My point is that this is not a blocker to using
checksums, only that some actions cannot occur on the standby but the
replay of changes is not in danger.
page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.
page_checksums will default to 'off' in the final patch anyway, in my
understanding.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07.01.2012 12:14, Simon Riggs wrote:
page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.
There needs to be a well-documented way of turning it on/off. In
particular, from off to on. If you ever flip it off even for a minute,
pages with invalid checksums start to appear in the data directory, and
if you then turn it on again, you start getting errors. Perhaps there
needs to be a third setting, calculate-but-dont-verify, where CRCs are
updated but existing CRCs are not expected to be correct. And a utility
to scan through your database and fix any incorrect CRCs, so that after
that's run in calculate-but-dont-verify mode, you can safely turn
checksums on.
Even better would be a way to make that process robust so that you can't
do a pilot error and turn page_checksums 'on' when it's not safe to do
so. Otherwise when you get a checksum error, your first thought is going
to be "hmm, I wonder if anyone ever turned page_checksums 'off' by
accident anytime in the past, or if this is a genuine error".
page_checksums will default to 'off' in the final patch anyway, in my
understanding.
That makes the need for such a facility even more important. Otherwise
there's no safe way to ever switch it from off to on.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, Jan 7, 2012 at 10:26 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
There needs to be a well-documented way of turning it on/off. In particular,
from off to on.
There is.... in the patch.
The checksum field is optional, as is the parameter.
If page_checksums is on, we write a checksum and it is correct. We
also validate any checksums we see.
If page_checksums is off we clear the checksum on write, so an
incorrect checksum is never written.
So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.
Any holes in that I haven't noticed?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.Any holes in that I haven't noticed?
And of course, as soon as I wrote that I thought of the problem. We
mustn't make a write that hasn't been covered by a FPW, so we must
know ahead of time whether to WAL log hints or not. We can't simply
turn it on/off any longer, now that we have to WAL log hint bits also.
So thanks for making me think of that.
We *could* make it turn on/off at each checkpoint, but its easier just
to say that it can be turned on/off at server start.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.Any holes in that I haven't noticed?
And of course, as soon as I wrote that I thought of the problem. We
mustn't make a write that hasn't been covered by a FPW, so we must
know ahead of time whether to WAL log hints or not. We can't simply
turn it on/off any longer, now that we have to WAL log hint bits also.
So thanks for making me think of that.We *could* make it turn on/off at each checkpoint, but its easier just
to say that it can be turned on/off at server start.
Attached patch v6 now handles hint bits and checksums correctly,
following Heikki's comments.
In recovery, setting a hint doesn't dirty a block if it wasn't already
dirty. So we can write some hints, and we can set others but not write
them.
Lots of comments in the code.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v6.patchtext/x-patch; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v6.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. The checksum uses
+ 16-bit checksums, using the fast Fletcher 16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..c9538d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -708,6 +708,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool updrqst;
bool doPageWrites;
bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ bool IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
/* cross-check on whether we should be here or not */
if (!XLogInsertAllowed())
@@ -975,6 +976,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -3670,6 +3683,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * but it seems worth doing this for general sanity and hygiene.
+ */
+ PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a017101..618c8f9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40
typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecPtr lsn;
+ XLogRecData rdata[2];
+ int watermark = SMGR_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+ /*
+ * Set the page LSN if we wrote a backup block.
+ */
+ if (!XLByteEQ(InvalidXLogRecPtr, lsn))
+ {
+ Page page = BufferGetPage(buffer);
+ PageSetLSN(page, lsn);
+ elog(LOG, "inserted backup block for hint bit");
+ }
+}
+
void
smgr_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in smgr records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
if (info == XLOG_SMGR_CREATE)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
@@ -499,6 +556,9 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
SMgrRelation reln;
Relation rel;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
/*
@@ -524,6 +584,28 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == SMGR_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there can be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
@@ -550,6 +632,8 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->blkno);
pfree(path);
}
+ else if (info == XLOG_SMGR_HINT)
+ appendStringInfo(buf, "buffer hint");
else
appendStringInfo(buf, "UNKNOWN");
}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index f7712a9..8a8232b 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1092,10 +1092,10 @@ read_info(SeqTable elm, Relation rel, Buffer *buf)
*/
if (HeapTupleHeaderGetXmax(tuple.t_data) != InvalidTransactionId)
{
- HeapTupleHeaderSetXmax(tuple.t_data, InvalidTransactionId);
- tuple.t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
- tuple.t_data->t_infomask |= HEAP_XMAX_INVALID;
- SetBufferCommitInfoNeedsSave(*buf);
+ HeapTupleHeaderSetXmax(tuple.t_data, InvalidTransactionId);
+ tuple.t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
+ tuple.t_data->t_infomask |= HEAP_XMAX_INVALID;
+ SetBufferCommitInfoNeedsSave(*buf);
}
seq = (Form_pg_sequence) GETSTRUCT(&tuple);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 91cc001..42d43cd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -1860,6 +1861,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufCopy;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1907,10 +1910,24 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufBlock = BufHdrGetBlock(buf);
+ bufCopy = PageSetVerificationInfo((Page) bufBlock);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ (char *) bufCopy,
false);
pgBufferUsage.shared_blks_written++;
@@ -1921,6 +1938,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2341,6 +2360,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ /*
+ * If we're writing checksums and we care about torn pages then we
+ * cannot dirty a page during recovery as a result of a hint.
+ * We can set the hint, just not dirty the page as a result.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (HintsMustNotDirtyPage())
+ return;
+
+ /*
+ * Write a full page into WAL iff this is the first change on the
+ * block since the last checkpoint. That will never be the case
+ * if the block is already dirty because we either made a change
+ * or set a hint already. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might
+ * occur between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will set us right.
+ *
+ * XXX rename PGPROC variable later; keep it same now for clarity
+ */
+ MyPgXact->inCommit = true;
+ smgr_buffer_hint(buffer);
+
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
@@ -2351,6 +2405,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
}
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ MyPgXact->inCommit = false;
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..e07f133 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,20 +77,20 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower >= SizeOfPageHeaderData &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
@@ -93,7 +103,6 @@ PageHeaderIsValid(PageHeader page)
return true;
}
-
/*
* PageAddItem
*
@@ -827,3 +836,258 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes protect us from that, which are stored in WAL. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on. Enforced by checking HintsMustNotDirtyPage()
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function. So we zero the checksum when restoring backup blocks.
+ * In recovery, since we only dirty a block when we have a full page image
+ * available if we crash, we are fully OK to use page verification.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+ else
+ elog(LOG, "page verification failed - page has one checksum flag set");
+
+ return false;
+}
+
+/*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfo(Page page)
+{
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ *
+ * XXX present implementation is raw, untuned calculation, please tweak
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..1c3f485 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksums before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6f81023 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..eb32856 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -114,6 +114,8 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
#define XLogPageHeaderSize(hdr) \
(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
+#define XLOG_SMGR_HINT 0x40
+
/*
* We break each logical log file (xlogid value) into segment files of the
* size indicated by XLOG_SEG_SIZE. One possible segment at the end of each
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index d5103a8..48a728c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -36,6 +36,7 @@ extern void PostPrepare_smgr(void);
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+extern void smgr_buffer_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..38708c0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,16 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit */
+#define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
+
+#define PageHasChecksumFlag1(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
+#define PageHasChecksumFlag2(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,6 +183,8 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
@@ -231,19 +251,22 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,14 +274,24 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
+/*
+ * HintsMustNotDirtyPage
+ * See discussion for PageVerificationInfoOK()
+ */
+#define HintsMustNotDirtyPage() \
+ (page_checksums && fullPageWrites && RecoveryInProgress())
+extern bool fullPageWrites;
+
/* ----------------
* page special data macros
* ----------------
@@ -368,7 +401,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +414,6 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
On Jan 6, 2012, at 4:36 AM, Andres Freund wrote:
On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote:
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I discover that non-all-zeroes holes are fairly common, just not very
frequent.Curious, might be interesting to find out why.
That may or may not be a problem, but not something to be dealt with
here and now.But I agree that it's not the job of this patch/effort. It sounds like
we have clear indication, however, that those areas, as they are not
necessairly all zeros, should be included in the checksum.Disagree. Full page writes ignore the hole, so its appropriate to do
so here also.Well, ignoriging them in fpw has clear space benefits. Ignoring them while
checksumming doesn't have that much of a benefit.
I agree with Andres... we should checksum zero bytes, because if they're screwed up then something is wrong with your system, even if you got lucky with what data got trashed.
As I mentioned before, 2 separate checksums would be nice, but if we can't have that I think we need to fail on any checksum error.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 1/7/12 5:26 AM, Heikki Linnakangas wrote:
Perhaps there
needs to be a third setting, calculate-but-dont-verify, where CRCs are
updated but existing CRCs are not expected to be correct. And a utility
to scan through your database and fix any incorrect CRCs, so that after
that's run in calculate-but-dont-verify mode, you can safely turn
checksums on.
The users of any feature like this are eventually going to demand a
"scrubbing" utility that validates the checksums in a background task.
Doing something like pg_dump just to validate your copy of the data is
still clean is not a great approach. If you accept that sort of utility
is inevitable, you might as well presume it could handle this sort of
task eventually too.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 12/30/11 9:44 AM, Aidan Van Dyk wrote:
So moving to this new double-write-area bandwagon, we move from a "WAL
FPW synced at the commit, collect as many other writes, then final
sync" type system to a system where *EVERY* write requires syncs of 2
separate 8K writes at buffer write-out time.
It's not quite that bad. The double-write area is going to be a small
chunk of re-used sequential I/O, like the current WAL. And if this
approach shifts some of the full-page writes out of the WAL and toward
the new area instead, that's not a real doubling either. Could probably
put both on the same disk, and in situations where you don't have a
battery-backed write cache it's possible to get a write to both per
rotation.
This idea has been tested pretty extensively as part of MySQL's Innodb
engine. Results there suggest the overhead is in the 5% to 30% range;
some examples mentioning both extremes of that:
http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
http://www.bigdbahead.com/?p=392
Makes me wish I knew off the top of my head how expensive WAL logging
hint bits would be, for comparison sake.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.Any holes in that I haven't noticed?
And of course, as soon as I wrote that I thought of the problem. We
mustn't make a write that hasn't been covered by a FPW, so we must
know ahead of time whether to WAL log hints or not. We can't simply
turn it on/off any longer, now that we have to WAL log hint bits also.
So thanks for making me think of that.We *could* make it turn on/off at each checkpoint, but its easier just
to say that it can be turned on/off at server start.Attached patch v6 now handles hint bits and checksums correctly,
following Heikki's comments.In recovery, setting a hint doesn't dirty a block if it wasn't already
dirty. So we can write some hints, and we can set others but not write
them.Lots of comments in the code.
v7
* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pickup all non-shared buffer writes also
Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v7.patchtext/x-patch; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v7.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. The checksum uses
+ 16-bit checksums, using the fast Fletcher 16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..7c7b20e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -709,6 +709,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool updrqst;
bool doPageWrites;
bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ bool IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
uint8 info_orig = info;
/* cross-check on whether we should be here or not */
@@ -955,6 +956,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -3650,6 +3663,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * but it seems worth doing this for general sanity and hygiene.
+ */
+ PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a017101..618c8f9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40
typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecPtr lsn;
+ XLogRecData rdata[2];
+ int watermark = SMGR_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+ /*
+ * Set the page LSN if we wrote a backup block.
+ */
+ if (!XLByteEQ(InvalidXLogRecPtr, lsn))
+ {
+ Page page = BufferGetPage(buffer);
+ PageSetLSN(page, lsn);
+ elog(LOG, "inserted backup block for hint bit");
+ }
+}
+
void
smgr_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in smgr records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
if (info == XLOG_SMGR_CREATE)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
@@ -499,6 +556,9 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
SMgrRelation reln;
Relation rel;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
/*
@@ -524,6 +584,28 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == SMGR_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there can be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
@@ -550,6 +632,8 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->blkno);
pfree(path);
}
+ else if (info == XLOG_SMGR_HINT)
+ appendStringInfo(buf, "buffer hint");
else
appendStringInfo(buf, "UNKNOWN");
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8f68bcc..dad34f2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -1860,6 +1861,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1907,10 +1909,15 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ bufBlock = BufHdrGetBlock(buf);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ (char *) bufBlock,
false);
pgBufferUsage.shared_blks_written++;
@@ -1921,6 +1928,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ /*
+ * If we're writing checksums and we care about torn pages then we
+ * cannot dirty a page during recovery as a result of a hint.
+ * We can set the hint, just not dirty the page as a result.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (HintsMustNotDirtyPage())
+ return;
+
+ /*
+ * Write a full page into WAL iff this is the first change on the
+ * block since the last checkpoint. That will never be the case
+ * if the block is already dirty because we either made a change
+ * or set a hint already. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might
+ * occur between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will set us right.
+ *
+ * XXX rename PGPROC variable later; keep it same now for clarity
+ */
+ MyPgXact->inCommit = true;
+ smgr_buffer_hint(buffer);
+
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
@@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
}
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ MyPgXact->inCommit = false;
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..c8d15bd 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,20 +77,20 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower >= SizeOfPageHeaderData &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
@@ -93,7 +103,6 @@ PageHeaderIsValid(PageHeader page)
return true;
}
-
/*
* PageAddItem
*
@@ -827,3 +836,266 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes protect us from that, which are stored in WAL. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on. Enforced by checking HintsMustNotDirtyPage()
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function. So we zero the checksum when restoring backup blocks.
+ * In recovery, since we only dirty a block when we have a full page image
+ * available if we crash, we are fully OK to use page verification.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified against that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+ else
+ elog(LOG, "page verification failed - page has one checksum flag set");
+
+ return false;
+}
+
+/*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfo(Page page)
+{
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+#ifdef IGNORE_PAGE_HOLE
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+#else
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ BLCKSZ - 1);
+#endif
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bfc9f06..8897a9b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -689,6 +689,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
off_t seekpos;
int nbytes;
MdfdVec *v;
+ char *bufCopy;
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
@@ -701,6 +702,16 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_rnode.node.relNode,
reln->smgr_rnode.backend);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufCopy = PageSetVerificationInfo((Page) buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL);
seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -713,7 +724,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, bufCopy, BLCKSZ);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..9a76bc8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksum before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6f81023 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..eb32856 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -114,6 +114,8 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
#define XLogPageHeaderSize(hdr) \
(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
+#define XLOG_SMGR_HINT 0x40
+
/*
* We break each logical log file (xlogid value) into segment files of the
* size indicated by XLOG_SEG_SIZE. One possible segment at the end of each
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index d5103a8..48a728c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -36,6 +36,7 @@ extern void PostPrepare_smgr(void);
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+extern void smgr_buffer_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..38708c0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,16 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit */
+#define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
+
+#define PageHasChecksumFlag1(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
+#define PageHasChecksumFlag2(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,6 +183,8 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
@@ -231,19 +251,22 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,14 +274,24 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
+/*
+ * HintsMustNotDirtyPage
+ * See discussion for PageVerificationInfoOK()
+ */
+#define HintsMustNotDirtyPage() \
+ (page_checksums && fullPageWrites && RecoveryInProgress())
+extern bool fullPageWrites;
+
/* ----------------
* page special data macros
* ----------------
@@ -368,7 +401,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +414,6 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
v7
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.
Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.
When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.
--- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml
+ <para> + Turning this parameter off speeds normal operation, but + might allow data corruption to go unnoticed. The checksum uses + 16-bit checksums, using the fast Fletcher 16 algorithm. With this + parameter enabled there is still a non-zero probability that an error + could go undetected, as well as a non-zero probability of false + positives. + </para>
What sources of false positives do you intend to retain?
--- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -20,6 +20,7 @@ #include "postgres.h"#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}+/* + * Write a backup block if needed when we are setting a hint. + * + * Deciding the "if needed" bit is delicate and requires us to either + * grab WALInsertLock or check the info_lck spinlock. If we check the + * spinlock and it says Yes then we will need to get WALInsertLock as well, + * so the design choice here is to just go straight for the WALInsertLock + * and trust that calls to this function are minimised elsewhere. + * + * Callable while holding share lock on the buffer content. + * + * Possible that multiple concurrent backends could attempt to write + * WAL records. In that case, more than one backup block may be recorded + * though that isn't important to the outcome and the backup blocks are + * likely to be identical anyway. + */ +#define SMGR_HINT_WATERMARK 13579 +void +smgr_buffer_hint(Buffer buffer) +{ + /* + * Make an XLOG entry reporting the hint + */ + XLogRecPtr lsn; + XLogRecData rdata[2]; + int watermark = SMGR_HINT_WATERMARK; + + /* + * Not allowed to have zero-length records, so use a small watermark + */ + rdata[0].data = (char *) (&watermark); + rdata[0].len = sizeof(int); + rdata[0].buffer = InvalidBuffer; + rdata[0].buffer_std = false; + rdata[0].next = &(rdata[1]); + + rdata[1].data = NULL; + rdata[1].len = 0; + rdata[1].buffer = buffer; + rdata[1].buffer_std = true; + rdata[1].next = NULL; + + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata); + + /* + * Set the page LSN if we wrote a backup block. + */ + if (!XLByteEQ(InvalidXLogRecPtr, lsn)) + { + Page page = BufferGetPage(buffer); + PageSetLSN(page, lsn);
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.
Need a PageSetTLI() here, too.
+ elog(LOG, "inserted backup block for hint bit"); + } +}
--- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c
@@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED)) { + /* + * If we're writing checksums and we care about torn pages then we + * cannot dirty a page during recovery as a result of a hint. + * We can set the hint, just not dirty the page as a result. + * + * See long discussion in bufpage.c + */ + if (HintsMustNotDirtyPage()) + return;
This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.
+ + /* + * Write a full page into WAL iff this is the first change on the + * block since the last checkpoint. That will never be the case + * if the block is already dirty because we either made a change + * or set a hint already. Note that aggressive cleaning of blocks + * dirtied by hint bit setting would increase the call rate. + * Bulk setting of hint bits would reduce the call rate... + * + * We must issue the WAL record before we mark the buffer dirty. + * Otherwise we might write the page before we write the WAL. + * That causes a race condition, since a checkpoint might + * occur between writing the WAL record and marking the buffer dirty. + * We solve that with a kluge, but one that is already in use + * during transaction commit to prevent race conditions. + * Basically, we simply prevent the checkpoint WAL record from + * being written until we have marked the buffer dirty. We don't + * start the checkpoint flush until we have marked dirty, so our + * checkpoint must flush the change to disk successfully or the + * checkpoint never gets written, so crash recovery will set us right. + * + * XXX rename PGPROC variable later; keep it same now for clarity + */ + MyPgXact->inCommit = true; + smgr_buffer_hint(buffer); + LockBufHdr(bufHdr); Assert(bufHdr->refcount > 0); if (!(bufHdr->flags & BM_DIRTY)) @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) } bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); UnlockBufHdr(bufHdr); + MyPgXact->inCommit = false; } }diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 096d36a..a220310 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, /* Find smgr relation for buffer */ oreln = smgropen(bufHdr->tag.rnode, MyBackendId);+ /* XXX do we want to write checksums for local buffers? An option? */ +
Don't we have that, now that it happens in mdwrite()?
I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.
--- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
I get a 404 for that link.
--- a/src/include/storage/bufpage.h +++ b/src/include/storage/bufpage.h
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
The comment is not consistent with the character of the value.
Thanks,
nm
Some other comments on the checksum patch:
I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were every another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()?
You may have already noticed this, but a bunch of the comments are incorrect, now that you have moved the checksum calculation to mdwrite().
config.sgml - says writes via temp_buffers (which I think means local buffers) are not checksummed -- that's no longer true, right?
bufmgr.c, line 1914 - bufToWrite no longer exists. You could revert changes from 1912-1920
localbuf.c, line 203 - as mentioned below, this comment is no longer relevant, you are checksumming local buffers now
Dan
----- Original Message -----
From: "Noah Misch" <noah@leadboat.com>
To: "Simon Riggs" <simon@2ndQuadrant.com>
Cc: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Robert Haas" <robertmhaas@gmail.com>, "Andres Freund" <andres@anarazel.de>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, david@fetter.org, aidan@highrise.ca, stark@mit.edu, pgsql-hackers@postgresql.org
Sent: Thursday, January 26, 2012 12:20:39 PM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
v7
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.
Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.
When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.
--- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml
+ <para> + Turning this parameter off speeds normal operation, but + might allow data corruption to go unnoticed. The checksum uses + 16-bit checksums, using the fast Fletcher 16 algorithm. With this + parameter enabled there is still a non-zero probability that an error + could go undetected, as well as a non-zero probability of false + positives. + </para>
What sources of false positives do you intend to retain?
--- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -20,6 +20,7 @@ #include "postgres.h"#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}+/* + * Write a backup block if needed when we are setting a hint. + * + * Deciding the "if needed" bit is delicate and requires us to either + * grab WALInsertLock or check the info_lck spinlock. If we check the + * spinlock and it says Yes then we will need to get WALInsertLock as well, + * so the design choice here is to just go straight for the WALInsertLock + * and trust that calls to this function are minimised elsewhere. + * + * Callable while holding share lock on the buffer content. + * + * Possible that multiple concurrent backends could attempt to write + * WAL records. In that case, more than one backup block may be recorded + * though that isn't important to the outcome and the backup blocks are + * likely to be identical anyway. + */ +#define SMGR_HINT_WATERMARK 13579 +void +smgr_buffer_hint(Buffer buffer) +{ + /* + * Make an XLOG entry reporting the hint + */ + XLogRecPtr lsn; + XLogRecData rdata[2]; + int watermark = SMGR_HINT_WATERMARK; + + /* + * Not allowed to have zero-length records, so use a small watermark + */ + rdata[0].data = (char *) (&watermark); + rdata[0].len = sizeof(int); + rdata[0].buffer = InvalidBuffer; + rdata[0].buffer_std = false; + rdata[0].next = &(rdata[1]); + + rdata[1].data = NULL; + rdata[1].len = 0; + rdata[1].buffer = buffer; + rdata[1].buffer_std = true; + rdata[1].next = NULL; + + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata); + + /* + * Set the page LSN if we wrote a backup block. + */ + if (!XLByteEQ(InvalidXLogRecPtr, lsn)) + { + Page page = BufferGetPage(buffer); + PageSetLSN(page, lsn);
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.
Need a PageSetTLI() here, too.
+ elog(LOG, "inserted backup block for hint bit"); + } +}
--- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c
@@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED)) { + /* + * If we're writing checksums and we care about torn pages then we + * cannot dirty a page during recovery as a result of a hint. + * We can set the hint, just not dirty the page as a result. + * + * See long discussion in bufpage.c + */ + if (HintsMustNotDirtyPage()) + return;
This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.
+ + /* + * Write a full page into WAL iff this is the first change on the + * block since the last checkpoint. That will never be the case + * if the block is already dirty because we either made a change + * or set a hint already. Note that aggressive cleaning of blocks + * dirtied by hint bit setting would increase the call rate. + * Bulk setting of hint bits would reduce the call rate... + * + * We must issue the WAL record before we mark the buffer dirty. + * Otherwise we might write the page before we write the WAL. + * That causes a race condition, since a checkpoint might + * occur between writing the WAL record and marking the buffer dirty. + * We solve that with a kluge, but one that is already in use + * during transaction commit to prevent race conditions. + * Basically, we simply prevent the checkpoint WAL record from + * being written until we have marked the buffer dirty. We don't + * start the checkpoint flush until we have marked dirty, so our + * checkpoint must flush the change to disk successfully or the + * checkpoint never gets written, so crash recovery will set us right. + * + * XXX rename PGPROC variable later; keep it same now for clarity + */ + MyPgXact->inCommit = true; + smgr_buffer_hint(buffer); + LockBufHdr(bufHdr); Assert(bufHdr->refcount > 0); if (!(bufHdr->flags & BM_DIRTY)) @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) } bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); UnlockBufHdr(bufHdr); + MyPgXact->inCommit = false; } }diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 096d36a..a220310 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, /* Find smgr relation for buffer */ oreln = smgropen(bufHdr->tag.rnode, MyBackendId);+ /* XXX do we want to write checksums for local buffers? An option? */ +
Don't we have that, now that it happens in mdwrite()?
I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.
--- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
I get a 404 for that link.
--- a/src/include/storage/bufpage.h +++ b/src/include/storage/bufpage.h
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */
The comment is not consistent with the character of the value.
Thanks,
nm
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales@vmware.com> wrote:
I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were every another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()?
The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way. It seems like the
right place for this is in some higher layer, like bufmgr.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)
Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages.
Dan
----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "Noah Misch" <noah@leadboat.com>, "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Andres Freund" <andres@anarazel.de>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, david@fetter.org, aidan@highrise.ca, stark@mit.edu, pgsql-hackers@postgresql.org, "Simon Riggs" <simon@2ndquadrant.com>
Sent: Friday, January 27, 2012 5:19:32 AM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales@vmware.com> wrote:
I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were every another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()?
The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way. It seems like the
right place for this is in some higher layer, like bufmgr.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales <scales@vmware.com> wrote:
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)
I'll have another look at that. Seems like we can make it various
ways, we just need to decide the code placement.
Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages.
You don't need to checksum the extend because no data is written at
that point. It create a new block that will become dirty at some point
and then be written out, which will trigger the checksum.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28.01.2012 15:49, Simon Riggs wrote:
On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales<scales@vmware.com> wrote:
Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages.
You don't need to checksum the extend because no data is written at
that point.
That's not correct. smgrextend writes a block of data just like
smgrwrite does. When a relation is extended by the buffer manager, it
calls smgrextend with an all-zeros block, but not all callers do that.
See rewriteheap.c and nbtsort.c for counter-examples.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales <scales@vmware.com> wrote:
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)
Maybe we should have some sort of wrapper function, then. I really
dislike the idea that the smgr layer knows anything about the page
format, and if md has to know that's even worse.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote:
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.ISTM we can't write and copy at the same time because the cheksum is
not a trailer field.Ofcourse you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes and run the
checksum and compare with the extracted checksum.Not pretty, but I don't think it makes a difference in performence.
Sorry to be late replying to this, but an interesting idea would be to
zero the _hint_ bits before doing the CRC checksum. That would avoid
the problem of WAL-logging the hint bits.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. �Do you really think this is
the last page layout change we'll ever make?No, I don't. I hope and expect the next page layout change to
reintroduce such a field.But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.
Have you considered the CRC might match a valuid page version number?
Is that safe?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?No, I don't. I hope and expect the next page layout change to
reintroduce such a field.But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.Have you considered the CRC might match a valuid page version number?
Is that safe?
In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. �Do you really think this is
the last page layout change we'll ever make?No, I don't. I hope and expect the next page layout change to
reintroduce such a field.But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.Have you considered the CRC might match a valuid page version number?
Is that safe?In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.
Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.
I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.
I don't want to get to a point where we need a bit per page number
version.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 06.02.2012 05:48, Bruce Momjian wrote:
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.
There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
In any case, I think it's a very bad idea to remove the page version
field. How are we supposed to ever be able to change the page format
again if we don't even have a version number on the page? I strongly
oppose removing it.
I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software of firmware bug or hardware failure that would cause pd_flags
field to be zeroed out altogether. It would be more robust if the
expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to
deal with that at upgrade somehow. And it still feels a bit whacky anyway.
I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.
Three bits seems short-sighted, but yeah, something like 6-8 bits should
be enough. On the whole, though. I think we should bite the bullet and
invent a way to extend the page header at upgrade.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?No, I don't. I hope and expect the next page layout change to
reintroduce such a field.But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.Have you considered the CRC might match a valuid page version number?
Is that safe?In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.I don't want to get to a point where we need a bit per page number
version.
Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 06.02.2012 05:48, Bruce Momjian wrote:
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.
ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.
That leaves the same problem that if someone writes to the page you
need to freeze the rows first.
In any case, I think it's a very bad idea to remove the page version field.
How are we supposed to ever be able to change the page format again if we
don't even have a version number on the page? I strongly oppose removing it.
Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.
I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software of firmware bug or hardware failure that would cause pd_flags field
to be zeroed out altogether. It would be more robust if the expected
bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
at upgrade somehow. And it still feels a bit whacky anyway.
Good idea. Lets use
0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksum
So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.Three bits seems short-sighted, but yeah, something like 6-8 bits should be
enough. On the whole, though. I think we should bite the bullet and invent a
way to extend the page header at upgrade.
There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06.02.2012 10:05, Simon Riggs wrote:
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:On 06.02.2012 05:48, Bruce Momjian wrote:
Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.phpAs you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.
Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid
wraparound altogether, by allowing clog to grow indefinitely. You do
want to freeze at some point of course, to truncate the clog, but it
would be nice to not have a hard limit. The way to do that is to store
an xid "epoch" in the page header, so that Xids are effectively 64-bits
wide, even though the xid fields on the tuple header are only 32-bits
wide. That does require a new field in the page header.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 06.02.2012 10:05, Simon Riggs wrote:
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:On 06.02.2012 05:48, Bruce Momjian wrote:
Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.phpAs you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
altogether, by allowing clog to grow indefinitely. You do want to freeze at
some point of course, to truncate the clog, but it would be nice to not have
a hard limit. The way to do that is to store an xid "epoch" in the page
header, so that Xids are effectively 64-bits wide, even though the xid
fields on the tuple header are only 32-bits wide. That does require a new
field in the page header.
We wouldn't need to do that would we?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06.02.2012 11:25, Simon Riggs wrote:
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
altogether, by allowing clog to grow indefinitely. You do want to freeze at
some point of course, to truncate the clog, but it would be nice to not have
a hard limit. The way to do that is to store an xid "epoch" in the page
header, so that Xids are effectively 64-bits wide, even though the xid
fields on the tuple header are only 32-bits wide. That does require a new
field in the page header.We wouldn't need to do that would we?
Huh? Do you mean that we wouldn't need to implement that feature?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 06, 2012 at 09:05:19AM +0000, Simon Riggs wrote:
In any case, I think it's a very bad idea to remove the page version field.
How are we supposed to ever be able to change the page format again if we
don't even have a version number on the page? I strongly oppose removing it.Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.
Agreed. I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero. The
idea was to move the version number from 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum. I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.
Specifically, this becomes the checksum:
uint16 pd_pagesize_version;
and this holds the page version, if we have updated the page to the new
format:
uint16 pd_flags; /* flag bits, see below */
Of the 16 bits of pd_flags, these are the only ones used:
#define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */
#define PD_PAGE_FULL 0x0002 /* not enough free space for new
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software of firmware bug or hardware failure that would cause pd_flags field
to be zeroed out altogether. It would be more robust if the expected
bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
at upgrade somehow. And it still feels a bit whacky anyway.
Good idea. Lets use
0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksumSo we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go. I
don't see how upgrade figures into this.
However, I am unclear how Simon's idea above actually works. We need
two bits for redundancy, both 1, to mark a page as having a checksum. I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.
I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
on the page, and we use 0x32 and later for the page version number. We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward. If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.
I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. �I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.Three bits seems short-sighted, but yeah, something like 6-8 bits should be
enough. On the whole, though. I think we should bite the bullet and invent a
way to extend the page header at upgrade.There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.
Right, but I am thinking we should set things up so we can grow the page
version number into the unused bit, rather than box it between bits we
are already using.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.Three bits seems short-sighted, but yeah, something like 6-8 bits
should be enough. On the whole, though. I think we should bite the
bullet and invent a way to extend the page header at upgrade.
I just emailed a possible layout that allows for future page version
number expansion.
I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade. If it is going to be done, it would have to
happen in the backend while the system is running. Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software of firmware bug or hardware failure that would cause pd_flags field
to be zeroed out altogether. It would be more robust if the expected
bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
at upgrade somehow. And it still feels a bit whacky anyway.Good idea. Lets use
0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksumSo we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go. I
don't see how upgrade figures into this.However, I am unclear how Simon's idea above actually works. We need
two bits for redundancy, both 1, to mark a page as having a checksum. I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
on the page, and we use 0x32 and later for the page version number. We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward. If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.
A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
v7
Thanks very much for the review. Just realised I hadn't actually replied...
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.
We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.
Still checking all, but not as bad as the list looks.
I'm removing chksum code from smgr and putting back into bufmgr and
other locations.
Patch footprint now looks like this.
doc/src/sgml/config.sgml | 40 +++
src/backend/access/hash/hashpage.c | 1
src/backend/access/heap/rewriteheap.c | 4
src/backend/access/heap/visibilitymap.c | 1
src/backend/access/nbtree/nbtree.c | 1
src/backend/access/nbtree/nbtsort.c | 3
src/backend/access/spgist/spginsert.c | 2
src/backend/access/transam/twophase.c | 16 -
src/backend/access/transam/xact.c | 8
src/backend/access/transam/xlog.c | 114 ++++++++
src/backend/commands/tablecmds.c | 2
src/backend/storage/buffer/bufmgr.c | 75 +++++
src/backend/storage/buffer/localbuf.c | 2
src/backend/storage/ipc/procarray.c | 63 +++-
src/backend/storage/lmgr/proc.c | 4
src/backend/storage/page/bufpage.c | 331 +++++++++++++++++++++++++-
src/backend/utils/misc/guc.c | 14 +
src/backend/utils/misc/postgresql.conf.sample | 15 -
src/include/access/xlog.h | 1
src/include/catalog/pg_control.h | 3
src/include/catalog/storage.h | 1
src/include/storage/bufpage.h | 107 ++++++--
src/include/storage/proc.h | 3
src/include/storage/procarray.h | 4
24 files changed, 743 insertions(+), 72 deletions(-)
Patch is *not* attached here, yet. Still working on it.
When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.
Agreed, changed.
--- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml+ <para> + Turning this parameter off speeds normal operation, but + might allow data corruption to go unnoticed. The checksum uses + 16-bit checksums, using the fast Fletcher 16 algorithm. With this + parameter enabled there is still a non-zero probability that an error + could go undetected, as well as a non-zero probability of false + positives. + </para>What sources of false positives do you intend to retain?
I see none. Will reword.
--- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -20,6 +20,7 @@ #include "postgres.h"#include "access/visibilitymap.h" +#include "access/transam.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_HINT 0x40typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}+/* + * Write a backup block if needed when we are setting a hint. + * + * Deciding the "if needed" bit is delicate and requires us to either + * grab WALInsertLock or check the info_lck spinlock. If we check the + * spinlock and it says Yes then we will need to get WALInsertLock as well, + * so the design choice here is to just go straight for the WALInsertLock + * and trust that calls to this function are minimised elsewhere. + * + * Callable while holding share lock on the buffer content. + * + * Possible that multiple concurrent backends could attempt to write + * WAL records. In that case, more than one backup block may be recorded + * though that isn't important to the outcome and the backup blocks are + * likely to be identical anyway. + */ +#define SMGR_HINT_WATERMARK 13579 +void +smgr_buffer_hint(Buffer buffer) +{ + /* + * Make an XLOG entry reporting the hint + */ + XLogRecPtr lsn; + XLogRecData rdata[2]; + int watermark = SMGR_HINT_WATERMARK; + + /* + * Not allowed to have zero-length records, so use a small watermark + */ + rdata[0].data = (char *) (&watermark); + rdata[0].len = sizeof(int); + rdata[0].buffer = InvalidBuffer; + rdata[0].buffer_std = false; + rdata[0].next = &(rdata[1]); + + rdata[1].data = NULL; + rdata[1].len = 0; + rdata[1].buffer = buffer; + rdata[1].buffer_std = true; + rdata[1].next = NULL; + + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata); + + /* + * Set the page LSN if we wrote a backup block. + */ + if (!XLByteEQ(InvalidXLogRecPtr, lsn)) + { + Page page = BufferGetPage(buffer); + PageSetLSN(page, lsn);PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.
Am serialising this by only writing PageLSN while holding buf hdr lock.
Need a PageSetTLI() here, too.
Agreed
+ elog(LOG, "inserted backup block for hint bit"); + } +}--- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c@@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED)) { + /* + * If we're writing checksums and we care about torn pages then we + * cannot dirty a page during recovery as a result of a hint. + * We can set the hint, just not dirty the page as a result. + * + * See long discussion in bufpage.c + */ + if (HintsMustNotDirtyPage()) + return;This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.
Yes, logic corrected.
+ + /* + * Write a full page into WAL iff this is the first change on the + * block since the last checkpoint. That will never be the case + * if the block is already dirty because we either made a change + * or set a hint already. Note that aggressive cleaning of blocks + * dirtied by hint bit setting would increase the call rate. + * Bulk setting of hint bits would reduce the call rate... + * + * We must issue the WAL record before we mark the buffer dirty. + * Otherwise we might write the page before we write the WAL. + * That causes a race condition, since a checkpoint might + * occur between writing the WAL record and marking the buffer dirty. + * We solve that with a kluge, but one that is already in use + * during transaction commit to prevent race conditions. + * Basically, we simply prevent the checkpoint WAL record from + * being written until we have marked the buffer dirty. We don't + * start the checkpoint flush until we have marked dirty, so our + * checkpoint must flush the change to disk successfully or the + * checkpoint never gets written, so crash recovery will set us right. + * + * XXX rename PGPROC variable later; keep it same now for clarity + */ + MyPgXact->inCommit = true; + smgr_buffer_hint(buffer); + LockBufHdr(bufHdr); Assert(bufHdr->refcount > 0); if (!(bufHdr->flags & BM_DIRTY)) @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) } bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); UnlockBufHdr(bufHdr); + MyPgXact->inCommit = false; } }
Found another bug in that section also, relating to inCommit handling.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 096d36a..a220310 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, /* Find smgr relation for buffer */ oreln = smgropen(bufHdr->tag.rnode, MyBackendId);+ /* XXX do we want to write checksums for local buffers? An option? */ +Don't we have that, now that it happens in mdwrite()?
I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.
I'm continuing to exclude local buffers. Let me know if that should change.
--- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
I get a 404 for that link.
Changed
--- a/src/include/storage/bufpage.h +++ b/src/include/storage/bufpage.h-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */The comment is not consistent with the character of the value.
OK
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. ?One could not simply condition them on the
page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. ?If that write tears,
leaving the checksum intact, that block will now fail verification. ?A couple
of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.Am serialising this by only writing PageLSN while holding buf hdr lock.
That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?
I can see value in an option to exclude local buffers, since corruption there
may be less exciting. ?It doesn't seem important for an initial patch, though.I'm continuing to exclude local buffers. Let me know if that should change.
Seems reasonable.
Thanks,
nm
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. ?One could not simply condition them on the
page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. ?If that write tears,
leaving the checksum intact, that block will now fail verification. ?A couple
of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.
Sorry, I don't understand what you mean. I don't see any failure cases
that require that.
page_checksums can only change at a shutdown checkpoint,
The statement "If that write tears,
leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.
If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.
If that's wrong, please explain a failure case in detail.
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.Am serialising this by only writing PageLSN while holding buf hdr lock.
That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?
Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. ?One could not simply condition them on the
page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. ?If that write tears,
leaving the checksum intact, that block will now fail verification. ?A couple
of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.Sorry, I don't understand what you mean. I don't see any failure cases
that require that.page_checksums can only change at a shutdown checkpoint,
The statement "If that write tears,
leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.
If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both. (This is
just the normal torn-page scenario.) Given that, consider the following
sequence of events:
1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown
5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT
9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown
12. startup with page_checksums = on
13. read P. checksum failure
Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL. That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.Am serialising this by only writing PageLSN while holding buf hdr lock.
That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. ?Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.
The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header
spinlock at either of those locations. If they remain safe under the new
rules, we'll need comments explaining why. I think they may indeed be safe,
but it's far from obvious.
Thanks,
nm
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. ?One could not simply condition them on the
page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. ?If that write tears,
leaving the checksum intact, that block will now fail verification. ?A couple
of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.Sorry, I don't understand what you mean. I don't see any failure cases
that require that.page_checksums can only change at a shutdown checkpoint,
The statement "If that write tears,
leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.
If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both. (This is
just the normal torn-page scenario.) Given that, consider the following
sequence of events:1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown12. startup with page_checksums = on
13. read P. checksum failureAgain, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL. That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.
Thanks for explaining.
Logic fixed.
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.Am serialising this by only writing PageLSN while holding buf hdr lock.
That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. ?Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header
spinlock at either of those locations. If they remain safe under the new
rules, we'll need comments explaining why. I think they may indeed be safe,
but it's far from obvious.
OK, some slight rework required to do that, no problems.
I checked all other call sites and have updated README to explain.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 9, 2012 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. ?One could not simply condition them on the
page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. ?If that write tears,
leaving the checksum intact, that block will now fail verification. ?A couple
of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.We can only change page_checksums at restart (now) so the above seems moot.
If we crash with FPWs enabled we repair any torn pages.
There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.Sorry, I don't understand what you mean. I don't see any failure cases
that require that.page_checksums can only change at a shutdown checkpoint,
The statement "If that write tears,
leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.
If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both. (This is
just the normal torn-page scenario.) Given that, consider the following
sequence of events:1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown12. startup with page_checksums = on
13. read P. checksum failureAgain, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL. That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.Thanks for explaining.
Logic fixed.
PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.Am serialising this by only writing PageLSN while holding buf hdr lock.
That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. ?Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header
spinlock at either of those locations. If they remain safe under the new
rules, we'll need comments explaining why. I think they may indeed be safe,
but it's far from obvious.OK, some slight rework required to do that, no problems.
I checked all other call sites and have updated README to explain.
v8 attached
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v8c.patchtext/x-diff; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v8c.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1712,1717 **** SET ENABLE_SEQSCAN TO OFF;
--- 1712,1773 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected.
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ </para>
+
+ <para>
+ When this parameter is off we write only the page version number and
+ blocksize as a standard watermark. If we read in a block that has a
+ checksum yet <varname>page_checksums</varname> is disabled we do not
+ verify the checksum value. Such pages will be reset to the standard
+ watermark if they are rewritten for any reason.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed at any time.
+ </para>
+
+ <para>
+ Turning this parameter off reduces the CPU overhead of reading/writing
+ data from disk though might allow data corruption to go unnoticed.
+ With this parameter enabled there is still a non-zero probability that
+ an error would go undetected.
+ </para>
+
+ <para>
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ Later releases may increase the size of the checksum field and offer
+ alternative checksum or CRC algorithms.
+ </para>
+
+ <para>
+ The default is <literal>off</> for backwards compatibility and
+ to allow upgrade. The recommended setting is <literal>on</> though
+ this should not be enabled until upgrade is successfully complete
+ with full set of new backups.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
*** a/src/backend/access/hash/hashpage.c
--- b/src/backend/access/hash/hashpage.c
***************
*** 730,735 **** _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
--- 730,736 ----
MemSet(zerobuf, 0, sizeof(zerobuf));
RelationOpenSmgr(rel);
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
return true;
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
***************
*** 266,271 **** end_heap_rewrite(RewriteState state)
--- 266,273 ----
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
+ PageSetVerificationInfoInplace(state->rs_buffer);
+
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
***************
*** 611,616 **** raw_heap_insert(RewriteState state, HeapTuple tup)
--- 613,620 ----
{
/* Doesn't fit, so write out the existing page */
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 568,573 **** vm_extend(Relation rel, BlockNumber vm_nblocks)
--- 568,574 ----
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
(char *) pg, false);
vm_nblocks_now++;
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 216,221 **** btbuildempty(PG_FUNCTION_ARGS)
--- 216,222 ----
_bt_initmetapage(metapage, P_NONE, 0);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(metapage);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
(char *) metapage, true);
if (XLogIsNeeded())
*** a/src/backend/access/nbtree/nbtsort.c
--- b/src/backend/access/nbtree/nbtsort.c
***************
*** 288,299 **** _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
--- 288,302 ----
{
if (!wstate->btws_zeropage)
wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ /* no need to set page verification info for all zero pages */
smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
wstate->btws_pages_written++,
(char *) wstate->btws_zeropage,
true);
}
+ PageSetVerificationInfoInplace(page);
+
/*
* Now write the page. There's no need for smgr to schedule an fsync for
* this write; we'll do it ourselves before ending the build.
*** a/src/backend/access/spgist/spginsert.c
--- b/src/backend/access/spgist/spginsert.c
***************
*** 150,155 **** spgbuildempty(PG_FUNCTION_ARGS)
--- 150,156 ----
SpGistInitMetapage(page);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
***************
*** 159,164 **** spgbuildempty(PG_FUNCTION_ARGS)
--- 160,166 ----
/* Likewise for the root page. */
SpGistInitPage(page, SPGIST_LEAF);
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_HEAD_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 527,532 **** associated with the n'th distinct buffer ID seen in the "rdata" array, and
--- 527,540 ----
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
+ Note that we must only use PageSetLSN/PageGetLSN() when we know the action
+ is serialised. Only Startup process may modify data blocks during recovery,
+ so Startup process may execute PageGetLSN() without fear of serialisation
+ problems. All other processes must only call PageSet/GetLSN when holding
+ either an exclusive buffer lock or a shared lock plus buffer header lock,
+ or be writing the data block directly rather than through shared buffers
+ while holding AccessExclusiveLock on the relation.
+
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 320,326 **** MarkAsPreparing(TransactionId xid, const char *gid,
proc->lxid = (LocalTransactionId) xid;
pgxact->xid = xid;
pgxact->xmin = InvalidTransactionId;
! pgxact->inCommit = false;
pgxact->vacuumFlags = 0;
proc->pid = 0;
proc->backendId = InvalidBackendId;
--- 320,326 ----
proc->lxid = (LocalTransactionId) xid;
pgxact->xid = xid;
pgxact->xmin = InvalidTransactionId;
! pgxact->delayChkpt = false;
pgxact->vacuumFlags = 0;
proc->pid = 0;
proc->backendId = InvalidBackendId;
***************
*** 1028,1045 **** EndPrepare(GlobalTransaction gxact)
* odds of a PANIC actually occurring should be very tiny given that we
* were able to write the bogus CRC above.
*
! * We have to set inCommit here, too; otherwise a checkpoint starting
* immediately after the WAL record is inserted could complete without
* fsync'ing our state file. (This is essentially the same kind of race
* condition as the COMMIT-to-clog-write case that RecordTransactionCommit
! * uses inCommit for; see notes there.)
*
* We save the PREPARE record's location in the gxact for later use by
* CheckPointTwoPhase.
*/
START_CRIT_SECTION();
! MyPgXact->inCommit = true;
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
--- 1028,1045 ----
* odds of a PANIC actually occurring should be very tiny given that we
* were able to write the bogus CRC above.
*
! * We have to set delayChkpt here, too; otherwise a checkpoint starting
* immediately after the WAL record is inserted could complete without
* fsync'ing our state file. (This is essentially the same kind of race
* condition as the COMMIT-to-clog-write case that RecordTransactionCommit
! * uses delayChkpt for; see notes there.)
*
* We save the PREPARE record's location in the gxact for later use by
* CheckPointTwoPhase.
*/
START_CRIT_SECTION();
! MyPgXact->delayChkpt = true;
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
***************
*** 1087,1093 **** EndPrepare(GlobalTransaction gxact)
* checkpoint starting after this will certainly see the gxact as a
* candidate for fsyncing.
*/
! MyPgXact->inCommit = false;
END_CRIT_SECTION();
--- 1087,1093 ----
* checkpoint starting after this will certainly see the gxact as a
* candidate for fsyncing.
*/
! MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
***************
*** 1971,1977 **** RecoverPreparedTransactions(void)
* RecordTransactionCommitPrepared
*
* This is basically the same as RecordTransactionCommit: in particular,
! * we must set the inCommit flag to avoid a race condition.
*
* We know the transaction made at least one XLOG entry (its PREPARE),
* so it is never possible to optimize out the commit record.
--- 1971,1977 ----
* RecordTransactionCommitPrepared
*
* This is basically the same as RecordTransactionCommit: in particular,
! * we must set the delayChkpt flag to avoid a race condition.
*
* We know the transaction made at least one XLOG entry (its PREPARE),
* so it is never possible to optimize out the commit record.
***************
*** 1994,2000 **** RecordTransactionCommitPrepared(TransactionId xid,
START_CRIT_SECTION();
/* See notes in RecordTransactionCommit */
! MyPgXact->inCommit = true;
/* Emit the XLOG commit record */
xlrec.xid = xid;
--- 1994,2000 ----
START_CRIT_SECTION();
/* See notes in RecordTransactionCommit */
! MyPgXact->delayChkpt = true;
/* Emit the XLOG commit record */
xlrec.xid = xid;
***************
*** 2059,2065 **** RecordTransactionCommitPrepared(TransactionId xid,
TransactionIdCommitTree(xid, nchildren, children);
/* Checkpoint can proceed now */
! MyPgXact->inCommit = false;
END_CRIT_SECTION();
--- 2059,2065 ----
TransactionIdCommitTree(xid, nchildren, children);
/* Checkpoint can proceed now */
! MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 975,987 **** RecordTransactionCommit(void)
* RecordTransactionAbort. That's because loss of a transaction abort
* is noncritical; the presumption would be that it aborted, anyway.
*
! * It's safe to change the inCommit flag of our own backend without
* holding the ProcArrayLock, since we're the only one modifying it.
! * This makes checkpoint's determination of which xacts are inCommit a
* bit fuzzy, but it doesn't matter.
*/
START_CRIT_SECTION();
! MyPgXact->inCommit = true;
SetCurrentTransactionStopTimestamp();
--- 975,987 ----
* RecordTransactionAbort. That's because loss of a transaction abort
* is noncritical; the presumption would be that it aborted, anyway.
*
! * It's safe to change the delayChkpt flag of our own backend without
* holding the ProcArrayLock, since we're the only one modifying it.
! * This makes checkpoint's determination of which xacts are delayChkpt a
* bit fuzzy, but it doesn't matter.
*/
START_CRIT_SECTION();
! MyPgXact->delayChkpt = true;
SetCurrentTransactionStopTimestamp();
***************
*** 1155,1161 **** RecordTransactionCommit(void)
*/
if (markXidCommitted)
{
! MyPgXact->inCommit = false;
END_CRIT_SECTION();
}
--- 1155,1161 ----
*/
if (markXidCommitted)
{
! MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
}
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 733,738 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 733,739 ----
bool doPageWrites;
bool isLogSwitch = false;
bool fpwChange = false;
+ bool isHint = false;
uint8 info_orig = info;
/* cross-check on whether we should be here or not */
***************
*** 760,765 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 761,770 ----
fpwChange = true;
break;
+ case XLOG_HINT:
+ isHint = true;
+ break;
+
default:
break;
}
***************
*** 998,1003 **** begin:;
--- 1003,1020 ----
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (isHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
***************
*** 1280,1286 **** XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
/*
* XXX We assume page LSN is first data on *every* page that can be passed
* to XLogInsert, whether it otherwise has the standard page layout or
! * not.
*/
*lsn = PageGetLSN(page);
--- 1297,1304 ----
/*
* XXX We assume page LSN is first data on *every* page that can be passed
* to XLogInsert, whether it otherwise has the standard page layout or
! * not. We don't need the buffer header lock for PageGetLSN because we
! * have exclusive lock on the page and/or the relation.
*/
*lsn = PageGetLSN(page);
***************
*** 3724,3729 **** RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
--- 3742,3753 ----
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * and changing data when we don't have a cleanup lock is bad juju.
+ */
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
***************
*** 7653,7660 **** CreateCheckPoint(int flags)
uint32 redo_logSeg;
uint32 insert_logId;
uint32 insert_logSeg;
! TransactionId *inCommitXids;
! int nInCommit;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
--- 7677,7684 ----
uint32 redo_logSeg;
uint32 insert_logId;
uint32 insert_logSeg;
! VirtualTransactionId *vxids;
! int nvxids;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7828,7834 **** CreateCheckPoint(int flags)
/*
* Before flushing data, we must wait for any transactions that are
! * currently in their commit critical sections. If an xact inserted its
* commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_clog. So
--- 7852,7858 ----
/*
* Before flushing data, we must wait for any transactions that are
! * currently in commit or hint critical sections. If an xact inserted its
* commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_clog. So
***************
*** 7844,7864 **** CreateCheckPoint(int flags)
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
! * A transaction that has not yet set inCommit when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
! nInCommit = GetTransactionsInCommit(&inCommitXids);
! if (nInCommit > 0)
{
do
{
pg_usleep(10000L); /* wait for 10 msec */
! } while (HaveTransactionsInCommit(inCommitXids, nInCommit));
}
! pfree(inCommitXids);
/*
* Get the other info we need for the checkpoint record.
--- 7868,7888 ----
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
! * A transaction that has not yet set delayChkpt when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
! vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
! if (nvxids > 0)
{
do
{
pg_usleep(10000L); /* wait for 10 msec */
! } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
! pfree(vxids);
/*
* Get the other info we need for the checkpoint record.
***************
*** 8443,8448 **** XLogRestorePoint(const char *rpName)
--- 8467,8517 ----
}
/*
+ * Write a backup block if needed when we are setting a hint. Note that
+ * this may be called for a variety of page types, not just heaps.
+ *
+ * Deciding the "if needed" part is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding just share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+ #define XLOG_HINT_WATERMARK 13579
+ XLogRecPtr
+ XLogRecordHint(Buffer buffer)
+ {
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecData rdata[2];
+ int watermark = XLOG_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
+ }
+
+ /*
* Check if any of the GUC parameters that are critical for hot standby
* have changed, and update the value in pg_control file if necessary.
*/
***************
*** 8540,8547 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
! /* Backup blocks are not used in xlog records */
! Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
if (info == XLOG_NEXTOID)
{
--- 8609,8616 ----
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
! /* Backup blocks are not used in most xlog records */
! Assert(info == XLOG_HINT || !(record->xl_info & XLR_BKP_BLOCK_MASK));
if (info == XLOG_NEXTOID)
{
***************
*** 8681,8686 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 8750,8781 ----
{
/* nothing to do here */
}
+ else if (info == XLOG_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == XLOG_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there must be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ *
+ * This also means there is no corresponding API call for this,
+ * so an smgr implementation has no need to implement anything.
+ * Which means nothing is needed in md.c etc
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
***************
*** 8816,8821 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 8911,8920 ----
appendStringInfo(buf, "restore point: %s", xlrec->rp_name);
}
+ else if (info == XLOG_HINT)
+ {
+ appendStringInfo(buf, "page hint");
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 8537,8542 **** copy_relation_data(SMgrRelation src, SMgrRelation dst,
--- 8537,8544 ----
smgrread(src, forkNum, blkno, buf);
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (use_wal)
log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 34,39 ****
--- 34,40 ----
#include <unistd.h>
#include "catalog/catalog.h"
+ #include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 440,446 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageHeaderIsValid((PageHeader) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
--- 441,447 ----
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
! if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
***************
*** 465,471 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/* Only need to adjust flags */
bufHdr->flags |= BM_VALID;
! }
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
--- 466,474 ----
{
/* Only need to adjust flags */
bufHdr->flags |= BM_VALID;
! [
!
! }
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
***************
*** 635,648 **** BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* victim. We need lock to inspect the page LSN, so this
* can't be done inside StrategyGetBuffer.
*/
! if (strategy != NULL &&
! XLogNeedsFlush(BufferGetLSN(buf)) &&
! StrategyRejectBuffer(strategy, buf))
{
! /* Drop lock/pin and loop around for another buffer */
! LWLockRelease(buf->content_lock);
! UnpinBuffer(buf, true);
! continue;
}
/* OK, do the I/O */
--- 638,660 ----
* victim. We need lock to inspect the page LSN, so this
* can't be done inside StrategyGetBuffer.
*/
! if (strategy != NULL)
{
! XLogRecPtr lsn;
!
! /* Read the LSN while holding buffer header lock */
! LockBufHdr(buf);
! lsn = BufferGetLSN(buf);
! UnlockBufHdr(buf);
!
! if (XLogNeedsFlush(lsn) &&
! StrategyRejectBuffer(strategy, buf))
! {
! /* Drop lock/pin and loop around for another buffer */
! LWLockRelease(buf->content_lock);
! UnpinBuffer(buf, true);
! continue;
! }
}
/* OK, do the I/O */
***************
*** 1873,1878 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1885,1892 ----
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufToWrite;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
***************
*** 1901,1912 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL
* rule that log updates must hit disk before any of the data-file changes
* they describe do.
*/
- recptr = BufferGetLSN(buf);
XLogFlush(recptr);
/*
--- 1915,1936 ----
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
+ LockBufHdr(buf);
+
+ /*
+ * Run PageGetLSN while holding header lock.
+ */
+ recptr = BufferGetLSN(buf);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf->flags &= ~BM_JUST_DIRTIED;
+ UnlockBufHdr(buf);
+
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL
* rule that log updates must hit disk before any of the data-file changes
* they describe do.
*/
XLogFlush(recptr);
/*
***************
*** 1915,1929 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* we have the io_in_progress lock.
*/
! /* To check if block content changes while flushing. - vadim 01/17/97 */
! LockBufHdr(buf);
! buf->flags &= ~BM_JUST_DIRTIED;
! UnlockBufHdr(buf);
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
! (char *) BufHdrGetBlock(buf),
false);
pgBufferUsage.shared_blks_written++;
--- 1939,1955 ----
* we have the io_in_progress lock.
*/
! bufBlock = BufHdrGetBlock(buf);
+ bufToWrite = PageSetVerificationInfoOnCopy((Page) bufBlock);
+
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
! bufToWrite,
false);
pgBufferUsage.shared_blks_written++;
***************
*** 1934,1939 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1960,1967 ----
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
***************
*** 2326,2331 **** void
--- 2354,2360 ----
SetBufferCommitInfoNeedsSave(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+ Page page = BufferGetPage(buffer);
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
***************
*** 2354,2368 **** SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
bool dirtied = false;
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
! dirtied = true;
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
if (dirtied)
{
VacuumPageDirty++;
--- 2383,2487 ----
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ XLogRecPtr lsn = {0, 0};
bool dirtied = false;
+ bool delayChkpt = false;
+
+ /*
+ * If we are currently writing checksums, or if the page has a
+ * checksum set then we must check to see if a full page image is
+ * required to protect against torn pages.
+ *
+ * If page_checksums is set then we must write a new checksum
+ * for any change, even a hint. If we write a checksum and we crash
+ * it could leave torn pages on disk. So to protect against that
+ * a hint must be written if this is the first change on the block
+ * since the last checkpoint.
+ *
+ * Even if page_checksums is off, we may also need to write a full
+ * page image. This protects against events that can occur if
+ * page_checksums is turned on and off and then on again. In that case,
+ * if a page has a checksum set and then we write a page
+ * while page_checksums = off a torn page could result in a page
+ * that has a checksums set and yet not match the actual page.
+ * So even if page_checksums = off we write a full page image if the
+ * page is marked as having a checksum, even though we are going
+ * to reset the checksum.
+ *
+ * We don't check full_page_writes here because that can be set
+ * on dynamically during a backup, so even if the usual value is
+ * off we may still need to use a full page image.
+ */
+ if (page_checksums || !PageHasNoChecksum(page))
+ {
+ /*
+ * If we're in recovery we cannot dirty a page because of a hint.
+ * We can set the hint, just not dirty the page as a result so
+ * the hint is lost when we evict the page or shutdown.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (RecoveryInProgress())
+ return;
+
+ /*
+ * If the block is already dirty because we either made a change
+ * or set a hint already, then we don't need to write a full page
+ * image. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might occur
+ * between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will fix.
+ *
+ * It's possible we may enter here without an xid, so it is
+ * essential that CreateCheckpoint waits for virtual transactions
+ * rather than full transactionids.
+ */
+ MyPgXact->delayChkpt = delayChkpt = true;
+ lsn = XLogRecordHint(buffer);
+ }
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
! {
! dirtied = true; /* Means "will be dirtied by this action" */
!
! /*
! * Set the page LSN if we wrote a backup block. We aren't
! * supposed to set this when only holding a share lock but
! * as long as we serialise it somehow we're OK. We choose to
! * set LSN while holding the buffer header lock, which causes
! * any reader of an LSN who holds only a share lock to also
! * obtain a buffer header lock before using PageGetLSN().
! * Fortunately, thats not too many places.
! *
! * If !page_checksums and PageHasChecksum() you might think
! * we should reset the checksum here. That will happen when
! * the page is written sometime later in this checkpoint cycle.
! */
! if (!XLogRecPtrIsInvalid(lsn))
! {
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! }
! }
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ if (delayChkpt)
+ MyPgXact->delayChkpt = false;
+
if (dirtied)
{
VacuumPageDirty++;
*** a/src/backend/storage/buffer/localbuf.c
--- b/src/backend/storage/buffer/localbuf.c
***************
*** 200,205 **** LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
--- 200,207 ----
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 398,404 **** ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->inCommit = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
/* Clear the subtransaction-XID cache too while holding the lock */
--- 398,404 ----
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
/* Clear the subtransaction-XID cache too while holding the lock */
***************
*** 425,431 **** ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->inCommit = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
Assert(pgxact->nxids == 0);
--- 425,431 ----
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
Assert(pgxact->nxids == 0);
***************
*** 460,466 **** ProcArrayClearTransaction(PGPROC *proc)
/* redundant, but just in case */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->inCommit = false;
/* Clear the subtransaction-XID cache too */
pgxact->nxids = 0;
--- 460,466 ----
/* redundant, but just in case */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! pgxact->delayChkpt = false;
/* Clear the subtransaction-XID cache too */
pgxact->nxids = 0;
***************
*** 1723,1787 **** GetOldestActiveTransactionId(void)
}
/*
! * GetTransactionsInCommit -- Get the XIDs of transactions that are committing
*
! * Constructs an array of XIDs of transactions that are currently in commit
! * critical sections, as shown by having inCommit set in their PGPROC entries.
*
! * *xids_p is set to a palloc'd array that should be freed by the caller.
! * The return value is the number of valid entries.
*
! * Note that because backends set or clear inCommit without holding any lock,
* the result is somewhat indeterminate, but we don't really care. Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
! * that setting of inCommit will propagate to shared memory when the backend
! * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
* it's already inserted its commit record. Whether it takes a little while
! * for clearing of inCommit to propagate is unimportant for correctness.
*/
! int
! GetTransactionsInCommit(TransactionId **xids_p)
{
ProcArrayStruct *arrayP = procArray;
! TransactionId *xids;
! int nxids;
int index;
! xids = (TransactionId *) palloc(arrayP->maxProcs * sizeof(TransactionId));
! nxids = 0;
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
! volatile PGXACT *pgxact = &allPgXact[pgprocno];
! TransactionId pxid;
! /* Fetch xid just once - see GetNewTransactionId */
! pxid = pgxact->xid;
! if (pgxact->inCommit && TransactionIdIsValid(pxid))
! xids[nxids++] = pxid;
}
LWLockRelease(ProcArrayLock);
! *xids_p = xids;
! return nxids;
}
/*
! * HaveTransactionsInCommit -- Are any of the specified XIDs in commit?
*
! * This is used with the results of GetTransactionsInCommit to see if any
! * of the specified XIDs are still in their commit critical sections.
*
! * Note: this is O(N^2) in the number of xacts that are/were in commit, but
* those numbers should be small enough for it not to be a problem.
*/
bool
! HaveTransactionsInCommit(TransactionId *xids, int nxids)
{
bool result = false;
ProcArrayStruct *arrayP = procArray;
--- 1723,1792 ----
}
/*
! * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
! * delaying checkpoint because they have critical actions in progress.
*
! * Constructs an array of VXIDs of transactions that are currently in commit
! * critical sections, as shown by having delayChkpt set in their PGXACT.
*
! * Returns a palloc'd array that should be freed by the caller.
! * *nvxids is the number of valid entries.
*
! * Note that because backends set or clear delayChkpt without holding any lock,
* the result is somewhat indeterminate, but we don't really care. Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
! * that setting of delayChkpt will propagate to shared memory when the backend
! * takes a lock, so we cannot fail to see an virtual xact as delayChkpt if
* it's already inserted its commit record. Whether it takes a little while
! * for clearing of delayChkpt to propagate is unimportant for correctness.
*/
! VirtualTransactionId *
! GetVirtualXIDsDelayingChkpt(int *nvxids)
{
+ VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
! int count = 0;
int index;
! /* allocate what's certainly enough result space */
! vxids = (VirtualTransactionId *)
! palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
! volatile PGPROC *proc = &allProcs[pgprocno];
! volatile PGXACT *pgxact = &allPgXact[pgprocno];
! if (pgxact->delayChkpt)
! {
! VirtualTransactionId vxid;
! GET_VXID_FROM_PGPROC(vxid, *proc);
! if (VirtualTransactionIdIsValid(vxid))
! vxids[count++] = vxid;
! }
}
LWLockRelease(ProcArrayLock);
! *nvxids = count;
! return vxids;
}
/*
! * HaveVirtualXIDsDelayingChkpt -- Are any of the specified VXIDs delaying?
*
! * This is used with the results of GetVirtualXIDsDelayingChkpt to see if any
! * of the specified VXIDs are still in critical sections of code.
*
! * Note: this is O(N^2) in the number of vxacts that are/were delaying, but
* those numbers should be small enough for it not to be a problem.
*/
bool
! HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
{
bool result = false;
ProcArrayStruct *arrayP = procArray;
***************
*** 1789,1818 **** HaveTransactionsInCommit(TransactionId *xids, int nxids)
LWLockAcquire(ProcArrayLock, LW_SHARED);
! for (index = 0; index < arrayP->numProcs; index++)
{
! int pgprocno = arrayP->pgprocnos[index];
! volatile PGXACT *pgxact = &allPgXact[pgprocno];
! TransactionId pxid;
!
! /* Fetch xid just once - see GetNewTransactionId */
! pxid = pgxact->xid;
!
! if (pgxact->inCommit && TransactionIdIsValid(pxid))
{
! int i;
! for (i = 0; i < nxids; i++)
{
! if (xids[i] == pxid)
{
result = true;
break;
}
}
- if (result)
- break;
}
}
LWLockRelease(ProcArrayLock);
--- 1794,1825 ----
LWLockAcquire(ProcArrayLock, LW_SHARED);
! while (VirtualTransactionIdIsValid(*vxids))
{
! for (index = 0; index < arrayP->numProcs; index++)
{
! int pgprocno = arrayP->pgprocnos[index];
! volatile PGPROC *proc = &allProcs[pgprocno];
! volatile PGXACT *pgxact = &allPgXact[pgprocno];
! VirtualTransactionId vxid;
! GET_VXID_FROM_PGPROC(vxid, *proc);
! if (VirtualTransactionIdIsValid(vxid))
{
! if (VirtualTransactionIdEquals(vxid, *vxids) &&
! pgxact->delayChkpt)
{
result = true;
break;
}
}
}
+
+ if (result)
+ break;
+
+ /* The virtual transaction is gone now, wait for the next one */
+ vxids++;
}
LWLockRelease(ProcArrayLock);
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 356,362 **** InitProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
! MyPgXact->inCommit = false;
MyPgXact->vacuumFlags = 0;
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
--- 356,362 ----
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
! MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
***************
*** 514,520 **** InitAuxiliaryProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
! MyPgXact->inCommit = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
MyProc->lwWaitMode = 0;
--- 514,520 ----
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
! MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
MyProc->lwWaitMode = 0;
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 16,21 ****
--- 16,27 ----
#include "access/htup.h"
+ bool page_checksums = false;
+
+ static Page pageCopy; /* temporary buffer to allow checksum calculation */
+
+ static bool PageVerificationInfoOK(Page page);
+ static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
***************
*** 25,30 ****
--- 31,40 ----
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
***************
*** 67,87 **** PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
! PageHeaderIsValid(PageHeader page)
{
char *pagebytes;
int i;
! /* Check normal case */
! if (PageGetPageSize(page) == BLCKSZ &&
! PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
! (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! page->pd_lower >= SizeOfPageHeaderData &&
! page->pd_lower <= page->pd_upper &&
! page->pd_upper <= page->pd_special &&
! page->pd_special <= BLCKSZ &&
! page->pd_special == MAXALIGN(page->pd_special))
! return true;
/* Check all-zeroes case */
pagebytes = (char *) page;
--- 77,102 ----
* will clean up such a page and make it usable.
*/
bool
! PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
! /*
! * Don't verify page data unless the page passes basic non-zero test
! */
! if (p->pd_lower >= SizeOfPageHeaderData)
! {
! /* Check normal case */
! if (PageVerificationInfoOK(page) &&
! (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
! p->pd_lower <= p->pd_upper &&
! p->pd_upper <= p->pd_special &&
! p->pd_special <= BLCKSZ &&
! p->pd_special == MAXALIGN(p->pd_special))
! return true;
! }
/* Check all-zeroes case */
pagebytes = (char *) page;
***************
*** 827,829 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 842,1152 ----
pfree(itemidbase);
}
+
+ /*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes protect us from that, which are stored in WAL. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on.
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page images does not check verification
+ * info via this function. In recovery, since we only dirty a block when we
+ * have a full page image available if we crash, we are fully OK to use page
+ * verification normally.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory. Note also that the WAL checksum
+ * is a CRC, whereas the page checksum is a Fletcher16 checksum, not a CRC.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data. We also require that a third bit is
+ * zeroed, to protect against a run of 1s being written to the page header.
+ *
+ * This function returns a boolean, not a full damage assessment.
+ */
+ static bool
+ PageVerificationInfoOK(Page page)
+ {
+ PageHeader p = (PageHeader) page;
+
+ /* Quick exit if nobody cares about checksumming */
+ if (!page_checksums)
+ return true;
+
+ if (PageHasChecksum(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+ #ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+ #endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (PageHasNoChecksum(p))
+ {
+ uint16 pd_pagesize_version = BLCKSZ | PG_PAGE_LAYOUT_VERSION;
+
+ if (p->pd_verify.pd_pagesize_version == pd_pagesize_version)
+ return true;
+
+ elog(LOG, "page verification failed - page size and version set incorrectly %u",
+ p->pd_verify.pd_checksum16);
+ }
+ else
+ elog(LOG, "page verification failed - page flags are incorrectly set");
+
+ return false;
+ }
+
+ #define PageSetChecksum(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags |= PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+ } while (0);
+
+ /* ensure any older checksum info is overwritten with watermark */
+ #define PageResetVersion(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ if (!PageHasNoChecksum(p)) \
+ { \
+ p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags &= ~PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+ } \
+ } while (0);
+
+ /*
+ * Set verification info for page in shared buffers.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+ char *
+ PageSetVerificationInfoOnCopy(Page page)
+ {
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ if (pageCopy == NULL)
+ {
+ pageCopy = (Page) malloc(BLCKSZ);
+ if (pageCopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy((char *) pageCopy, (char *) page, BLCKSZ);
+ PageSetChecksum(pageCopy);
+ return (char *) pageCopy;
+ }
+
+ PageResetVersion(page);
+ return (char *) page;
+ }
+
+ /*
+ * Set verification info for page in private memory.
+ *
+ * This is a simpler version of PageSetVerificationInfoOnCopy(). The more
+ * explicit API allows us to more easily see if we're making the correct call
+ * and reduces the amount of additional code specific to page verification.
+ */
+ void
+ PageSetVerificationInfoInplace(Page page)
+ {
+ if (PageIsNew(page))
+ return;
+
+ if (page_checksums)
+ {
+ PageSetChecksum(page);
+ return;
+ }
+
+ PageResetVersion(page);
+ return;
+ }
+
+ /*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+ static uint16
+ PageCalcChecksum16(Page page)
+ {
+ #define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+ #ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+ #ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+ #endif
+
+ #define COMP_F16(from, to) \
+ do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+ } while (0); \
+
+ #ifdef IGNORE_PAGE_HOLE
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+ #else
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ BLCKSZ - 1);
+ #endif
+
+ #ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+ #endif
+
+ return ((sum2 << 8) | sum1);
+ #endif
+ }
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 830,835 **** static struct config_bool ConfigureNamesBool[] =
--- 830,849 ----
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksum before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 150,164 ****
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG
#------------------------------------------------------------------------------
! # - Settings -
! #wal_level = minimal # minimal, archive, or hot_standby
! # (change requires restart)
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # synchronization level; on, off, or local
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
--- 150,170 ----
#------------------------------------------------------------------------------
! # WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
! # - Reliability -
! #page_checksums = off # calculate checksum before database I/O
! #full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+ # - Write Ahead Log -
+
+ #wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
***************
*** 166,172 ****
# fsync
# fsync_writethrough
# open_sync
- #full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
--- 172,177 ----
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 272,277 **** extern int XLogFileInit(uint32 log, uint32 seg,
--- 272,278 ----
bool *use_existent, bool use_lock);
extern int XLogFileOpen(uint32 log, uint32 seg);
+ extern XLogRecPtr XLogRecordHint(Buffer buffer);
extern void XLogGetLastRemoved(uint32 *log, uint32 *seg);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 61,67 **** typedef struct CheckPoint
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
! #define XLOG_FPW_CHANGE 0x80
/*
--- 61,68 ----
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
! #define XLOG_FPW_CHANGE 0x80
! #define XLOG_HINT 0x90
/*
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
***************
*** 36,41 **** extern void PostPrepare_smgr(void);
--- 36,42 ----
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+ extern void smgr_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 18,23 ****
--- 18,25 ----
#include "storage/item.h"
#include "storage/off.h"
+ extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
***************
*** 93,99 **** typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
! * pd_pagesize_version - size in bytes and page layout version number.
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
--- 95,101 ----
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
! * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
***************
*** 106,112 **** typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
! * The page version number and page size are packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
--- 108,115 ----
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
! * For verification we store either a 16 bit checksum or a watermark of
! * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
***************
*** 130,136 **** typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
! uint16 pd_pagesize_version;
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
--- 133,145 ----
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
!
! union
! {
! uint16 pd_pagesize_version;
! uint16 pd_checksum16;
! } pd_verify; /* page verification data */
!
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
***************
*** 155,161 **** typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
--- 164,205 ----
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
! #define PD_VALID_FLAG_BITS 0x400F /* OR of all non-checksum pd_flags bits */
!
! #define PD_CHECKSUM1 0x0008 /* First checksum bit - should be 1 */
! #define PD_CHECKSUM2 0x8000 /* Second checksum bit - should be 0 */
!
! /*
! * Page layout version number increment bits (still part of flags).
! * Next flag would be called PD_PAGE_VERSION_PLUS2, PD_PAGE_VERSION_PLUS4 etc
! */
! #define PD_PAGE_VERSION_PLUS1 0x4000
!
! /*
! * PD_SET_CHECKSUM is a bitmask of multiple flags
! *
! * When we wish to set a checksum we require exact settings for 3 bits.
! *
! * (1) First, we must set the version correctly, so we set all increment bits.
! *
! * (2) Next we set PD_CHECKSUM2 to be zero, the opposite setting of the increment
! * bits. So any error that causes a run of multiple zeroes or ones to be written
! * to the block will be detected.
! *
! * (3) Next we set PD_CHECKSUM1 to be one, to indicate the block has a checksum.
! * A setting of just the increment bits but no checksum is version-legal but
! * we call that an error also to reduce number of chances for error.
! */
! #define PD_SET_CHECKSUM 0x4008 /* All 3 bits set correctly for checksum */
!
! #define PageHasChecksum(page) \
! ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == PD_SET_CHECKSUM)
!
! #define PageHasNoChecksum(page) \
! ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == 0)
!
! #define PageHasChecksumFlagError(page) \
! (!PageHasChecksum(page) && !PageHasNoChecksum(page))
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 165,173 **** typedef PageHeaderData *PageHeader;
--- 209,238 ----
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses versions 4 and/or 5. Each page may only have one version,
+ * though different pages may have different versions. This is arranged
+ * deliberately to allow us to upgrade from one major version to another.
*/
#define PG_PAGE_LAYOUT_VERSION 4
+ /*
+ * PageGetPageLayoutVersion
+ * Returns the page layout version of a page.
+ *
+ * To calculate the page layout version we take the basic page layout (4)
+ * and increment that according to the incremental bits above. If further
+ * versions of the page are required AND those page versions must be
+ * compatible with v4 and v5 then we would do this by setting new
+ * incremental bits and adding them into the definition below.
+ * Note that this works for all page versions because we don't look at the
+ * actual data field if we have the version increment bits set.
+ * Not that any of the code actually uses this, but someone might.
+ */
+ #define PageGetPageLayoutVersion(page) \
+ ((((PageHeader) (page))->pd_flags & PD_PAGE_VERSION_PLUS1) == PD_PAGE_VERSION_PLUS1 ? \
+ PG_PAGE_LAYOUT_VERSION + 1 : \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version)
+
/* ----------------------------------------------------------------
* page support macros
***************
*** 231,249 **** typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
! * this can only be called on a formatted page (unlike
! * BufferGetPageSize, which can be called on an unformatted page).
! * however, it can be called on a page that is not stored in a buffer.
*/
! #define PageGetPageSize(page) \
! ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
!
! /*
! * PageGetPageLayoutVersion
! * Returns the page layout version of a page.
! */
! #define PageGetPageLayoutVersion(page) \
! (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
--- 296,308 ----
* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)
/*
* PageSetPageSizeAndVersion
***************
*** 251,262 **** typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
)
/* ----------------
--- 310,323 ----
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
! ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
***************
*** 368,374 **** do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageHeaderIsValid(PageHeader page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
--- 429,435 ----
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
! extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
***************
*** 381,385 **** extern Size PageGetExactFreeSpace(Page page);
--- 442,448 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern char *PageSetVerificationInfoOnCopy(Page page);
+ extern void PageSetVerificationInfoInplace(Page page);
#endif /* BUFPAGE_H */
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 168,174 **** typedef struct PGXACT
uint8 vacuumFlags; /* vacuum-related flags, see above */
bool overflowed;
! bool inCommit; /* true if within commit critical section */
uint8 nxids;
} PGXACT;
--- 168,175 ----
uint8 vacuumFlags; /* vacuum-related flags, see above */
bool overflowed;
! bool delayChkpt; /* true if this proc delays checkpoint start */
! /* previously called InCommit */
uint8 nxids;
} PGXACT;
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 52,59 **** extern bool TransactionIdIsActive(TransactionId xid);
extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
extern TransactionId GetOldestActiveTransactionId(void);
! extern int GetTransactionsInCommit(TransactionId **xids_p);
! extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);
extern PGPROC *BackendPidGetProc(int pid);
extern int BackendXidGetPid(TransactionId xid);
--- 52,59 ----
extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
extern TransactionId GetOldestActiveTransactionId(void);
! extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
! extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
extern PGPROC *BackendPidGetProc(int pid);
extern int BackendXidGetPid(TransactionId xid);
A Dijous, 16 de febrer de 2012 12:16:31, Simon Riggs va escriure:
v8 attached
Maybe the docs should include what will happen if the checksum is not correct?
--
Albert Cervera i Areny
http://www.NaN-tic.com
Tel: +34 93 553 18 03
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
v8 attached
It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.
+ LockBufHdr(buf);
+
+ /*
+ * Run PageGetLSN while holding header lock.
+ */
+ recptr = BufferGetLSN(buf);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf->flags &= ~BM_JUST_DIRTIED;
+ UnlockBufHdr(buf);
+
This doesn't seem very helpful. It's obvious even without the comment
that we're running PageGetLSN while holding the header lock. What's
not obvious, and what the comment should be explaining, is why we're
doing that.
+ /*
+ * If we're in recovery we cannot dirty a page because of a hint.
+ * We can set the hint, just not dirty the page as a result so
+ * the hint is lost when we evict the page or shutdown.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (RecoveryInProgress())
+ return;
Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby. Durable hint bits are
one such feature.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will fix.
+ *
+ * It's possible we may enter here without an xid, so it is
+ * essential that CreateCheckpoint waits for virtual transactions
+ * rather than full transactionids.
+ */
+ MyPgXact->delayChkpt = delayChkpt = true;
I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation. Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint. How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?
+ #define PageSetChecksum(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags |= PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+ } while (0);
+
+ /* ensure any older checksum info is overwritten with watermark */
+ #define PageResetVersion(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ if (!PageHasNoChecksum(p)) \
+ { \
+ p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags &= ~PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+ } \
+ } while (0);
So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that
do?
* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)
I think I agree that the current definition of PageGetPageSize seems
unlikely to come anywhere close to allowing us to cope with multiple
page sizes, but I think this method of disabling it is a hack. The
callers that want to know how big the page really is should just use
BLCKSZ instead of this macro, and those that want to know how big the
page THINKS it is (notably contrib/pageinspect) need a way to get that
information.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
v8 attached
It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.
I'm just back home from a few days off grid.
It's possible it doesn't compile against current HEAD, though it
certainly does compile and work against my last git pull.
I will look into your comments in detail tomorrow morning. Thank you
for looking at the patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 17, 2012 at 10:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
v8 attached
It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.I'm just back home from a few days off grid.
It's possible it doesn't compile against current HEAD, though it
certainly does compile and work against my last git pull.I will look into your comments in detail tomorrow morning. Thank you
for looking at the patch.
It does appear there is a chunk of code introduced somewhere between
my testing and sending the patch, so you're right. Looking at the
chunk below it looks like something hit some keys, adding a "]" and
CR, which are adjacent on my keyboard. Trying to develop with 3 kids
and a dog around isn't easy, I guess.
With this chunk removed we're back to the version I've tested - I
think - will carry on checking.
@@ -465,7 +466,9 @@ ReadBuffer_common(SMgrRelation smgr, char
relpersistence, ForkNumber forkNum,
{
/* Only need to adjust flags */
bufHdr->flags |= BM_VALID;
- }
+ [
+
+}
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
For testing, I've been running server with shared_buffers = 16, to
force blocks in and out of cache frequently to stress test the
checksumming.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v8x.patchtext/x-diff; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v8x.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3a84321..7208087 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1712,6 +1712,62 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected.
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ </para>
+
+ <para>
+ When this parameter is off we write only the page version number and
+ blocksize as a standard watermark. If we read in a block that has a
+ checksum yet <varname>page_checksums</varname> is disabled we do not
+ verify the checksum value. Such pages will be reset to the standard
+ watermark if they are rewritten for any reason.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed at any time.
+ </para>
+
+ <para>
+ Turning this parameter off reduces the CPU overhead of reading/writing
+ data from disk though might allow data corruption to go unnoticed.
+ With this parameter enabled there is still a non-zero probability that
+ an error would go undetected.
+ </para>
+
+ <para>
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ Later releases may increase the size of the checksum field and offer
+ alternative checksum or CRC algorithms.
+ </para>
+
+ <para>
+ The default is <literal>off</> for backwards compatibility and
+ to allow upgrade. The recommended setting is <literal>on</> though
+ this should not be enabled until upgrade is successfully complete
+ with full set of new backups.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6b647a8..0bec9a9 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -730,6 +730,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
MemSet(zerobuf, 0, sizeof(zerobuf));
RelationOpenSmgr(rel);
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
return true;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 31b2b67..0668f8f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,6 +266,8 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
+ PageSetVerificationInfoInplace(state->rs_buffer);
+
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
@@ -611,6 +613,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
/* Doesn't fit, so write out the existing page */
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6505e76..7da69d5 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -568,6 +568,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
(char *) pg, false);
vm_nblocks_now++;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 7627738..1cb4cda 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -216,6 +216,7 @@ btbuildempty(PG_FUNCTION_ARGS)
_bt_initmetapage(metapage, P_NONE, 0);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(metapage);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
(char *) metapage, true);
if (XLogIsNeeded())
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9aa3a13..59f086d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,12 +288,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
if (!wstate->btws_zeropage)
wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ /* no need to set page verification info for all zero pages */
smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
wstate->btws_pages_written++,
(char *) wstate->btws_zeropage,
true);
}
+ PageSetVerificationInfoInplace(page);
+
/*
* Now write the page. There's no need for smgr to schedule an fsync for
* this write; we'll do it ourselves before ending the build.
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index cbcf655..bdd6692 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -150,6 +150,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
SpGistInitMetapage(page);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
@@ -159,6 +160,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Likewise for the root page. */
SpGistInitPage(page, SPGIST_LEAF);
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_HEAD_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index de72f71..b415d3c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -527,6 +527,14 @@ associated with the n'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
+Note that we must only use PageSetLSN/PageGetLSN() when we know the action
+is serialised. Only Startup process may modify data blocks during recovery,
+so Startup process may execute PageGetLSN() without fear of serialisation
+problems. All other processes must only call PageSet/GetLSN when holding
+either an exclusive buffer lock or a shared lock plus buffer header lock,
+or be writing the data block directly rather than through shared buffers
+while holding AccessExclusiveLock on the relation.
+
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0..404227d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -320,7 +320,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
proc->lxid = (LocalTransactionId) xid;
pgxact->xid = xid;
pgxact->xmin = InvalidTransactionId;
- pgxact->inCommit = false;
+ pgxact->delayChkpt = false;
pgxact->vacuumFlags = 0;
proc->pid = 0;
proc->backendId = InvalidBackendId;
@@ -1028,18 +1028,18 @@ EndPrepare(GlobalTransaction gxact)
* odds of a PANIC actually occurring should be very tiny given that we
* were able to write the bogus CRC above.
*
- * We have to set inCommit here, too; otherwise a checkpoint starting
+ * We have to set delayChkpt here, too; otherwise a checkpoint starting
* immediately after the WAL record is inserted could complete without
* fsync'ing our state file. (This is essentially the same kind of race
* condition as the COMMIT-to-clog-write case that RecordTransactionCommit
- * uses inCommit for; see notes there.)
+ * uses delayChkpt for; see notes there.)
*
* We save the PREPARE record's location in the gxact for later use by
* CheckPointTwoPhase.
*/
START_CRIT_SECTION();
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
@@ -1087,7 +1087,7 @@ EndPrepare(GlobalTransaction gxact)
* checkpoint starting after this will certainly see the gxact as a
* candidate for fsyncing.
*/
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
@@ -1971,7 +1971,7 @@ RecoverPreparedTransactions(void)
* RecordTransactionCommitPrepared
*
* This is basically the same as RecordTransactionCommit: in particular,
- * we must set the inCommit flag to avoid a race condition.
+ * we must set the delayChkpt flag to avoid a race condition.
*
* We know the transaction made at least one XLOG entry (its PREPARE),
* so it is never possible to optimize out the commit record.
@@ -1994,7 +1994,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
START_CRIT_SECTION();
/* See notes in RecordTransactionCommit */
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
/* Emit the XLOG commit record */
xlrec.xid = xid;
@@ -2059,7 +2059,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
TransactionIdCommitTree(xid, nchildren, children);
/* Checkpoint can proceed now */
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e22bdac..8d7bf3a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -975,13 +975,13 @@ RecordTransactionCommit(void)
* RecordTransactionAbort. That's because loss of a transaction abort
* is noncritical; the presumption would be that it aborted, anyway.
*
- * It's safe to change the inCommit flag of our own backend without
+ * It's safe to change the delayChkpt flag of our own backend without
* holding the ProcArrayLock, since we're the only one modifying it.
- * This makes checkpoint's determination of which xacts are inCommit a
+ * This makes checkpoint's determination of which xacts are delayChkpt a
* bit fuzzy, but it doesn't matter.
*/
START_CRIT_SECTION();
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
SetCurrentTransactionStopTimestamp();
@@ -1155,7 +1155,7 @@ RecordTransactionCommit(void)
*/
if (markXidCommitted)
{
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b128bfd..bfac487 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -733,6 +733,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool doPageWrites;
bool isLogSwitch = false;
bool fpwChange = false;
+ bool isHint = false;
uint8 info_orig = info;
/* cross-check on whether we should be here or not */
@@ -760,6 +761,10 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
fpwChange = true;
break;
+ case XLOG_HINT:
+ isHint = true;
+ break;
+
default:
break;
}
@@ -998,6 +1003,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (isHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -1280,7 +1297,8 @@ XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
/*
* XXX We assume page LSN is first data on *every* page that can be passed
* to XLogInsert, whether it otherwise has the standard page layout or
- * not.
+ * not. We don't need the buffer header lock for PageGetLSN because we
+ * have exclusive lock on the page and/or the relation.
*/
*lsn = PageGetLSN(page);
@@ -3724,6 +3742,12 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * and changing data when we don't have a cleanup lock is bad juju.
+ */
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
@@ -7653,8 +7677,8 @@ CreateCheckPoint(int flags)
uint32 redo_logSeg;
uint32 insert_logId;
uint32 insert_logSeg;
- TransactionId *inCommitXids;
- int nInCommit;
+ VirtualTransactionId *vxids;
+ int nvxids;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -7828,7 +7852,7 @@ CreateCheckPoint(int flags)
/*
* Before flushing data, we must wait for any transactions that are
- * currently in their commit critical sections. If an xact inserted its
+ * currently in commit or hint critical sections. If an xact inserted its
* commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_clog. So
@@ -7844,21 +7868,21 @@ CreateCheckPoint(int flags)
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
- * A transaction that has not yet set inCommit when we look cannot be at
+ * A transaction that has not yet set delayChkpt when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
- nInCommit = GetTransactionsInCommit(&inCommitXids);
- if (nInCommit > 0)
+ vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+ if (nvxids > 0)
{
do
{
pg_usleep(10000L); /* wait for 10 msec */
- } while (HaveTransactionsInCommit(inCommitXids, nInCommit));
+ } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
- pfree(inCommitXids);
+ pfree(vxids);
/*
* Get the other info we need for the checkpoint record.
@@ -8443,6 +8467,51 @@ XLogRestorePoint(const char *rpName)
}
/*
+ * Write a backup block if needed when we are setting a hint. Note that
+ * this may be called for a variety of page types, not just heaps.
+ *
+ * Deciding the "if needed" part is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding just share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define XLOG_HINT_WATERMARK 13579
+XLogRecPtr
+XLogRecordHint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecData rdata[2];
+ int watermark = XLOG_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
+}
+
+/*
* Check if any of the GUC parameters that are critical for hot standby
* have changed, and update the value in pg_control file if necessary.
*/
@@ -8540,8 +8609,8 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in xlog records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ /* Backup blocks are not used in most xlog records */
+ Assert(info == XLOG_HINT || !(record->xl_info & XLR_BKP_BLOCK_MASK));
if (info == XLOG_NEXTOID)
{
@@ -8681,6 +8750,32 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
/* nothing to do here */
}
+ else if (info == XLOG_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == XLOG_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there must be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ *
+ * This also means there is no corresponding API call for this,
+ * so an smgr implementation has no need to implement anything.
+ * Which means nothing is needed in md.c etc
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
@@ -8816,6 +8911,10 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, "restore point: %s", xlrec->rp_name);
}
+ else if (info == XLOG_HINT)
+ {
+ appendStringInfo(buf, "page hint");
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 07dc326..2c957b8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8537,6 +8537,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
smgrread(src, forkNum, blkno, buf);
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (use_wal)
log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1adb6d3..c3e7f43 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -635,14 +636,23 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* victim. We need lock to inspect the page LSN, so this
* can't be done inside StrategyGetBuffer.
*/
- if (strategy != NULL &&
- XLogNeedsFlush(BufferGetLSN(buf)) &&
- StrategyRejectBuffer(strategy, buf))
+ if (strategy != NULL)
{
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(buf->content_lock);
- UnpinBuffer(buf, true);
- continue;
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ LockBufHdr(buf);
+ lsn = BufferGetLSN(buf);
+ UnlockBufHdr(buf);
+
+ if (XLogNeedsFlush(lsn) &&
+ StrategyRejectBuffer(strategy, buf))
+ {
+ /* Drop lock/pin and loop around for another buffer */
+ LWLockRelease(buf->content_lock);
+ UnpinBuffer(buf, true);
+ continue;
+ }
}
/* OK, do the I/O */
@@ -1873,6 +1883,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufToWrite;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1901,12 +1913,22 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
+ LockBufHdr(buf);
+
+ /*
+ * Run PageGetLSN while holding header lock.
+ */
+ recptr = BufferGetLSN(buf);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf->flags &= ~BM_JUST_DIRTIED;
+ UnlockBufHdr(buf);
+
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL
* rule that log updates must hit disk before any of the data-file changes
* they describe do.
*/
- recptr = BufferGetLSN(buf);
XLogFlush(recptr);
/*
@@ -1915,15 +1937,17 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* we have the io_in_progress lock.
*/
- /* To check if block content changes while flushing. - vadim 01/17/97 */
- LockBufHdr(buf);
- buf->flags &= ~BM_JUST_DIRTIED;
- UnlockBufHdr(buf);
+ bufBlock = BufHdrGetBlock(buf);
+
+ bufToWrite = PageSetVerificationInfoOnCopy((Page) bufBlock);
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ bufToWrite,
false);
pgBufferUsage.shared_blks_written++;
@@ -1934,6 +1958,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2326,6 +2352,7 @@ void
SetBufferCommitInfoNeedsSave(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+ Page page = BufferGetPage(buffer);
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
@@ -2354,15 +2381,105 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ XLogRecPtr lsn = {0, 0};
bool dirtied = false;
+ bool delayChkpt = false;
+
+ /*
+ * If we are currently writing checksums, or if the page has a
+ * checksum set then we must check to see if a full page image is
+ * required to protect against torn pages.
+ *
+ * If page_checksums is set then we must write a new checksum
+ * for any change, even a hint. If we write a checksum and we crash
+ * it could leave torn pages on disk. So to protect against that
+ * a hint must be written if this is the first change on the block
+ * since the last checkpoint.
+ *
+ * Even if page_checksums is off, we may also need to write a full
+ * page image. This protects against events that can occur if
+ * page_checksums is turned on and off and then on again. In that case,
+ * if a page has a checksum set and then we write a page
+ * while page_checksums = off a torn page could result in a page
+ * that has a checksums set and yet not match the actual page.
+ * So even if page_checksums = off we write a full page image if the
+ * page is marked as having a checksum, even though we are going
+ * to reset the checksum.
+ *
+ * We don't check full_page_writes here because that can be set
+ * on dynamically during a backup, so even if the usual value is
+ * off we may still need to use a full page image.
+ */
+ if (page_checksums || !PageHasNoChecksum(page))
+ {
+ /*
+ * If we're in recovery we cannot dirty a page because of a hint.
+ * We can set the hint, just not dirty the page as a result so
+ * the hint is lost when we evict the page or shutdown.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (RecoveryInProgress())
+ return;
+
+ /*
+ * If the block is already dirty because we either made a change
+ * or set a hint already, then we don't need to write a full page
+ * image. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might occur
+ * between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will fix.
+ *
+ * It's possible we may enter here without an xid, so it is
+ * essential that CreateCheckpoint waits for virtual transactions
+ * rather than full transactionids.
+ */
+ MyPgXact->delayChkpt = delayChkpt = true;
+ lsn = XLogRecordHint(buffer);
+ }
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
- dirtied = true;
+ {
+ dirtied = true; /* Means "will be dirtied by this action" */
+
+ /*
+ * Set the page LSN if we wrote a backup block. We aren't
+ * supposed to set this when only holding a share lock but
+ * as long as we serialise it somehow we're OK. We choose to
+ * set LSN while holding the buffer header lock, which causes
+ * any reader of an LSN who holds only a share lock to also
+ * obtain a buffer header lock before using PageGetLSN().
+ * Fortunately, thats not too many places.
+ *
+ * If !page_checksums and PageHasChecksum() you might think
+ * we should reset the checksum here. That will happen when
+ * the page is written sometime later in this checkpoint cycle.
+ */
+ if (!XLogRecPtrIsInvalid(lsn))
+ {
+ PageSetLSN(page, lsn);
+ PageSetTLI(page, ThisTimeLineID);
+ }
+ }
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ if (delayChkpt)
+ MyPgXact->delayChkpt = false;
+
if (dirtied)
{
VacuumPageDirty++;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8bda474..67e5e33 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -398,7 +398,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false; /* be sure this is cleared in abort */
+ pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
/* Clear the subtransaction-XID cache too while holding the lock */
@@ -425,7 +425,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false; /* be sure this is cleared in abort */
+ pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
Assert(pgxact->nxids == 0);
@@ -460,7 +460,7 @@ ProcArrayClearTransaction(PGPROC *proc)
/* redundant, but just in case */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false;
+ pgxact->delayChkpt = false;
/* Clear the subtransaction-XID cache too */
pgxact->nxids = 0;
@@ -1723,65 +1723,70 @@ GetOldestActiveTransactionId(void)
}
/*
- * GetTransactionsInCommit -- Get the XIDs of transactions that are committing
+ * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
+ * delaying checkpoint because they have critical actions in progress.
*
- * Constructs an array of XIDs of transactions that are currently in commit
- * critical sections, as shown by having inCommit set in their PGPROC entries.
+ * Constructs an array of VXIDs of transactions that are currently in commit
+ * critical sections, as shown by having delayChkpt set in their PGXACT.
*
- * *xids_p is set to a palloc'd array that should be freed by the caller.
- * The return value is the number of valid entries.
+ * Returns a palloc'd array that should be freed by the caller.
+ * *nvxids is the number of valid entries.
*
- * Note that because backends set or clear inCommit without holding any lock,
+ * Note that because backends set or clear delayChkpt without holding any lock,
* the result is somewhat indeterminate, but we don't really care. Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
- * that setting of inCommit will propagate to shared memory when the backend
- * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
+ * that setting of delayChkpt will propagate to shared memory when the backend
+ * takes a lock, so we cannot fail to see an virtual xact as delayChkpt if
* it's already inserted its commit record. Whether it takes a little while
- * for clearing of inCommit to propagate is unimportant for correctness.
+ * for clearing of delayChkpt to propagate is unimportant for correctness.
*/
-int
-GetTransactionsInCommit(TransactionId **xids_p)
+VirtualTransactionId *
+GetVirtualXIDsDelayingChkpt(int *nvxids)
{
+ VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
- TransactionId *xids;
- int nxids;
+ int count = 0;
int index;
- xids = (TransactionId *) palloc(arrayP->maxProcs * sizeof(TransactionId));
- nxids = 0;
+ /* allocate what's certainly enough result space */
+ vxids = (VirtualTransactionId *)
+ palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- volatile PGXACT *pgxact = &allPgXact[pgprocno];
- TransactionId pxid;
+ volatile PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
- /* Fetch xid just once - see GetNewTransactionId */
- pxid = pgxact->xid;
+ if (pgxact->delayChkpt)
+ {
+ VirtualTransactionId vxid;
- if (pgxact->inCommit && TransactionIdIsValid(pxid))
- xids[nxids++] = pxid;
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ if (VirtualTransactionIdIsValid(vxid))
+ vxids[count++] = vxid;
+ }
}
LWLockRelease(ProcArrayLock);
- *xids_p = xids;
- return nxids;
+ *nvxids = count;
+ return vxids;
}
/*
- * HaveTransactionsInCommit -- Are any of the specified XIDs in commit?
+ * HaveVirtualXIDsDelayingChkpt -- Are any of the specified VXIDs delaying?
*
- * This is used with the results of GetTransactionsInCommit to see if any
- * of the specified XIDs are still in their commit critical sections.
+ * This is used with the results of GetVirtualXIDsDelayingChkpt to see if any
+ * of the specified VXIDs are still in critical sections of code.
*
- * Note: this is O(N^2) in the number of xacts that are/were in commit, but
+ * Note: this is O(N^2) in the number of vxacts that are/were delaying, but
* those numbers should be small enough for it not to be a problem.
*/
bool
-HaveTransactionsInCommit(TransactionId *xids, int nxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
{
bool result = false;
ProcArrayStruct *arrayP = procArray;
@@ -1789,30 +1794,32 @@ HaveTransactionsInCommit(TransactionId *xids, int nxids)
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ while (VirtualTransactionIdIsValid(*vxids))
{
- int pgprocno = arrayP->pgprocnos[index];
- volatile PGXACT *pgxact = &allPgXact[pgprocno];
- TransactionId pxid;
-
- /* Fetch xid just once - see GetNewTransactionId */
- pxid = pgxact->xid;
-
- if (pgxact->inCommit && TransactionIdIsValid(pxid))
+ for (index = 0; index < arrayP->numProcs; index++)
{
- int i;
+ int pgprocno = arrayP->pgprocnos[index];
+ volatile PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+ VirtualTransactionId vxid;
- for (i = 0; i < nxids; i++)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ if (VirtualTransactionIdIsValid(vxid))
{
- if (xids[i] == pxid)
+ if (VirtualTransactionIdEquals(vxid, *vxids) &&
+ pgxact->delayChkpt)
{
result = true;
break;
}
}
- if (result)
- break;
}
+
+ if (result)
+ break;
+
+ /* The virtual transaction is gone now, wait for the next one */
+ vxids++;
}
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2196f51..845fcfb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -356,7 +356,7 @@ InitProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
@@ -514,7 +514,7 @@ InitAuxiliaryProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
MyProc->lwWaitMode = 0;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..475e55b 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static Page pageCopy; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,21 +77,26 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
- /* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
- return true;
+ /*
+ * Don't verify page data unless the page passes basic non-zero test
+ */
+ if (p->pd_lower >= SizeOfPageHeaderData)
+ {
+ /* Check normal case */
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
+ return true;
+ }
/* Check all-zeroes case */
pagebytes = (char *) page;
@@ -827,3 +842,311 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes protect us from that, which are stored in WAL. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on.
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page images does not check verification
+ * info via this function. In recovery, since we only dirty a block when we
+ * have a full page image available if we crash, we are fully OK to use page
+ * verification normally.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory. Note also that the WAL checksum
+ * is a CRC, whereas the page checksum is a Fletcher16 checksum, not a CRC.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data. We also require that a third bit is
+ * zeroed, to protect against a run of 1s being written to the page header.
+ *
+ * This function returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /* Quick exit if nobody cares about checksumming */
+ if (!page_checksums)
+ return true;
+
+ if (PageHasChecksum(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (PageHasNoChecksum(p))
+ {
+ uint16 pd_pagesize_version = BLCKSZ | PG_PAGE_LAYOUT_VERSION;
+
+ if (p->pd_verify.pd_pagesize_version == pd_pagesize_version)
+ return true;
+
+ elog(LOG, "page verification failed - page size and version set incorrectly %u",
+ p->pd_verify.pd_checksum16);
+ }
+ else
+ elog(LOG, "page verification failed - page flags are incorrectly set");
+
+ return false;
+}
+
+#define PageSetChecksum(page) \
+do \
+{ \
+ PageHeader p = (PageHeader) page; \
+ p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags |= PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+} while (0);
+
+/* ensure any older checksum info is overwritten with watermark */
+#define PageResetVersion(page) \
+do \
+{ \
+ PageHeader p = (PageHeader) page; \
+ if (!PageHasNoChecksum(p)) \
+ { \
+ p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags &= ~PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+ } \
+} while (0);
+
+/*
+ * Set verification info for page in shared buffers.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfoOnCopy(Page page)
+{
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ if (pageCopy == NULL)
+ {
+ pageCopy = (Page) malloc(BLCKSZ);
+ if (pageCopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy((char *) pageCopy, (char *) page, BLCKSZ);
+ PageSetChecksum(pageCopy);
+ return (char *) pageCopy;
+ }
+
+ PageResetVersion(page);
+ return (char *) page;
+}
+
+/*
+ * Set verification info for page in private memory.
+ *
+ * This is a simpler version of PageSetVerificationInfoOnCopy(). The more
+ * explicit API allows us to more easily see if we're making the correct call
+ * and reduces the amount of additional code specific to page verification.
+ */
+void
+PageSetVerificationInfoInplace(Page page)
+{
+ if (PageIsNew(page))
+ return;
+
+ if (page_checksums)
+ {
+ PageSetChecksum(page);
+ return;
+ }
+
+ PageResetVersion(page);
+ return;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+#ifdef IGNORE_PAGE_HOLE
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+#else
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ BLCKSZ - 1);
+#endif
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7df5292..6c7421e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksum before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..de02b5f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..4f726f6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -272,6 +272,7 @@ extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
extern int XLogFileOpen(uint32 log, uint32 seg);
+extern XLogRecPtr XLogRecordHint(Buffer buffer);
extern void XLogGetLastRemoved(uint32 *log, uint32 *seg);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 1031e56..445a3e0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -61,7 +61,8 @@ typedef struct CheckPoint
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
-#define XLOG_FPW_CHANGE 0x80
+#define XLOG_FPW_CHANGE 0x80
+#define XLOG_HINT 0x90
/*
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index d5103a8..7592611 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -36,6 +36,7 @@ extern void PostPrepare_smgr(void);
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+extern void smgr_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..0bc76d2 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,42 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x400F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit - should be 1 */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit - should be 0 */
+
+/*
+ * Page layout version number increment bits (still part of flags).
+ * Next flag would be called PD_PAGE_VERSION_PLUS2, PD_PAGE_VERSION_PLUS4 etc
+ */
+#define PD_PAGE_VERSION_PLUS1 0x4000
+
+/*
+ * PD_SET_CHECKSUM is a bitmask of multiple flags
+ *
+ * When we wish to set a checksum we require exact settings for 3 bits.
+ *
+ * (1) First, we must set the version correctly, so we set all increment bits.
+ *
+ * (2) Next we set PD_CHECKSUM2 to be zero, the opposite setting of the increment
+ * bits. So any error that causes a run of multiple zeroes or ones to be written
+ * to the block will be detected.
+ *
+ * (3) Next we set PD_CHECKSUM1 to be one, to indicate the block has a checksum.
+ * A setting of just the increment bits but no checksum is version-legal but
+ * we call that an error also to reduce number of chances for error.
+ */
+#define PD_SET_CHECKSUM 0x4008 /* All 3 bits set correctly for checksum */
+
+#define PageHasChecksum(page) \
+ ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == PD_SET_CHECKSUM)
+
+#define PageHasNoChecksum(page) \
+ ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == 0)
+
+#define PageHasChecksumFlagError(page) \
+ (!PageHasChecksum(page) && !PageHasNoChecksum(page))
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,9 +209,30 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses versions 4 and/or 5. Each page may only have one version,
+ * though different pages may have different versions. This is arranged
+ * deliberately to allow us to upgrade from one major version to another.
*/
#define PG_PAGE_LAYOUT_VERSION 4
+/*
+ * PageGetPageLayoutVersion
+ * Returns the page layout version of a page.
+ *
+ * To calculate the page layout version we take the basic page layout (4)
+ * and increment that according to the incremental bits above. If further
+ * versions of the page are required AND those page versions must be
+ * compatible with v4 and v5 then we would do this by setting new
+ * incremental bits and adding them into the definition below.
+ * Note that this works for all page versions because we don't look at the
+ * actual data field if we have the version increment bits set.
+ * Not that any of the code actually uses this, but someone might.
+ */
+#define PageGetPageLayoutVersion(page) \
+ ((((PageHeader) (page))->pd_flags & PD_PAGE_VERSION_PLUS1) == PD_PAGE_VERSION_PLUS1 ? \
+ PG_PAGE_LAYOUT_VERSION + 1 : \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version)
+
/* ----------------------------------------------------------------
* page support macros
@@ -231,19 +296,13 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
-
-/*
- * PageGetPageLayoutVersion
- * Returns the page layout version of a page.
- */
-#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageSetPageSizeAndVersion
@@ -251,12 +310,14 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
@@ -368,7 +429,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +442,7 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfoOnCopy(Page page);
+extern void PageSetVerificationInfoInplace(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 415c093..ea4a13f 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -168,7 +168,8 @@ typedef struct PGXACT
uint8 vacuumFlags; /* vacuum-related flags, see above */
bool overflowed;
- bool inCommit; /* true if within commit critical section */
+ bool delayChkpt; /* true if this proc delays checkpoint start */
+ /* previously called InCommit */
uint8 nxids;
} PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 0b0aa35..8648bdb 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -52,8 +52,8 @@ extern bool TransactionIdIsActive(TransactionId xid);
extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
extern TransactionId GetOldestActiveTransactionId(void);
-extern int GetTransactionsInCommit(TransactionId **xids_p);
-extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
extern PGPROC *BackendPidGetProc(int pid);
extern int BackendXidGetPid(TransactionId xid);
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
+ /* + * If we're in recovery we cannot dirty a page because of a hint. + * We can set the hint, just not dirty the page as a result so + * the hint is lost when we evict the page or shutdown. + * + * See long discussion in bufpage.c + */ + if (RecoveryInProgress()) + return;Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby. Durable hint bits are
one such feature.
It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.
Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.
+ * Basically, we simply prevent the checkpoint WAL record from + * being written until we have marked the buffer dirty. We don't + * start the checkpoint flush until we have marked dirty, so our + * checkpoint must flush the change to disk successfully or the + * checkpoint never gets written, so crash recovery will fix. + * + * It's possible we may enter here without an xid, so it is + * essential that CreateCheckpoint waits for virtual transactions + * rather than full transactionids. + */ + MyPgXact->delayChkpt = delayChkpt = true;I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation. Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint. How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?
We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.
+ #define PageSetChecksum(page) \ + do \ + { \ + PageHeader p = (PageHeader) page; \ + p->pd_flags |= PD_PAGE_VERSION_PLUS1; \ + p->pd_flags |= PD_CHECKSUM1; \ + p->pd_flags &= ~PD_CHECKSUM2; \ + p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \ + } while (0); + + /* ensure any older checksum info is overwritten with watermark */ + #define PageResetVersion(page) \ + do \ + { \ + PageHeader p = (PageHeader) page; \ + if (!PageHasNoChecksum(p)) \ + { \ + p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \ + p->pd_flags &= ~PD_CHECKSUM1; \ + p->pd_flags &= ~PD_CHECKSUM2; \ + PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \ + } \ + } while (0);So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that
do?
As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.
* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)I think I agree that the current definition of PageGetPageSize seems
unlikely to come anywhere close to allowing us to cope with multiple
page sizes, but I think this method of disabling it is a hack. The
callers that want to know how big the page really is should just use
BLCKSZ instead of this macro, and those that want to know how big the
page THINKS it is (notably contrib/pageinspect) need a way to get that
information.
Fair comment.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 4:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.
I've just finished doing some performance analysis on various aspects
of hint bit setting for this patch.
I've seen as high as 14% of transactions writing full pages during a
pgbench run. That sounds quite bad, but on pgbench at least all of
those are associated with UPDATEs, which would dirty the page anyway,
so there aren't any more full page writes overall.
Checkpoints would be delayed only until a virtual transaction ends or
a virtual transaction comes out of DelayCkpt state. If a virtual
transaction was long running it wouldn't spend much time in the
delaying state, especially if we take into account I/O requirements.
So although I'm not exactly happy with the overheads, they don't seem
to be as big a problem as they sound. Plus this is all optional and
avoidable.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
v8 attached
v10 attached.
This patch covers all the valid concerns discussed and has been
extensively tested.
I expect to commit this in about 24 hours, so last call for comments
please or say if you need more time to review.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16_with_wallogged_hint_bits.v10.patchtext/x-diff; charset=US-ASCII; name=checksum16_with_wallogged_hint_bits.v10.patchDownload
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index f51a4e3..a857a13 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -214,7 +214,7 @@ page_header(PG_FUNCTION_ARGS)
values[3] = UInt16GetDatum(page->pd_lower);
values[4] = UInt16GetDatum(page->pd_upper);
values[5] = UInt16GetDatum(page->pd_special);
- values[6] = UInt16GetDatum(PageGetPageSize(page));
+ values[6] = UInt16GetDatum(PageGetPageSizeField(page));
values[7] = UInt16GetDatum(PageGetPageLayoutVersion(page));
values[8] = TransactionIdGetDatum(page->pd_prune_xid);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0ea9aeb..85795ff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1712,6 +1712,77 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected.
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ </para>
+
+ <para>
+ When this parameter is off we write only the page version number and
+ blocksize as a standard watermark. If we read in a block that has a
+ checksum yet <varname>page_checksums</varname> is disabled we do not
+ verify the checksum value. Such pages will be reset to the standard
+ watermark if they are rewritten for any reason.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ The checksum field on each database page is only set on write, which
+ allows smooth and easy upgrade and transition when enabling
+ <varname>page_checksums</varname>. This means that the database will
+ contain a mix of pages with checksums and pages without checksums.
+ Use <command>VACUUM FREEZE</> with <varname>vacuum_freeze_table_age</>
+ set to zero if you wish to force checksums onto many datablocks,
+ or use <command>VACUUM FULL</> for a more aggressive rewriting of
+ datablocks. This should not be necessary however. A more natural and
+ gradual process of rewriting blocks with checksums will occur naturally
+ as writes occur on the database.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed at any time.
+ </para>
+
+ <para>
+ Turning this parameter off reduces the CPU overhead of reading/writing
+ data from disk though might allow data corruption to go unnoticed.
+ With this parameter enabled there is still a non-zero probability that
+ an error would go undetected. Turning this parameter on will have a
+ CPU cost, as well as a slightly higher WAL overhead in some cases,
+ depending upon the workload.
+ </para>
+
+ <para>
+ The checksum uses 16-bit checksums, using the fast Fletcher 16 algorithm.
+ Later releases may increase the size of the checksum field and offer
+ alternative checksum or CRC algorithms.
+ </para>
+
+ <para>
+ The default is <literal>off</> for backwards compatibility and
+ to allow upgrade. The recommended setting is <literal>on</> though
+ this should not be enabled until upgrade is successfully complete
+ with full set of new backups.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6b647a8..0bec9a9 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -730,6 +730,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
MemSet(zerobuf, 0, sizeof(zerobuf));
RelationOpenSmgr(rel);
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
return true;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 31b2b67..0668f8f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,6 +266,8 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
+ PageSetVerificationInfoInplace(state->rs_buffer);
+
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
@@ -611,6 +613,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
/* Doesn't fit, so write out the existing page */
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (state->rs_use_wal)
log_newpage(&state->rs_new_rel->rd_node,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6505e76..7da69d5 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -568,6 +568,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
+ /* no need to set page verification info for all zero pages */
smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
(char *) pg, false);
vm_nblocks_now++;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 7627738..1cb4cda 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -216,6 +216,7 @@ btbuildempty(PG_FUNCTION_ARGS)
_bt_initmetapage(metapage, P_NONE, 0);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(metapage);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
(char *) metapage, true);
if (XLogIsNeeded())
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9aa3a13..59f086d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,12 +288,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
if (!wstate->btws_zeropage)
wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ /* no need to set page verification info for all zero pages */
smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
wstate->btws_pages_written++,
(char *) wstate->btws_zeropage,
true);
}
+ PageSetVerificationInfoInplace(page);
+
/*
* Now write the page. There's no need for smgr to schedule an fsync for
* this write; we'll do it ourselves before ending the build.
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index cbcf655..bdd6692 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -150,6 +150,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
SpGistInitMetapage(page);
/* Write the page. If archiving/streaming, XLOG it. */
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
@@ -159,6 +160,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Likewise for the root page. */
SpGistInitPage(page, SPGIST_LEAF);
+ PageSetVerificationInfoInplace(page);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_HEAD_BLKNO,
(char *) page, true);
if (XLogIsNeeded())
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index de72f71..b415d3c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -527,6 +527,14 @@ associated with the n'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
+Note that we must only use PageSetLSN/PageGetLSN() when we know the action
+is serialised. Only Startup process may modify data blocks during recovery,
+so Startup process may execute PageGetLSN() without fear of serialisation
+problems. All other processes must only call PageSet/GetLSN when holding
+either an exclusive buffer lock or a shared lock plus buffer header lock,
+or be writing the data block directly rather than through shared buffers
+while holding AccessExclusiveLock on the relation.
+
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3d08e92..fca48bf 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -320,7 +320,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
proc->lxid = (LocalTransactionId) xid;
pgxact->xid = xid;
pgxact->xmin = InvalidTransactionId;
- pgxact->inCommit = false;
+ pgxact->delayChkpt = false;
pgxact->vacuumFlags = 0;
proc->pid = 0;
proc->backendId = InvalidBackendId;
@@ -1028,18 +1028,18 @@ EndPrepare(GlobalTransaction gxact)
* odds of a PANIC actually occurring should be very tiny given that we
* were able to write the bogus CRC above.
*
- * We have to set inCommit here, too; otherwise a checkpoint starting
+ * We have to set delayChkpt here, too; otherwise a checkpoint starting
* immediately after the WAL record is inserted could complete without
* fsync'ing our state file. (This is essentially the same kind of race
* condition as the COMMIT-to-clog-write case that RecordTransactionCommit
- * uses inCommit for; see notes there.)
+ * uses delayChkpt for; see notes there.)
*
* We save the PREPARE record's location in the gxact for later use by
* CheckPointTwoPhase.
*/
START_CRIT_SECTION();
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
@@ -1087,7 +1087,7 @@ EndPrepare(GlobalTransaction gxact)
* checkpoint starting after this will certainly see the gxact as a
* candidate for fsyncing.
*/
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
@@ -1977,7 +1977,7 @@ RecoverPreparedTransactions(void)
* RecordTransactionCommitPrepared
*
* This is basically the same as RecordTransactionCommit: in particular,
- * we must set the inCommit flag to avoid a race condition.
+ * we must set the delayChkpt flag to avoid a race condition.
*
* We know the transaction made at least one XLOG entry (its PREPARE),
* so it is never possible to optimize out the commit record.
@@ -2000,7 +2000,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
START_CRIT_SECTION();
/* See notes in RecordTransactionCommit */
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
/* Emit the XLOG commit record */
xlrec.xid = xid;
@@ -2065,7 +2065,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
TransactionIdCommitTree(xid, nchildren, children);
/* Checkpoint can proceed now */
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e22bdac..8d7bf3a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -975,13 +975,13 @@ RecordTransactionCommit(void)
* RecordTransactionAbort. That's because loss of a transaction abort
* is noncritical; the presumption would be that it aborted, anyway.
*
- * It's safe to change the inCommit flag of our own backend without
+ * It's safe to change the delayChkpt flag of our own backend without
* holding the ProcArrayLock, since we're the only one modifying it.
- * This makes checkpoint's determination of which xacts are inCommit a
+ * This makes checkpoint's determination of which xacts are delayChkpt a
* bit fuzzy, but it doesn't matter.
*/
START_CRIT_SECTION();
- MyPgXact->inCommit = true;
+ MyPgXact->delayChkpt = true;
SetCurrentTransactionStopTimestamp();
@@ -1155,7 +1155,7 @@ RecordTransactionCommit(void)
*/
if (markXidCommitted)
{
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..8f1c97d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -733,6 +733,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool doPageWrites;
bool isLogSwitch = false;
bool fpwChange = false;
+ bool isHint = false;
uint8 info_orig = info;
/* cross-check on whether we should be here or not */
@@ -760,6 +761,10 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
fpwChange = true;
break;
+ case XLOG_HINT:
+ isHint = true;
+ break;
+
default:
break;
}
@@ -998,6 +1003,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (isHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -1280,7 +1297,8 @@ XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
/*
* XXX We assume page LSN is first data on *every* page that can be passed
* to XLogInsert, whether it otherwise has the standard page layout or
- * not.
+ * not. We don't need the buffer header lock for PageGetLSN because we
+ * have exclusive lock on the page and/or the relation.
*/
*lsn = PageGetLSN(page);
@@ -3724,6 +3742,12 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * and changing data when we don't have a cleanup lock is bad juju.
+ */
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
@@ -7664,8 +7688,8 @@ CreateCheckPoint(int flags)
uint32 redo_logSeg;
uint32 insert_logId;
uint32 insert_logSeg;
- TransactionId *inCommitXids;
- int nInCommit;
+ VirtualTransactionId *vxids;
+ int nvxids;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -7839,7 +7863,7 @@ CreateCheckPoint(int flags)
/*
* Before flushing data, we must wait for any transactions that are
- * currently in their commit critical sections. If an xact inserted its
+ * currently in commit or hint critical sections. If an xact inserted its
* commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_clog. So
@@ -7855,21 +7879,32 @@ CreateCheckPoint(int flags)
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
- * A transaction that has not yet set inCommit when we look cannot be at
+ * A transaction that has not yet set delayChkpt when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
- nInCommit = GetTransactionsInCommit(&inCommitXids);
- if (nInCommit > 0)
+ vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+ if (nvxids > 0)
{
+ uint nwaits = 0;
+
do
{
pg_usleep(10000L); /* wait for 10 msec */
- } while (HaveTransactionsInCommit(inCommitXids, nInCommit));
+ nwaits++;
+ } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+
+ /*
+ * We can afford to wait a while before we start getting bothered
+ * about the total waiting time. Anything over 1 sec seems worth
+ * reporting but we aren't expecting to see even this.
+ */
+ if (nwaits > 100)
+ elog(LOG, "checkpoint start delayed by %u ms", (10 * nwaits));
}
- pfree(inCommitXids);
+ pfree(vxids);
/*
* Get the other info we need for the checkpoint record.
@@ -8454,6 +8489,51 @@ XLogRestorePoint(const char *rpName)
}
/*
+ * Write a backup block if needed when we are setting a hint. Note that
+ * this may be called for a variety of page types, not just heaps.
+ *
+ * Deciding the "if needed" part is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding just share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define XLOG_HINT_WATERMARK 13579
+XLogRecPtr
+XLogSaveBufferForHint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecData rdata[2];
+ int watermark = XLOG_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
+}
+
+/*
* Check if any of the GUC parameters that are critical for hot standby
* have changed, and update the value in pg_control file if necessary.
*/
@@ -8551,8 +8631,8 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in xlog records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ /* Backup blocks are not used in most xlog records */
+ Assert(info == XLOG_HINT || !(record->xl_info & XLR_BKP_BLOCK_MASK));
if (info == XLOG_NEXTOID)
{
@@ -8704,6 +8784,32 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
/* nothing to do here */
}
+ else if (info == XLOG_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == XLOG_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there must be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no confict with Hot Standby.
+ *
+ * This also means there is no corresponding API call for this,
+ * so an smgr implementation has no need to implement anything.
+ * Which means nothing is needed in md.c etc
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
@@ -8839,6 +8945,10 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, "restore point: %s", xlrec->rp_name);
}
+ else if (info == XLOG_HINT)
+ {
+ appendStringInfo(buf, "page hint");
+ }
else if (info == XLOG_BACKUP_END)
{
XLogRecPtr startpoint;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 07dc326..2c957b8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8537,6 +8537,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
smgrread(src, forkNum, blkno, buf);
+ PageSetVerificationInfoInplace(page);
+
/* XLOG stuff */
if (use_wal)
log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1adb6d3..fe104d8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -635,14 +636,23 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* victim. We need lock to inspect the page LSN, so this
* can't be done inside StrategyGetBuffer.
*/
- if (strategy != NULL &&
- XLogNeedsFlush(BufferGetLSN(buf)) &&
- StrategyRejectBuffer(strategy, buf))
+ if (strategy != NULL)
{
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(buf->content_lock);
- UnpinBuffer(buf, true);
- continue;
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ LockBufHdr(buf);
+ lsn = BufferGetLSN(buf);
+ UnlockBufHdr(buf);
+
+ if (XLogNeedsFlush(lsn) &&
+ StrategyRejectBuffer(strategy, buf))
+ {
+ /* Drop lock/pin and loop around for another buffer */
+ LWLockRelease(buf->content_lock);
+ UnpinBuffer(buf, true);
+ continue;
+ }
}
/* OK, do the I/O */
@@ -1873,6 +1883,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
+ char *bufToWrite;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1901,12 +1913,23 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
+ LockBufHdr(buf);
+
+ /*
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
+ */
+ recptr = BufferGetLSN(buf);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf->flags &= ~BM_JUST_DIRTIED;
+ UnlockBufHdr(buf);
+
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL
* rule that log updates must hit disk before any of the data-file changes
* they describe do.
*/
- recptr = BufferGetLSN(buf);
XLogFlush(recptr);
/*
@@ -1915,15 +1938,17 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* we have the io_in_progress lock.
*/
- /* To check if block content changes while flushing. - vadim 01/17/97 */
- LockBufHdr(buf);
- buf->flags &= ~BM_JUST_DIRTIED;
- UnlockBufHdr(buf);
+ bufBlock = BufHdrGetBlock(buf);
+
+ bufToWrite = PageSetVerificationInfoOnCopy((Page) bufBlock);
+ /*
+ * bufToWrite is either the shared buffer or a copy, as appropriate.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ bufToWrite,
false);
pgBufferUsage.shared_blks_written++;
@@ -1934,6 +1959,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2326,6 +2353,7 @@ void
SetBufferCommitInfoNeedsSave(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+ Page page = BufferGetPage(buffer);
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
@@ -2354,15 +2382,105 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ XLogRecPtr lsn = {0, 0};
bool dirtied = false;
+ bool delayChkpt = false;
+
+ /*
+ * If we are currently writing checksums, or if the page has a
+ * checksum set then we must check to see if a full page image is
+ * required to protect against torn pages.
+ *
+ * If page_checksums is set then we must write a new checksum
+ * for any change, even a hint. If we write a checksum and we crash
+ * it could leave torn pages on disk. So to protect against that
+ * a hint must be written if this is the first change on the block
+ * since the last checkpoint.
+ *
+ * Even if page_checksums is off, we may also need to write a full
+ * page image. This protects against events that can occur if
+ * page_checksums is turned on and off and then on again. In that case,
+ * if a page has a checksum set and then we write a page
+ * while page_checksums = off a torn page could result in a page
+ * that has a checksums set and yet not match the actual page.
+ * So even if page_checksums = off we write a full page image if the
+ * page is marked as having a checksum because we are going
+ * to reset the checksum.
+ *
+ * We don't check full_page_writes here because that can be set
+ * on dynamically during a backup, so even if the usual value is
+ * off we may still need to use a full page image.
+ */
+ if (page_checksums || !PageHasNoChecksum(page))
+ {
+ /*
+ * If we're in recovery we cannot dirty a page because of a hint.
+ * We can set the hint, just not dirty the page as a result so
+ * the hint is lost when we evict the page or shutdown.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (RecoveryInProgress())
+ return;
+
+ /*
+ * If the block is already dirty because we either made a change
+ * or set a hint already, then we don't need to write a full page
+ * image. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might occur
+ * between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will fix.
+ *
+ * It's possible we may enter here without an xid, so it is
+ * essential that CreateCheckpoint waits for virtual transactions
+ * rather than full transactionids.
+ */
+ MyPgXact->delayChkpt = delayChkpt = true;
+ lsn = XLogSaveBufferForHint(buffer);
+ }
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
- dirtied = true;
+ {
+ dirtied = true; /* Means "will be dirtied by this action" */
+
+ /*
+ * Set the page LSN if we wrote a backup block. We aren't
+ * supposed to set this when only holding a share lock but
+ * as long as we serialise it somehow we're OK. We choose to
+ * set LSN while holding the buffer header lock, which causes
+ * any reader of an LSN who holds only a share lock to also
+ * obtain a buffer header lock before using PageGetLSN().
+ * Fortunately, thats not too many places.
+ *
+ * If !page_checksums and PageHasChecksum() you might think
+ * we should reset the checksum here. That will happen when
+ * the page is written sometime later in this checkpoint cycle.
+ */
+ if (!XLogRecPtrIsInvalid(lsn))
+ {
+ PageSetLSN(page, lsn);
+ PageSetTLI(page, ThisTimeLineID);
+ }
+ }
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ if (delayChkpt)
+ MyPgXact->delayChkpt = false;
+
if (dirtied)
{
VacuumPageDirty++;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 09b7311..be70240 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -398,7 +398,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false; /* be sure this is cleared in abort */
+ pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
/* Clear the subtransaction-XID cache too while holding the lock */
@@ -425,7 +425,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
pgxact->xmin = InvalidTransactionId;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false; /* be sure this is cleared in abort */
+ pgxact->delayChkpt = false; /* be sure this is cleared in abort */
proc->recoveryConflictPending = false;
Assert(pgxact->nxids == 0);
@@ -460,7 +460,7 @@ ProcArrayClearTransaction(PGPROC *proc)
/* redundant, but just in case */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
- pgxact->inCommit = false;
+ pgxact->delayChkpt = false;
/* Clear the subtransaction-XID cache too */
pgxact->nxids = 0;
@@ -1741,65 +1741,70 @@ GetOldestActiveTransactionId(void)
}
/*
- * GetTransactionsInCommit -- Get the XIDs of transactions that are committing
+ * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
+ * delaying checkpoint because they have critical actions in progress.
*
- * Constructs an array of XIDs of transactions that are currently in commit
- * critical sections, as shown by having inCommit set in their PGPROC entries.
+ * Constructs an array of VXIDs of transactions that are currently in commit
+ * critical sections, as shown by having delayChkpt set in their PGXACT.
*
- * *xids_p is set to a palloc'd array that should be freed by the caller.
- * The return value is the number of valid entries.
+ * Returns a palloc'd array that should be freed by the caller.
+ * *nvxids is the number of valid entries.
*
- * Note that because backends set or clear inCommit without holding any lock,
+ * Note that because backends set or clear delayChkpt without holding any lock,
* the result is somewhat indeterminate, but we don't really care. Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
- * that setting of inCommit will propagate to shared memory when the backend
- * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
+ * that setting of delayChkpt will propagate to shared memory when the backend
+ * takes a lock, so we cannot fail to see an virtual xact as delayChkpt if
* it's already inserted its commit record. Whether it takes a little while
- * for clearing of inCommit to propagate is unimportant for correctness.
+ * for clearing of delayChkpt to propagate is unimportant for correctness.
*/
-int
-GetTransactionsInCommit(TransactionId **xids_p)
+VirtualTransactionId *
+GetVirtualXIDsDelayingChkpt(int *nvxids)
{
+ VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
- TransactionId *xids;
- int nxids;
+ int count = 0;
int index;
- xids = (TransactionId *) palloc(arrayP->maxProcs * sizeof(TransactionId));
- nxids = 0;
+ /* allocate what's certainly enough result space */
+ vxids = (VirtualTransactionId *)
+ palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- volatile PGXACT *pgxact = &allPgXact[pgprocno];
- TransactionId pxid;
+ volatile PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
- /* Fetch xid just once - see GetNewTransactionId */
- pxid = pgxact->xid;
+ if (pgxact->delayChkpt)
+ {
+ VirtualTransactionId vxid;
- if (pgxact->inCommit && TransactionIdIsValid(pxid))
- xids[nxids++] = pxid;
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ if (VirtualTransactionIdIsValid(vxid))
+ vxids[count++] = vxid;
+ }
}
LWLockRelease(ProcArrayLock);
- *xids_p = xids;
- return nxids;
+ *nvxids = count;
+ return vxids;
}
/*
- * HaveTransactionsInCommit -- Are any of the specified XIDs in commit?
+ * HaveVirtualXIDsDelayingChkpt -- Are any of the specified VXIDs delaying?
*
- * This is used with the results of GetTransactionsInCommit to see if any
- * of the specified XIDs are still in their commit critical sections.
+ * This is used with the results of GetVirtualXIDsDelayingChkpt to see if any
+ * of the specified VXIDs are still in critical sections of code.
*
- * Note: this is O(N^2) in the number of xacts that are/were in commit, but
+ * Note: this is O(N^2) in the number of vxacts that are/were delaying, but
* those numbers should be small enough for it not to be a problem.
*/
bool
-HaveTransactionsInCommit(TransactionId *xids, int nxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
{
bool result = false;
ProcArrayStruct *arrayP = procArray;
@@ -1807,30 +1812,32 @@ HaveTransactionsInCommit(TransactionId *xids, int nxids)
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ while (VirtualTransactionIdIsValid(*vxids))
{
- int pgprocno = arrayP->pgprocnos[index];
- volatile PGXACT *pgxact = &allPgXact[pgprocno];
- TransactionId pxid;
-
- /* Fetch xid just once - see GetNewTransactionId */
- pxid = pgxact->xid;
-
- if (pgxact->inCommit && TransactionIdIsValid(pxid))
+ for (index = 0; index < arrayP->numProcs; index++)
{
- int i;
+ int pgprocno = arrayP->pgprocnos[index];
+ volatile PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+ VirtualTransactionId vxid;
- for (i = 0; i < nxids; i++)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ if (VirtualTransactionIdIsValid(vxid))
{
- if (xids[i] == pxid)
+ if (VirtualTransactionIdEquals(vxid, *vxids) &&
+ pgxact->delayChkpt)
{
result = true;
break;
}
}
- if (result)
- break;
}
+
+ if (result)
+ break;
+
+ /* The virtual transaction is gone now, wait for the next one */
+ vxids++;
}
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2196f51..845fcfb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -356,7 +356,7 @@ InitProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
@@ -514,7 +514,7 @@ InitAuxiliaryProcess(void)
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
- MyPgXact->inCommit = false;
+ MyPgXact->delayChkpt = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
MyProc->lwWaitMode = 0;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..1c17ecc 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static Page pageCopy; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,21 +77,26 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
- /* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
- return true;
+ /*
+ * Don't verify page data unless the page passes basic non-zero test
+ */
+ if (p->pd_lower >= SizeOfPageHeaderData)
+ {
+ /* Check normal case */
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
+ return true;
+ }
/* Check all-zeroes case */
pagebytes = (char *) page;
@@ -827,3 +842,311 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding that means that WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes protect us from that, which are stored in WAL. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on.
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page images does not check verification
+ * info via this function. In recovery, since we only dirty a block when we
+ * have a full page image available if we crash, we are fully OK to use page
+ * verification normally.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering the shared
+ * buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data re-entering
+ * PostgreSQL controlled memory. Note also that the WAL checksum is a
+ * 32-bit CRC, whereas the page checksum is a Fletcher16 checksum, not a CRC.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data. We also require that a third bit is
+ * zeroed, to protect against a run of 1s being written to the page header.
+ *
+ * This function returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /* Quick exit if nobody cares about checksumming */
+ if (!page_checksums)
+ return true;
+
+ if (PageHasChecksum(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (PageHasNoChecksum(p))
+ {
+ uint16 pd_pagesize_version = BLCKSZ | PG_PAGE_LAYOUT_VERSION;
+
+ if (p->pd_verify.pd_pagesize_version == pd_pagesize_version)
+ return true;
+
+ elog(LOG, "page verification failed - page size and version set incorrectly %u",
+ p->pd_verify.pd_checksum16);
+ }
+ else
+ elog(LOG, "page verification failed - page flags are incorrectly set");
+
+ return false;
+}
+
+#define PageSetChecksum(page) \
+do \
+{ \
+ PageHeader p = (PageHeader) page; \
+ p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags |= PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+} while (0);
+
+/* ensure any older checksum info is overwritten with watermark */
+#define PageResetVersion(page) \
+do \
+{ \
+ PageHeader p = (PageHeader) page; \
+ if (!PageHasNoChecksum(p)) \
+ { \
+ p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags &= ~PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+ } \
+} while (0);
+
+/*
+ * Set verification info for page in shared buffers.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfoOnCopy(Page page)
+{
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ if (pageCopy == NULL)
+ {
+ pageCopy = (Page) malloc(BLCKSZ);
+ if (pageCopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy((char *) pageCopy, (char *) page, BLCKSZ);
+ PageSetChecksum(pageCopy);
+ return (char *) pageCopy;
+ }
+
+ PageResetVersion(page);
+ return (char *) page;
+}
+
+/*
+ * Set verification info for page in private memory.
+ *
+ * This is a simpler version of PageSetVerificationInfoOnCopy(). The more
+ * explicit API allows us to more easily see if we're making the correct call
+ * and reduces the amount of additional code specific to page verification.
+ */
+void
+PageSetVerificationInfoInplace(Page page)
+{
+ if (PageIsNew(page))
+ return;
+
+ if (page_checksums)
+ {
+ PageSetChecksum(page);
+ return;
+ }
+
+ PageResetVersion(page);
+ return;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. We do this in 3 steps, first
+ * we calculate the checksum for the header, avoiding the verification
+ * info, which will be added afterwards. Next, we add the line pointers up to
+ * the hole in the middle of the block at pd_lower. Last, we add the tail
+ * of the page from pd_upper to the end of page.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Flecther's 16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0); \
+
+#ifdef IGNORE_PAGE_HOLE
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ - 1);
+#else
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ BLCKSZ - 1);
+#endif
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7df5292..6c7421e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksum before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will be ignored, even if the block was marked "
+ "with checksum. When disabled checksums will not be added to database blocks.")
+ },
+ &page_checksums,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..de02b5f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..5f18d4d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -272,6 +272,7 @@ extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
extern int XLogFileOpen(uint32 log, uint32 seg);
+extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer);
extern void XLogGetLastRemoved(uint32 *log, uint32 *seg);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 1031e56..445a3e0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -61,7 +61,8 @@ typedef struct CheckPoint
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
-#define XLOG_FPW_CHANGE 0x80
+#define XLOG_FPW_CHANGE 0x80
+#define XLOG_HINT 0x90
/*
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..722cdb5 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,42 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x400F /* OR of all non-checksum pd_flags bits */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit - should be 1 */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit - should be 0 */
+
+/*
+ * Page layout version number increment bits (still part of flags).
+ * Next flag would be called PD_PAGE_VERSION_PLUS2, PD_PAGE_VERSION_PLUS4 etc
+ */
+#define PD_PAGE_VERSION_PLUS1 0x4000
+
+/*
+ * PD_SET_CHECKSUM is a bitmask of multiple flags
+ *
+ * When we wish to set a checksum we require exact settings for 3 bits.
+ *
+ * (1) First, we must set the version correctly, so we set all increment bits.
+ *
+ * (2) Next we set PD_CHECKSUM2 to be zero, the opposite setting of the increment
+ * bits. So any error that causes a run of multiple zeroes or ones to be written
+ * to the block will be detected.
+ *
+ * (3) Next we set PD_CHECKSUM1 to be one, to indicate the block has a checksum.
+ * A setting of just the increment bits but no checksum is version-legal but
+ * we call that an error also to reduce number of chances for error.
+ */
+#define PD_SET_CHECKSUM 0x4008 /* All 3 bits set correctly for checksum */
+
+#define PageHasChecksum(page) \
+ ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == PD_SET_CHECKSUM)
+
+#define PageHasNoChecksum(page) \
+ ((((PageHeader) (page))->pd_flags & PD_SET_CHECKSUM) == 0)
+
+#define PageHasChecksumFlagError(page) \
+ (!PageHasChecksum(page) && !PageHasNoChecksum(page))
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,9 +209,29 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses versions 4 and/or 5. Each page may only have one version,
+ * though different pages may have different versions. This is arranged
+ * deliberately to allow us to upgrade from one major version to another.
*/
#define PG_PAGE_LAYOUT_VERSION 4
+/*
+ * PageGetPageLayoutVersion
+ * Returns the page layout version of a page.
+ *
+ * To calculate the page layout version we take the basic page layout (4)
+ * and increment that according to the incremental bits above. If further
+ * versions of the page are required AND those page versions must be
+ * compatible with v4 and v5 then we would do this by setting new
+ * incremental bits and adding them into the definition below.
+ * Note that this works for all page versions because we don't look at the
+ * actual data field if we have the version increment bits set.
+ */
+#define PageGetPageLayoutVersion(page) \
+ ((((PageHeader) (page))->pd_flags & PD_PAGE_VERSION_PLUS1) == PD_PAGE_VERSION_PLUS1 ? \
+ PG_PAGE_LAYOUT_VERSION + 1 : \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version - BLCKSZ)
+
/* ----------------------------------------------------------------
* page support macros
@@ -231,19 +295,19 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
+
+#define PageGetPageSizeField(page) \
+ ((((PageHeader) (page))->pd_flags & PD_PAGE_VERSION_PLUS1) == PD_PAGE_VERSION_PLUS1 ? \
+ (BLCKSZ) : \
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0xFF00))
-/*
- * PageGetPageLayoutVersion
- * Returns the page layout version of a page.
- */
-#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,12 +315,14 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
/* ----------------
@@ -368,7 +434,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +447,7 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfoOnCopy(Page page);
+extern void PageSetVerificationInfoInplace(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 415c093..ea4a13f 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -168,7 +168,8 @@ typedef struct PGXACT
uint8 vacuumFlags; /* vacuum-related flags, see above */
bool overflowed;
- bool inCommit; /* true if within commit critical section */
+ bool delayChkpt; /* true if this proc delays checkpoint start */
+ /* previously called InCommit */
uint8 nxids;
} PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 0b0aa35..8648bdb 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -52,8 +52,8 @@ extern bool TransactionIdIsActive(TransactionId xid);
extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
extern TransactionId GetOldestActiveTransactionId(void);
-extern int GetTransactionsInCommit(TransactionId **xids_p);
-extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
extern PGPROC *BackendPidGetProc(int pid);
extern int BackendXidGetPid(TransactionId xid);
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby. Durable hint bits are
one such feature.It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.
To me, it seems that you are applying a double standard. You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles. But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it. Of course,
this feature, like those, CAN be made to work. It just involves
solving difficult problems that have little to do with the primary
purpose of the patch. To be honest, I don't use Hot Standby enough to
care very much about this particular issue, and I'm not saying we
should reject it on these grounds. But I do think it's a mistake to
dismiss it entirely, since every limitation of Hot Standby narrows the
set of cases in which it can be deployed. And at any rate, I want the
same standard applied to my work as to yours.
I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation. Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint. How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.
Ah... good point.
So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that
do?As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.
I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.
That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.
Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.
I also think that the question of where exactly the checksum ought to
be put might bear some more thought. Tom expressed some concern about
stealing the page version field, and it seems to me we could work
around that by instead stealing something less valuable. The top
eight bits of the pd_pagesize_version field probably meet that
criteria, since in practice basically everybody uses 8K blocks, and
the number of errors that will be caught by checksums is probably much
larger than the number of errors that will be caught by the page-size
cross-check. But maybe the other 8 bits should come from somewhere
else, maybe pd_tli or pd_flags. For almost all practical purposes,
storing only the low-order 8 bits of the TLI ought to provide just as
much of a safety check as storing the low-order 16 bits, so I think
that might be the way to go. We could set it up so that we always
check only the low 8 bits against the TLI, regardless of whether
checksums are enabled; then we basically free up that bit space at no
cost in code complexity.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
To me, it seems that you are applying a double standard. You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles. But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it. Of course,
this feature, like those, CAN be made to work.
Vitriol aside, If you would be so kind as to explain how it is
possible, as you claim, I'll look into making it work.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.
I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw and error.
That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.
You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.
The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.
While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.
Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.
You can read every block and check.
If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.
We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.
It's not practical as the only mechanism, given the requirement for upgrade.
As I mention in the docs, if you want that, use VACUUM FULL. So there
is a mechanism if you want it.
I also think that the question of where exactly the checksum ought to
be put might bear some more thought. Tom expressed some concern about
stealing the page version field, and it seems to me we could work
around that by instead stealing something less valuable. The top
eight bits of the pd_pagesize_version field probably meet that
criteria, since in practice basically everybody uses 8K blocks, and
the number of errors that will be caught by checksums is probably much
larger than the number of errors that will be caught by the page-size
cross-check. But maybe the other 8 bits should come from somewhere
else, maybe pd_tli or pd_flags. For almost all practical purposes,
storing only the low-order 8 bits of the TLI ought to provide just as
much of a safety check as storing the low-order 16 bits, so I think
that might be the way to go. We could set it up so that we always
check only the low 8 bits against the TLI, regardless of whether
checksums are enabled; then we basically free up that bit space at no
cost in code complexity.
The version stuff is catered for by the current patch.
TLI isn't something I want to mess with. It does actually mean
something, unlike the page version field which carries no appreciable
information.
People want checksums. This is the best way I've come up with of
giving it to them, given all of the constraints and requests imposed
upon me. I have no doubt that a later release will redesign the page
format and do a better job, while at the same time allowing an online
upgrade mechanism to allow block changes. But that is at least 1 year
away and experience shows that is often much longer than that. I'll
buy you a $DRINK if that happens in the next release.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 6:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
To me, it seems that you are applying a double standard. You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles. But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it. Of course,
this feature, like those, CAN be made to work.Vitriol aside, If you would be so kind as to explain how it is
possible, as you claim, I'll look into making it work.
It would require a double-write buffer of some kind.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw and error.
Sure, but who is to say that a typical error will look anything like
that? Anyway, you could check even more bits just as easily; we know
exactly which ones have a plausible reason for being non-zero.
That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.
I think we could do worse than this patch, but I don't really believe
it's ready for commit. We don't have a single performance number
showing how much of a performance regression this causes, either on
the master or on the standby, on any workload, much less those where a
problem is reasonably forseeable from the design you've chosen. Many
controversial aspects of the patch, such as the way you're using the
buffer header spinlocks as a surrogate for x-locking the buffer, or
the right way of obtaining the bit-space the patch requires, haven't
really been discussed, and to the extent that they have been
discussed, they have not been agreed.
On the former point, you haven't updated
src/backend/storage/buffer/README at all; but updating is not by
itself sufficient. Before we change the buffer-locking rules, we
ought to have some discussion of whether it's OK to do that, and why
it's necessary. "Why it's necessary" would presumably include
demonstrating that the performance of the straightforward
implementation stinks, and that with this change is better. You
haven't made any effort to do that whatsoever, or if you have, you
haven't posted the results here. I'm pretty un-excited by that
change, first because I think it's a modularity violation and possibly
unsafe, and second because I believe we've already got a problem with
buffer header spinlock contention which this might exacerbate.
Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.You can read every block and check.
Using what tool?
If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.
Easy upgrade and removal of checksums are unrelated issues AFAICS.
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.It's not practical as the only mechanism, given the requirement for upgrade.
Why not?
The version stuff is catered for by the current patch.
Your patch reuses the version number field for an unrelated purpose
and includes some vague hints of how we might do versioning using some
of the page-level flag bits. Since bit-space is fungible, I think it
makes more sense to keep the version number where it is and carve the
checksum out of whatever bit space you would have moved the version
number into. Whether that's pd_tli or pd_flags is somewhat secondary;
the point is that you can certainly arrange to leave the 8 bits that
currently contain the version number as they are.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 20, 2012 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think we could do worse than this patch, but I don't really believe
it's ready for commit. We don't have a single performance number
showing how much of a performance regression this causes, either on
the master or on the standby, on any workload, much less those where a
problem is reasonably forseeable from the design you've chosen. Many
controversial aspects of the patch, such as the way you're using the
buffer header spinlocks as a surrogate for x-locking the buffer, or
the right way of obtaining the bit-space the patch requires, haven't
really been discussed, and to the extent that they have been
discussed, they have not been agreed.
These points are being discussed here. If you have rational
objections, say what they are.
On the former point, you haven't updated
src/backend/storage/buffer/README at all;
I've updated the appropriate README file which relates to page LSNs to explain.
but updating is not by
itself sufficient. Before we change the buffer-locking rules, we
ought to have some discussion of whether it's OK to do that, and why
it's necessary. "Why it's necessary" would presumably include
demonstrating that the performance of the straightforward
implementation stinks, and that with this change is better.
What straightforward implementation is that?? This *is* the straightforward one.
God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.
And do you really think that would be faster?
You
haven't made any effort to do that whatsoever, or if you have, you
haven't posted the results here. I'm pretty un-excited by that
change, first because I think it's a modularity violation and possibly
unsafe, and second because I believe we've already got a problem with
buffer header spinlock contention which this might exacerbate.Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.You can read every block and check.
Using what tool?
Pageinspect?
If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.Easy upgrade and removal of checksums are unrelated issues AFAICS.
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.It's not practical as the only mechanism, given the requirement for upgrade.
Why not?
The version stuff is catered for by the current patch.
Your patch reuses the version number field for an unrelated purpose
and includes some vague hints of how we might do versioning using some
of the page-level flag bits. Since bit-space is fungible, I think it
makes more sense to keep the version number where it is and carve the
checksum out of whatever bit space you would have moved the version
number into. Whether that's pd_tli or pd_flags is somewhat secondary;
the point is that you can certainly arrange to leave the 8 bits that
currently contain the version number as they are.
If you've got some rational objections, please raise them.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
What straightforward implementation is that?? This *is* the straightforward one.
God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.And do you really think that would be faster?
I don't know, but neither do you, because you apparently haven't tried
it. Games where we drop the shared lock and get an exclusive lock
are used in numerous places in the source code; see, for example,
heap_update(), so I don't believe that there is any reason to reject
that a priori. Indeed, I can think of at least one pretty good reason
to do it exactly that way: it's the way we've handled all previous
problems of this type, and in general it's better to make new code
look like existing code than to invent something new, especially when
you haven't made any effort to quantify the problem or the extent to
which the proposed solution mitigates it.
If you've got some rational objections, please raise them.
Look, I think that the objections I have been raising are rational.
YMMV, and obviously does. The bottom line here is that I can't stop
you from committing this patch, and we both know that. And, really,
at the end of the day, I have no desire to keep you from committing
this patch, even if I had the power to do so, because I agree that the
feature has merit. But I can object to it and, as the patch stands
now, I do object to it, for the following specific reasons:
1. You haven't posted any meaningful performance test results, of
either normal cases or worst cases.
2. You're fiddling with the buffer locking in a way that I'm very
doubtful about without having tried any of the alternatives.
3. You're rearranging the page header in a way that I find ugly and baroque.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/20/12 5:57 AM, Robert Haas wrote:
3. You're rearranging the page header in a way that I find ugly and baroque.
Guys, are we talking about an on-disk format change? If so, this may
need to be kicked out of 9.2 ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 2/20/12 5:57 AM, Robert Haas wrote:
3. You're rearranging the page header in a way that I find ugly and baroque.
Guys, are we talking about an on-disk format change? If so, this may
need to be kicked out of 9.2 ...
Yes, we are. Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch than we WILL
be stuck with this for a very long time.
The fact is that, thus far, so real effort has been made by anyone to
evict anything from the current CommitFest. I did that for the last
two cycles, but as the minutes for last year's development meeting
succinctly observed "Haas ... doesn't like being the schedule jerk".
My resolve to be less of a schedule jerk is being sorely tested,
though: aside from this patch, which has its share of issues, there's
also Alvaro's foreign key stuff, which probably needs a lot more
review than it has so far gotten, and several large patches from
Dimitri, which do cool stuff but need a lot more work, and Peter
Geoghegan's pg_stat_statements patch, which is awesome functionality
but sounds like it's still a little green, and parallel pg_dump, which
is waiting on a rebase after some refactoring I did to simplify
things, and pgsql_fdw, and Heikki's xlog insert scaling patch, which
fortunately seems to be in the capable hands of Fujii Masao and Jeff
Janes, but certainly isn't ready to go today. Any of these
individually could probably eat up a solid week of my time, and then
there are the other 40 as-yet-undealt-with patches.
I said five weeks ago that it probably wouldn't hurt anything to let
the CommitFest run six weeks, and Tom opined that it would probably
require two months. But the sad reality is that, after five weeks,
we're not even half done with this CommitFest. That's partly because
most people did not review as much code as they submitted, partly
because a bunch of people played fast and loose with the submission
deadline, and partly because we didn't return any patches that still
needed major rehashing after the first round of review, leading to a
situation where supposedly code-complete patches are showing up for
the first time four or five weeks after the submission deadline. Now
none of that is the fault of this patch specifically: honestly, if I
had to pick just two more patches to get committed before beta, this
would probably be one of them. But that doesn't make me happy with
where it's at today, not to mention the fact that there are people who
submitted their code a lot earlier who still haven't gotten the
attention this patch has, which is still less than the attention a
patch of this magnitude probably needs.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Guys, are we talking about an on-disk format change? If so, this may
need to be kicked out of 9.2 ...Yes, we are. Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch than we WILL
be stuck with this for a very long time.
Yeah. I'd personally prefer to boot this to 9.3, then. It's not like
there's not enough features for 9.2, and I really don't want this
feature to cause 5 others to be delayed.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Feb 20, 2012 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 2/20/12 5:57 AM, Robert Haas wrote:
3. You're rearranging the page header in a way that I find ugly and baroque.
Guys, are we talking about an on-disk format change? If so, this may
need to be kicked out of 9.2 ...Yes, we are. Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch than we WILL
be stuck with this for a very long time.
No, we aren't. The format is not changed, though there are 3 new bits added.
The format is designed to be extensible and we have room for 10 more
bits, which allows 1024 new page formats, which at current usage rate
should last us for approximately 5000 years of further development.
Previously, we were limited to 251 more page formats, so this new
mechanism is more future proof than the last. Alternatively you might
say that we have dropped from 1271 page formats to only 1024, which is
hardly limiting.
This has already been discussed and includes points made by Bruce and
Heikki in the final design. The "more work" required is not described
clearly, nor has anyone but RH requested it. In terms of messier, RH's
only alternate suggestion is to split the checksum into multiple
pieces, which I don't think yer average reader would call less ugly,
less messy or more future proof.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.
Yes, pg_upgrade makes this hard. If you are using pg_dump to restore,
and set the checksum GUC before you do the restore, and never turn it
off, then you will have a fully checksum'ed database.
If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL. It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed.
Yeah, crazy, but it might be enough.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 20, 2012 at 11:09 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.Yes, pg_upgrade makes this hard. If you are using pg_dump to restore,
and set the checksum GUC before you do the restore, and never turn it
off, then you will have a fully checksum'ed database.If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL. It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed.
Yeah, crazy, but it might be enough.
Well, I didn't say VACUUM FULL was the only clean way of knowing
whether every block is checksummed, its a very intrusive way.
If you want a fast upgrade with pg_upgrade, rewriting every block is
not really a grand plan, but if you want it...
If we did that, I think I would prefer to do it with these commands
VACUUM ENABLE CHECKSUM; //whole database only
VACUUM DISABLE CHECKSUM;
rather than use a GUC. We can then add an option to pg_upgrade.
That way, we scan whole database, adding checksums and then record it
in pg_database
When we remove it, we do same thing in reverse then record it.
So there's no worries about turning on/off GUCs and we know for
certain where our towel is.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 20, 2012 at 11:23:42PM +0000, Simon Riggs wrote:
If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL. �It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed.
Yeah, crazy, but it might be enough.Well, I didn't say VACUUM FULL was the only clean way of knowing
whether every block is checksummed, its a very intrusive way.If you want a fast upgrade with pg_upgrade, rewriting every block is
not really a grand plan, but if you want it...If we did that, I think I would prefer to do it with these commands
VACUUM ENABLE CHECKSUM; //whole database only
VACUUM DISABLE CHECKSUM;rather than use a GUC. We can then add an option to pg_upgrade.
That way, we scan whole database, adding checksums and then record it
in pg_databaseWhen we remove it, we do same thing in reverse then record it.
So there's no worries about turning on/off GUCs and we know for
certain where our towel is.
Yes, that would work, but I suspect most users are going to want to
enable checksums when they are seeing some odd behavior and want to test
the hardware. Requiring them to rewrite the database to do that is
making the feature less useful, and once they have the problem fixed,
they might want to turn it off again.
Now, one argument is that they could enable checksums, see no checksum
errors, but still be getting checksum failures from pages that were
written before checksum was enabled. I think we just need to document
that checksum only works for writes performed _after_ the feature is
enabled.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote:
On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
What straightforward implementation is that?? This *is* the straightforward one.
God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.And do you really think that would be faster?
I don't know, but neither do you, because you apparently haven't tried
it. Games where we drop the shared lock and get an exclusive lock
are used in numerous places in the source code; see, for example,
heap_update(), so I don't believe that there is any reason to reject
that a priori. Indeed, I can think of at least one pretty good reason
to do it exactly that way: it's the way we've handled all previous
problems of this type, and in general it's better to make new code
look like existing code than to invent something new, especially when
you haven't made any effort to quantify the problem or the extent to
which the proposed solution mitigates it.
We do, in numerous places, drop a shared buffer content lock and reacquire it
in exclusive mode. However, the existing users of that pattern tend to trade
the lock, complete subsequent work, and unlock the buffer all within the same
function. So they must, because several of them recheck facts that can change
during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers
do not adhere to that pattern; they can be quite distant from the original
lock acquisition and eventual lock release. Take the prototypical case of
SetHintBits(). Our shared lock originated in a place like heapgettup(), which
expects to hold it continually until finished. We would need to somehow push
up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
lock, then have heapgettup() reorient itself as needed. Every caller of code
that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
Perhaps there's a different way to apply the lock-trade technique that avoids
this mess, but I'm not seeing it. Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one. Within that umbrella, some details need attention:
- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note
gistScanPage() and XLogCheckBuffer()[1]Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The comment is true today, but the same patch makes it false by having XLogSaveBufferForHint() call XLogInsert() under a share lock.. (Perhaps we'll only require the
spinlock for heap buffers, letting gistScanPage() off the hook.) We need a
public API, perhaps LockBufferForLSN().
- The use of some spinlock need not imply using the buffer header spinlock.
We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual
trade-off of splitting a lock: less contention at the cost of more
acquisitions. I have no intuition on which approach would perform better.
I agree that src/backend/storage/buffer/README is the place to document the
new locking rules.
I do share your general unease about adding new twists to the buffer access
rules. Some of our hairiest code is also the code manipulating buffer locks
most extensively, and I would not wish to see that code get even more
difficult to understand. However, I'm not seeing a credible alternative that
retains the same high-level behaviors.
A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby. Then there's no new
WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits? Does anyone recall the result?
Thanks,
nm
[1]: Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The comment is true today, but the same patch makes it false by having XLogSaveBufferForHint() call XLogInsert() under a share lock.
comment is true today, but the same patch makes it false by having
XLogSaveBufferForHint() call XLogInsert() under a share lock.
On Tue, Feb 21, 2012 at 10:07 AM, Noah Misch <noah@leadboat.com> wrote:
Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one.
OK
Within that umbrella, some details need attention:
- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note
gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the
spinlock for heap buffers, letting gistScanPage() off the hook.) We need a
public API, perhaps LockBufferForLSN().
Yep, I checked all the call points previously. gistScanPage() and also
gistbulkdelete() would be need changing, but since GIST doesn't use
hints, there is no need for locking. I'll document that better.
- The use of some spinlock need not imply using the buffer header spinlock.
We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual
trade-off of splitting a lock: less contention at the cost of more
acquisitions. I have no intuition on which approach would perform better.
Will think about that and try a few ideas.
I agree that src/backend/storage/buffer/README is the place to document the
new locking rules.
OK, I'll move the comments.
I do share your general unease about adding new twists to the buffer access
rules. Some of our hairiest code is also the code manipulating buffer locks
most extensively, and I would not wish to see that code get even more
difficult to understand. However, I'm not seeing a credible alternative that
retains the same high-level behaviors.
Yes, those changes not made lightly and with much checking.
A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby. Then there's no new
WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits? Does anyone recall the result?
I was thinking exactly that myself. I'll add a GUC to test.
[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The
comment is true today, but the same patch makes it false by having
XLogSaveBufferForHint() call XLogInsert() under a share lock.
OK, thats an issue, well spotted. Will fix.
Thanks for thorough review.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
do?As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.
PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
"(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0". Explicitly naming another bit
and keeping it unset is redundant with that existing check. It would cease to
be redundant if we ever allocate all the flag bits, but then we also wouldn't
have a bit to spare as PD_CHECKSUM2. I agree with you on this point.
That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.
I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward. Did you have a particular
scenario in mind?
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.
This patch implies future opportunities to flesh out its UI, and I don't see
it locking us out of implementing the above. We'll want a weak-lock command
that adds checksums to pages lacking them. We'll want to expose whether a
given relation has full checksum coverage. With that, we could produce an
error when an apparently-non-checksummed page appears in a relation previously
known to have full checksum coverage.
Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
tool for enabling or disabling them, it's helpful to have each page declare
whether it has a checksum. That way, the rewrite can take as little as an
AccessShareLock, and any failure need not lose work already done. If you rely
on anticipating the presence of a checksum based on a pg_class column, that
ALTER needs to perform an atomic rewrite.
On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch <noah@leadboat.com> wrote:
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
do?As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
"(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0". Explicitly naming another bit
and keeping it unset is redundant with that existing check. It would cease to
be redundant if we ever allocate all the flag bits, but then we also wouldn't
have a bit to spare as PD_CHECKSUM2. I agree with you on this point.
No problem to remove that. It was there at Heikki's request.
That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward. Did you have a particular
scenario in mind?
Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.
I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.This patch implies future opportunities to flesh out its UI, and I don't see
it locking us out of implementing the above.
Agreed
We'll want a weak-lock command
that adds checksums to pages lacking them.
VACUUM FREEZE will do this.
We'll want to expose whether a
given relation has full checksum coverage.
Not sure I understand why we'd want that. If you really want that, its
a simple function to do it.
With that, we could produce an
error when an apparently-non-checksummed page appears in a relation previously
known to have full checksum coverage.
Additional checking at relation level is just going to slow things
down. Don't see the value of having relation level checksum option.
It would be an easy thing to skip checksums on UNLOGGED relations, and
possibly desirable, but I don't see the utility of a per relation
checksum setting in other cases. There's just no need for fine tuning
like that.
Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
tool for enabling or disabling them, it's helpful to have each page declare
whether it has a checksum. That way, the rewrite can take as little as an
AccessShareLock, and any failure need not lose work already done. If you rely
on anticipating the presence of a checksum based on a pg_class column, that
ALTER needs to perform an atomic rewrite.
That will pretty much kill the usability of this feature. We must be
able to add checksums without taking a full lock on the table and/or
database. Making it an ALTER TABLE would require huge maintenance
effort to add checksums to a database.
-1 to the idea of having ALTER TABLE variant to enable/disable checksums
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <noah@leadboat.com> wrote:
We do, in numerous places, drop a shared buffer content lock and reacquire it
in exclusive mode. However, the existing users of that pattern tend to trade
the lock, complete subsequent work, and unlock the buffer all within the same
function. So they must, because several of them recheck facts that can change
during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers
do not adhere to that pattern; they can be quite distant from the original
lock acquisition and eventual lock release. Take the prototypical case of
SetHintBits(). Our shared lock originated in a place like heapgettup(), which
expects to hold it continually until finished. We would need to somehow push
up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
lock, then have heapgettup() reorient itself as needed. Every caller of code
that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
Perhaps there's a different way to apply the lock-trade technique that avoids
this mess, but I'm not seeing it. Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one. Within that umbrella, some details need attention:
Fair analysis.
- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note
gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the
spinlock for heap buffers, letting gistScanPage() off the hook.) We need a
public API, perhaps LockBufferForLSN().
That seems like a good idea. I am a little worried about Simon's plan
to do the additional locking only for AMs that have no unlogged-type
updates. It's a reasonable performance optimization, and skipping the
locking when checksums are shut off also seems reasonable, but it
seems a bit fragile: suppose that, in the future, someone fixes GiST
to do something intelligent with kill_prior_tuple. Now suddenly it
needs LSN locking that it didn't need before, but this fact might not
be very obvious. It might be a good idea to design LockBufferForLSN
to take an AM OID as an argument, and use the AM OID to decide whether
the locking is required. That way, we can call it from everywhere
that reads an LSN, and it can simply skip the locking in places where
it isn't presently needed.
Or maybe there's a better design, but I agree with you that some kind
of public API is essential.
- The use of some spinlock need not imply using the buffer header spinlock.
We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual
trade-off of splitting a lock: less contention at the cost of more
acquisitions. I have no intuition on which approach would perform better.
I think I like the idea of a separate lock, but I agree it could stand
to be tested both ways. Another thought is that we might add
SpinLockConditionalAcquire(), that just tries to TAS the lock and
returns false if the lock is already held. Then, we could grab the
spinlock while writing the page, in lieu of copying it, and anyone who
wants to set a hint bit can conditionally acquire the spinlock long
enough to set the bits. If the spinlock isn't immediately free then
instead of spinning they just don't set the hint bits after all. That
way you can be sure that no hint bits are set during the write, but
any attempt to set a hint bit only costs you a single TAS - no loop,
as with a regular spinlock acquisition - and of course the hint
itself.
A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby. Then there's no new
WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits? Does anyone recall the result?
It slows things down substantially in the case of repeated scans of
the same unchaning table. In fact, I tried a much more nuanced
approach: set BM_UNTIDY on the page when
SetBufferCommitInfoNeedsSave() is called. BM_UNTIDY pages are written
by the background writer and during checkpoints, and 5% of the time by
backends. All of that has the salutary effect of making the first
sequential scan less ungodly slow, but it also has the less-desirable
effect of making the next 19 of them still kinda-sorta slow. It was
unclear to me (and perhaps to others) whether it really worked out to
a win.
However, we might need/want to revisit some of that logic in
connection with this patch, because it seems to me that a large
sequential scan of an unhinted table could be ferociously slow, with
the backend writing out its own pages and WAL-logging them as it goes.
It would be good to test that to figure out whether some mitigation
measures are needed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I decided that it would be worth benchmarking this patch.
Specifically, I tested:
master, as a basis of comparison
checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
This test was performed using pgbench-tools. At different client
counts and scaling factors "1,10,100", performance of an update.sql
workload was tested.
This benchmark used Greg Smith's "toy" server. As he put it recently:
"The main change to the 8 hyperthreaded core test server (Intel
i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
effectively doubles the scale I can reach before things slow
dramatically." However, while Greg used scientific Linux for his
recent batch of performance numbers, the box was booted into Debian
for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
separate disk for WAL.
I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
additional precaution of initdb'ing for each set, lest there be some
kind of contamination between sets, which necessitated doing some
additional work since I couldn't very well expect the "results"
database to persist. Different sets of figures from different runs
where dumped and restored, before finally being combined so that
pgbench-tools could produce it's regular report.
I have attached a config for pgbench-tools, so that others may
recreate my work here. I also attach the most relevant image,
comparing each test set across scaling levels. I'll make a pdf of the
full report available if that would be useful.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
scaling-sets-page-checksums.pngimage/png; name=scaling-sets-page-checksums.pngDownload
�PNG
IHDR �Mq PLTE��� ���� � ��� � ���@ �� Ai��� �@���0`�� @� ������**�� @�� 333MMMfff�������������������22�������U��������������� � d �"�".�W � �p � ���� ��� � �����P����E ��r��z�����k������� �� ���� �����P@Uk/� ��@�@��`��`��� ��@��@��`��p������������������������|�@�� �����K IDATx����b���a ���r�fH���Su����$���{�cu7&� ��������Q�
��"