crash-safe visibility map, take four

Started by Robert Haas · almost 15 years ago · 17 messages
#1Robert Haas
robertmhaas@gmail.com
1 attachment(s)

On Wed, Dec 1, 2010 at 11:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

As far as I can tell, there are basically two viable solutions on the
table here.

1. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
bit in the visibility map page, bumping the LSN as usual, and (c) emit
a WAL record indicating the relation and block number.  On redo of
this record, set both the page-level bit and the visibility map bit.
The heap page may hit the disk before the WAL record, but that's OK;
it just might result in a little extra work until some subsequent
operation gets the visibility map bit set.  The visibility map page
may hit the disk before the heap page, but that's OK too, because
the WAL record will already be on disk due to the LSN interlock.  If a
crash occurs before the heap page is flushed, redo will fix the heap
page.  (The heap page will get flushed as part of the next checkpoint,
if not sooner, so by the time the redo pointer advances past the WAL
record, there's no longer a risk.)

2. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the
bit in the visibility map page, bumping the LSN if a WAL record is
issued (which only happens sometimes, read on), and (c) emit a WAL
record indicating the "chunk" of 128 visibility map bits which
contains the bit we just set - but only if we're now dealing with a
new group of 128 visibility map bits or if a checkpoint has intervened
since the last such record we emitted.  On redo of this record, clear
the visibility map bits in each chunk.  The heap page may hit the disk
before the WAL record, but that's OK for the same reasons as in plan
#1.  The visibility map page may hit the disk before the heap page,
but that's OK too, because the WAL record will already be on disk due
to the LSN interlock.  If a crash occurs before the heap page makes
it to disk, then redo will clear the visibility map bits, leaving them
to be reset by a subsequent VACUUM.

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
Patch attached. It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something. Feel free to point and snicker in the latter case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

visibility-map-v1.patch (application/octet-stream)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 4044,4049 **** log_heap_freeze(Relation reln, Buffer buffer,
--- 4044,4081 ----
  }
  
  /*
+  * Perform XLogInsert for a heap-visible operation.	 'block' is the block
+  * being marked all-visible, and vm_buffer is the buffer containing the
+  * corresponding visibility map block.  Both should have already been modified
+  * and dirtied.
+  */
+ XLogRecPtr
+ log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer)
+ {
+ 	xl_heap_visible xlrec;
+ 	XLogRecPtr	recptr;
+ 	XLogRecData rdata[2];
+ 
+ 	xlrec.node = rnode;
+ 	xlrec.block = block;
+ 
+ 	rdata[0].data = (char *) &xlrec;
+ 	rdata[0].len = SizeOfHeapVisible;
+ 	rdata[0].buffer = InvalidBuffer;
+ 	rdata[0].next = &(rdata[1]);
+ 
+ 	rdata[1].data = NULL;
+ 	rdata[1].len = 0;
+ 	rdata[1].buffer = vm_buffer;
+ 	rdata[1].buffer_std = false;
+ 	rdata[1].next = NULL;
+ 
+ 	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE, rdata);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Perform XLogInsert for a heap-update operation.	Caller must already
   * have modified the buffer(s) and marked them dirty.
   */
***************
*** 4331,4336 **** heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
--- 4363,4440 ----
  	UnlockReleaseBuffer(buffer);
  }
  
+ /*
+  * Replay XLOG_HEAP2_VISIBLE record.
+  *
+  * The critical integrity requirement here is that we must never end up with
+  * a situation where the visibility map bit is set, and the page-level
+  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
+  * page modification would fail to clear the visibility map bit.
+  */
+ static void
+ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
+ {
+ 	xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
+ 	Buffer		buffer;
+ 	Page		page;
+ 
+ 	/*
+ 	 * Read the heap page, if it still exists.  If the heap file has been
+ 	 * dropped or truncated later in recovery, this might fail.  In that case,
+ 	 * there's no point in doing anything further, since the visibility map
+ 	 * will have to be cleared out at the same time.
+ 	 */
+ 	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
+ 									RBM_NORMAL);
+ 	if (!BufferIsValid(buffer))
+ 		return;
+ 	page = (Page) BufferGetPage(buffer);
+ 
+ 	/*
+ 	 * We don't bump the LSN of the heap page when setting the visibility
+ 	 * map bit, because that would generate an unworkable volume of
+ 	 * full-page writes.  This exposes us to torn page hazards, but since
+ 	 * we're not inspecting the existing page contents in any way, we
+ 	 * don't care.
+ 	 *
+ 	 * However, all operations that clear the visibility map bit *do* bump
+ 	 * the LSN, and those operations will only be replayed if the XLOG LSN
+ 	 * precedes the page LSN.  Thus, if the page LSN has advanced past our
+ 	 * XLOG record's LSN, we mustn't mark the page all-visible, because
+ 	 * the subsequent update won't be replayed to clear the flag.
+ 	 */
+ 	if (XLByteLT(PageGetLSN(page), lsn))
+ 	{
+ 		PageSetAllVisible(page);
+ 		MarkBufferDirty(buffer);
+ 	}
+ 
+ 	/* Done with heap page. */
+ 	UnlockReleaseBuffer(buffer);
+ 
+ 	/*
+ 	 * Even if we skipped the heap page update due to the LSN interlock, it's
+ 	 * still safe to update the visibility map.  Any WAL record that clears
+ 	 * the visibility map bit does so before checking the page LSN, so any
+ 	 * bits that need to be cleared will still be cleared.
+ 	 */
+ 	if (record->xl_info & XLR_BKP_BLOCK_1)
+ 		RestoreBkpBlocks(lsn, record, false);
+ 	else
+ 	{
+ 		Relation	reln;
+ 		Buffer		vmbuffer = InvalidBuffer;
+ 
+ 		reln = CreateFakeRelcacheEntry(xlrec->node);
+ 		visibilitymap_pin(reln, xlrec->block, &vmbuffer);
+ 		/* Don't set the bit if replay has already passed this point. */
+ 		if (XLByteLT(PageGetLSN(BufferGetPage(vmbuffer)), lsn))
+ 			visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
+ 		ReleaseBuffer(vmbuffer);
+ 		FreeFakeRelcacheEntry(reln);
+ 	}
+ }
+ 
  static void
  heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
  {
***************
*** 4923,4928 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record)
--- 5027,5035 ----
  		case XLOG_HEAP2_CLEANUP_INFO:
  			heap_xlog_cleanup_info(lsn, record);
  			break;
+ 		case XLOG_HEAP2_VISIBLE:
+ 			heap_xlog_visible(lsn, record);
+ 			break;
  		default:
  			elog(PANIC, "heap2_redo: unknown op code %u", info);
  	}
***************
*** 5052,5057 **** heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 5159,5172 ----
  		appendStringInfo(buf, "cleanup info: remxid %u",
  						 xlrec->latestRemovedXid);
  	}
+ 	else if (info == XLOG_HEAP2_VISIBLE)
+ 	{
+ 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
+ 
+ 		appendStringInfo(buf, "visible: rel %u/%u/%u; blk %u",
+ 						 xlrec->node.spcNode, xlrec->node.dbNode,
+ 						 xlrec->node.relNode, xlrec->block);
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 76,94 ****
   * index would still be visible to all other backends, and deletions wouldn't
   * be visible to other backends yet.  (But HOT breaks that argument, no?)
   *
-  * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
-  * vacuum observes that all tuples are visible to all, it sets the flag on
-  * the heap page, and also sets the bit in the visibility map. If we then
-  * crash, and only the visibility map page was flushed to disk, we'll have
-  * a bit set in the visibility map, but the corresponding flag on the heap
-  * page is not set. If the heap page is then updated, the updater won't
-  * know to clear the bit in the visibility map.  (Isn't that prevented by
-  * the LSN interlock?)
-  *
   *-------------------------------------------------------------------------
   */
  #include "postgres.h"
  
  #include "access/visibilitymap.h"
  #include "storage/bufmgr.h"
  #include "storage/bufpage.h"
--- 76,86 ----
   * index would still be visible to all other backends, and deletions wouldn't
   * be visible to other backends yet.  (But HOT breaks that argument, no?)
   *
   *-------------------------------------------------------------------------
   */
  #include "postgres.h"
  
+ #include "access/heapam.h"
  #include "access/visibilitymap.h"
  #include "storage/bufmgr.h"
  #include "storage/bufpage.h"
***************
*** 196,208 **** visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
  /*
   *	visibilitymap_set - set a bit on a previously pinned page
   *
!  * recptr is the LSN of the heap page. The LSN of the visibility map page is
!  * advanced to that, to make sure that the visibility map doesn't get flushed
!  * to disk before the update to the heap page that made all tuples visible.
   *
!  * This is an opportunistic function. It does nothing, unless *buf
!  * contains the bit for heapBlk. Call visibilitymap_pin first to pin
!  * the right map page. This function doesn't do any I/O.
   */
  void
  visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
--- 188,201 ----
  /*
   *	visibilitymap_set - set a bit on a previously pinned page
   *
!  * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
!  * or InvalidXLogRecPtr in normal running.  The page LSN is advanced to the
!  * one provided; in normal running, we generate a new XLOG record and set the
!  * page LSN to that value.
   *
!  * You must pass a buffer containing the correct map page to this function.
!  * Call visibilitymap_pin first to pin the right one. This function doesn't do
!  * any I/O.
   */
  void
  visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
***************
*** 220,226 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
  
  	/* Check that we have the right page pinned */
  	if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
! 		return;
  
  	page = BufferGetPage(*buf);
  	map = PageGetContents(page);
--- 213,219 ----
  
  	/* Check that we have the right page pinned */
  	if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
! 		elog(ERROR, "wrong buffer passed to visibilitymap_set");
  
  	page = BufferGetPage(*buf);
  	map = PageGetContents(page);
***************
*** 230,236 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
  	{
  		map[mapByte] |= (1 << mapBit);
  
! 		if (XLByteLT(PageGetLSN(page), recptr))
  			PageSetLSN(page, recptr);
  		PageSetTLI(page, ThisTimeLineID);
  		MarkBufferDirty(*buf);
--- 223,232 ----
  	{
  		map[mapByte] |= (1 << mapBit);
  
! 		if (XLogRecPtrIsInvalid(recptr) && RelationNeedsWAL(rel))
! 			recptr = log_heap_visible(rel->rd_node, heapBlk, *buf);
! 
! 		if (!XLogRecPtrIsInvalid(recptr))
  			PageSetLSN(page, recptr);
  		PageSetTLI(page, ThisTimeLineID);
  		MarkBufferDirty(*buf);
*** a/src/backend/access/transam/transam.c
--- b/src/backend/access/transam/transam.c
***************
*** 24,29 ****
--- 24,31 ----
  #include "access/transam.h"
  #include "utils/snapmgr.h"
  
+ /* Handy constant for an invalid xlog recptr */
+ const XLogRecPtr InvalidXLogRecPtr = {0, 0};
  
  /*
   * Single-item cache for results of TransactionLogFetch.  It's worth having
***************
*** 35,43 **** static TransactionId cachedFetchXid = InvalidTransactionId;
  static XidStatus cachedFetchXidStatus;
  static XLogRecPtr cachedCommitLSN;
  
- /* Handy constant for an invalid xlog recptr */
- static const XLogRecPtr InvalidXLogRecPtr = {0, 0};
- 
  /* Local functions */
  static XidStatus TransactionLogFetch(TransactionId transactionId);
  
--- 37,42 ----
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5410,5416 **** exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
  {
  	char		recoveryPath[MAXPGPATH];
  	char		xlogpath[MAXPGPATH];
- 	XLogRecPtr	InvalidXLogRecPtr = {0, 0};
  
  	/*
  	 * We are no longer in archive recovery state.
--- 5410,5415 ----
***************
*** 7997,8004 **** CreateRestartPoint(int flags)
  	if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
  		XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
  	{
- 		XLogRecPtr	InvalidXLogRecPtr = {0, 0};
- 
  		ereport(DEBUG2,
  				(errmsg("skipping restartpoint, already performed at %X/%X",
  				  lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
--- 7996,8001 ----
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 479,485 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  				visibilitymap_pin(onerel, blkno, &vmbuffer);
  				LockBuffer(buf, BUFFER_LOCK_SHARE);
  				if (PageIsAllVisible(page))
! 					visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
  				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			}
  
--- 479,486 ----
  				visibilitymap_pin(onerel, blkno, &vmbuffer);
  				LockBuffer(buf, BUFFER_LOCK_SHARE);
  				if (PageIsAllVisible(page))
! 					visibilitymap_set(onerel, blkno, InvalidXLogRecPtr,
! 									  &vmbuffer);
  				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			}
  
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 136,141 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
--- 136,143 ----
  extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
  				TransactionId cutoff_xid,
  				OffsetNumber *offsets, int offcnt);
+ extern XLogRecPtr log_heap_visible(RelFileNode rnode, BlockNumber block,
+ 				 Buffer vm_buffer);
  extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
  			BlockNumber blk, Page page);
  
*** a/src/include/access/htup.h
--- b/src/include/access/htup.h
***************
*** 606,611 **** typedef HeapTupleData *HeapTuple;
--- 606,612 ----
  #define XLOG_HEAP2_CLEAN		0x10
  /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
  #define XLOG_HEAP2_CLEANUP_INFO 0x30
+ #define XLOG_HEAP2_VISIBLE		0x40
  
  /*
   * All what we need to find changed tuple
***************
*** 750,755 **** typedef struct xl_heap_freeze
--- 751,765 ----
  
  #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_xid) + sizeof(TransactionId))
  
+ /* This is what we need to know about setting a visibility map bit */
+ typedef struct xl_heap_visible
+ {
+ 	RelFileNode node;
+ 	BlockNumber block;
+ } xl_heap_visible;
+ 
+ #define SizeOfHeapVisible (offsetof(xl_heap_visible, block) + sizeof(BlockNumber))
+ 
  extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
  									   TransactionId *latestRemovedXid);
  
*** a/src/include/access/transam.h
--- b/src/include/access/transam.h
***************
*** 135,140 **** extern bool TransactionStartedDuringRecovery(void);
--- 135,143 ----
  /* in transam/varsup.c */
  extern PGDLLIMPORT VariableCache ShmemVariableCache;
  
+ /* in transam/transam.c */
+ extern const XLogRecPtr InvalidXLogRecPtr;
+ 
  
  /*
   * prototypes for functions in transam/transam.c
#2Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Robert Haas (#1)
Re: crash-safe visibility map, take four

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
Patch attached. It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something. Feel free to point and snicker in the latter case.

Hi,

I suppose the problem is not with setting the bit, but resetting the
bit. Has that been completed already?

Thanks.

#3Robert Haas
robertmhaas@gmail.com
In reply to: Gokulakannan Somasundaram (#2)
Re: crash-safe visibility map, take four

On Tue, Mar 22, 2011 at 11:59 PM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
 Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.

Hi,
    I suppose the problem is not with setting the bit, but resetting the
bit. Has that been completed already?
Thanks.

All operations that clear the bit are already WAL-logged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Jesper Krogh
jesper@krogh.cc
In reply to: Robert Haas (#1)
Re: crash-safe visibility map, take four

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
Patch attached. It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something. Feel free to point and snicker in the latter case.

Looks simple, but there is no benefit on the usage side in the patch yet,
so it isn't really "testable"? I would love to spend some time testing
when it's doable (even with rough corners).

I'm still a bit puzzled about how it would end up working with a page-level
visibility map bit for index scans. There is a clear drop-off in usability
when the change rate of the table goes up, which may or may not be
relevant, but I cannot really judge, since I haven't even got a ballpark
figure for how much table churn would disable, say, 50% of the usage.

= Really naive suggestion approaching =
Another layout might be to simply drag out t_xmin, t_xmax per row (8 bytes)
into a table by itself. This table would be way bigger than the
one-bit-per-page map, but could be WAL-logged like any other change in the
system?

It would, by definition, make the visibility testing work (way faster than
today), no matter how fast the underlying table changes.

State of today (PG 8.4) is that a query like this:
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
count
-------
69753
(1 row)

Time: 5863.600 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
count
-------
69753
(1 row)

Time: 659.832 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
count
-------
69753
(1 row)

Time: 1005.765 ms

Somewhere around 15ns / tuple (not bad at all).
(the first was probably "half warm")

The average number of rows per page is somewhere between 4 and 8 for this
table. Assuming 8, and that the 69K matches are randomly distributed among
the 16M other tuples (a fair assumption in this case), the 600-1000ms for
the cached runs is the time to touch 69753 * 8192 (page size) = 571MB of
pages for visibility testing alone, all pages being in main memory.
Packing 16M tuples at 8 bytes / tuple into a map would be around 128MB.

Given 8 bytes/row and random distribution of data, that would require us
to read all 128MB, so a speedup of x4 in this example, but it would
roughly let us count the entire table in the same time.

With regard to disk vs. memory hotness: those 128MB compare to a table
size of 32GB (with a TOAST table next to it of 64GB), but that shouldn't
be touched by the above query.

The ns/tuple number (today) on a "thin" table in my system is
approaching 1ns / tuple.

If the page-level bitmap would be set "quite fast" on a fairly busy system
anyway, then the above is just noise, but I currently have no feel for it,
and there is some math in there I have trouble putting reliable ballpark
numbers on.

There is, by all approaches, room for significant improvement in
visibility testing for a huge range of installations.

Can I drag numbers on the "frozenness" of tuples out of my current
systems to fill in the discussion? (How?)

Jesper
--
Jesper

#5Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Robert Haas (#3)
Re: crash-safe visibility map, take four

All operations that clear the bit are already WAL-logged.

Is that the case with the visibility map also?

Thanks.

#6Robert Haas
robertmhaas@gmail.com
In reply to: Gokulakannan Somasundaram (#5)
Re: crash-safe visibility map, take four

On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:

All operations that clear the bit are already WAL-logged.

Is that the case with the visibility map also?
Thanks.

Yes. Look at the comment that the patch removes. That describes the
problem being fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Krogh (#4)
Re: crash-safe visibility map, take four

On Wed, Mar 23, 2011 at 2:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
 Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.

Looks simple, but there is no benefit on the usage side in the patch yet,
so it isn't really "testable"? I would love to spend some time testing
when it's doable (even with rough corners).

What it probably needs right now is some crash testing - insert a
database panic at various points in the code and then check whether
the state after recovery is still OK. Also some code review from
people who understand recovery better than me. *waves to Heikki*

There's a lot more work that will have to be done before this starts
to produce user-visible performance benefits, and then a lot more work
after that before we've exhausted all the possibilities. I can't cope
with all that right now. This is basic infrastructure, that will
eventually enable a variety of cool stuff, but isn't particularly sexy
by itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Merlin Moncure
mmoncure@gmail.com
In reply to: Jesper Krogh (#4)
Re: crash-safe visibility map, take four

On Wed, Mar 23, 2011 at 1:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
 Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.

Looks simple, but there is no benefit on the usage side in the patch yet,
so it isn't really "testable"? I would love to spend some time testing
when it's doable (even with rough corners).

I'm still a bit puzzled with how it would end up working with a page-level
visibillity map bit for index-scans. There is a clear "drop off" in
usabillity
when the change rates of the table goes up, which may or may not be
relevant, but I cannot really judge, since I haven't even got a ballpark
figure about how much table churn would disable say 50% of the usage.

How much benefit you are going to get is going to be really workload
dependent. In a lot of cases the distribution of writes is really
non-uniform, so that a small percentage of records get the majority of
the writes across the database generally. Reliable PD_ALL_VISIBLE opens
the door to optimizing around this pattern, which I'd estimate the vast
majority of databases follow to varying degrees.

It's really hard to overemphasize how important, in performance terms,
the features that mitigate the relative downsides of our MVCC
implementation are. The HOT feature in 8.3 was an absolute breakthrough
for Postgres performance, and I expect this will open similar doors.

merlin

#9Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Robert Haas (#6)
Re: crash-safe visibility map, take four

Yeah, I looked at it. I don't think it addresses the problem raised here:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php

Or maybe I am missing something.

Thanks.

On Wed, Mar 23, 2011 at 7:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:

All operations that clear the bit are already WAL-logged.

Is that the case with the visibility map also?
Thanks.

Yes. Look at the comment that the patch removes. That describes the
problem being fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Robert Haas
robertmhaas@gmail.com
In reply to: Gokulakannan Somasundaram (#9)
Re: crash-safe visibility map, take four

On Thu, Mar 24, 2011 at 12:56 AM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:

Yeah, I looked at it. I don't think it addresses the problem raised here:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
Or maybe I am missing something.

Yeah, I think you're right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Jesper Krogh (#4)
Re: crash-safe visibility map, take four

On Wed, Mar 23, 2011 at 6:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
 Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.

Looks simple, but there is now benefit...

Your tests and discussion remind me that I haven't yet seen any tests
that show that index-only scans would be useful for performance.

Everyone just seems to be assuming that they make a huge difference,
and that the difference is practically realisable in a common
workload.

Perhaps that's already been done and I just didn't notice?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#12高增琦
pgf00a@gmail.com
In reply to: Simon Riggs (#11)
Re: crash-safe visibility map, take four

Hi,

Should we do full-page writes for the visibility map all the time?
Now, when clearing the visibility map, there is no full-page write for
the vm, since we don't save the buffer info in insert/update/delete's
WAL record.

Full-page writes are used to protect pages from disk failure. Without
them:
1) set vm: the vm bits that should be set to 1 may still be 0
2) clear vm: the vm bits that should be set to 0 may still be 1
Are these true? Or is the page totally unpredictable?

Another question:
To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
, should we just clear the vm bit before emitting the
insert/update/delete WAL record? This may reduce performance; is there
another solution?

Thanks.

--
GaoZengqi
pgf00a@gmail.com
zengqigao@gmail.com

On Fri, Mar 25, 2011 at 6:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Mar 23, 2011 at 6:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
Patch attached. It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something. Feel free to point and snicker in the latter case.

Looks simple, but there is now benefit...

Your tests and discussion remind me that I haven't yet seen any tests
that show that index-only scans would be useful for performance.

Everyone just seems to be assuming that they make a huge difference,
and that the difference is practically realisable in a common
workload.

Perhaps that's already been done and I just didn't notice?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: 高增琦 (#12)
Re: crash-safe visibility map, take four

On 30.03.2011 06:24, 高增琦 wrote:

Should we do full-page writes for the visibility map all the time?
Now, when clearing the visibility map, there is no full-page write for
the vm, since we don't save the buffer info in insert/update/delete's
WAL record.

Full-page writes are used to protect pages from disk failure. Without
them:
1) set vm: the vm bits that should be set to 1 may still be 0
2) clear vm: the vm bits that should be set to 0 may still be 1
Are these true? Or is the page totally unpredictable?

Not quite. WAL replay will set or clear vm bits regardless of full-page
writes. Full-page writes protect against torn pages, i.e. the problem
where some operations on a page have made it to disk while others have
not. That's not a problem for VM pages, as each bit on the page can be
set or cleared individually. But for something like a heap page, where
an offset at the beginning of the page points to the tuple elsewhere on
the page, you have to ensure that they stay in sync, even if you don't
otherwise care whether the update makes it to disk or not.

Another question:
To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
, should we just clear the vm before the log of insert/update/delete?
This may reduce the performance, is there another solution?

Yeah, that's a straightforward way to fix it. I don't think the
performance hit will be too bad. But we need to be careful not to hold
locks while doing I/O, which might require some rearrangement of the
code. We might want to do a dance similar to the one we do in vacuum:
call visibilitymap_pin first, then lock and update the heap page, and
then set the VM bit while holding the lock on the heap page.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: crash-safe visibility map, take four

On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Yeah, that's a straightforward way to fix it. I don't think the performance
hit will be too bad. But we need to be careful not to hold locks while doing
I/O, which might require some rearrangement of the code. We might want to do
a similar dance that we do in vacuum, and call visibilitymap_pin first, then
lock and update the heap page, and then set the VM bit while holding the
lock on the heap page.

Looking at heap_insert(), for example, we get the target buffer by
calling RelationGetBufferForTuple(), which kicks out an already
x-locked buffer. Until we have an x-lock on the buffer, we don't
actually know that there's enough free space for the new tuple. But
if we release the x-lock to pin the visibility map page, then, one,
we're adding an additional lock acquisition and release cycle, and,
two, by the time we reacquire the x-lock the amount of free space may
no longer be sufficient. Maybe we could check PD_ALL_VISIBLE before
taking the buffer lock - if it appears to be set, then we pin the
visibility map page before taking the buffer lock. Otherwise, we take
the buffer lock at once. Either way, once we have the lock, we
recheck the bit. Only if it's set and we haven't got a pin do we need
to do the drop-lock-pin-reacquire-lock dance. Is that at all
sensible?

I also wonder if we should be thinking about holding on to the
visibility map pin longer, rather than potentially reacquiring it for
every tuple inserted. For bulk loading it doesn't matter; we'll
usually be extending the relation anyway, so the PD_ALL_VISIBLE bits
won't be set and we needn't ever hit the map. But for a table where
there's a large amount of internal free space, it might matter. Then
again maybe we don't clear all-visible bits often enough for it to be
an issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15高增琦
pgf00a@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: crash-safe visibility map, take four

On Wed, Mar 30, 2011 at 8:52 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 30.03.2011 06:24, 高增琦 wrote:

Should we do full-page writes for the visibility map all the time?
Currently, when clearing the visibility map bit, there is no full-page
write for the vm page, since we don't save the buffer info in the
insert/update/delete's WAL record.

The full-page write is used to protect pages from disk failure. Without it:
1) set vm: the vm bits that should be set to 1 may still be 0
2) clear vm: the vm bits that should be set to 0 may still be 1
Are these true? Or is the page totally unpredictable?

Not quite. The WAL replay will set or clear vm bits, regardless of full
page writes. Full page writes protect from torn pages, ie. the problem where
some operations on a page have made it to disk while others have not. That's
not a problem for VM pages, as each bit on the page can be set or cleared
individually. But for something like a heap page where you have an offset in
the beginning of the page that points to the tuple elsewhere on the page,
you have to ensure that they stay in sync, even if you don't otherwise care
if the update makes it to disk or not.

Consider an example:
1. delete on two pages emits two log records: (1, page1, vm_clear_1), (2,
page2, vm_clear_2)
2. "vm_clear_1" and "vm_clear_2" are on the same vm page
3. checkpoint, and the vm page gets torn; vm_clear_2 is lost
4. delete on another page emits one log record (3, page1, vm_clear_3);
vm_clear_3 is still on that vm page
5. power down
6. startup; redo will replay all changes after the checkpoint, but the bit
for vm_clear_2 will never be cleared
Am I right?

Another question:

To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
, should we just clear the vm bit before logging the insert/update/delete?
This may reduce performance; is there another solution?

Yeah, that's a straightforward way to fix it. I don't think the performance
hit will be too bad. But we need to be careful not to hold locks while doing
I/O, which might require some rearrangement of the code. We might want to do
a similar dance that we do in vacuum, and call visibilitymap_pin first, then
lock and update the heap page, and then set the VM bit while holding the
lock on the heap page.

Do you mean we should lock the heap page first, then get the block number,
then release the heap page, then pin the vm page, and then lock both the
heap page and the vm page?
As Robert Haas said, when we lock the heap page again, there may not be
enough free space on it.
Is there a way to just stop the checkpoint for a while?

Thanks.
GaoZengqi

#16高增琦
pgf00a@gmail.com
In reply to: Robert Haas (#14)
Re: crash-safe visibility map, take four

2011/3/30 Robert Haas <robertmhaas@gmail.com>

Maybe we could check PD_ALL_VISIBLE before
taking the buffer lock - if it appears to be set, then we pin the
visibility map page before taking the buffer lock. Otherwise, we take
the buffer lock at once. Either way, once we have the lock, we
recheck the bit. Only if it's set and we haven't got a pin do we need
to do the drop-lock-pin-reacquire-lock dance. Is that at all
sensible?

But only the lock can make sure the page has enough free space.
If we try the drop-lock-...-lock dance, we may fall into an infinite loop.

--
GaoZengqi
pgf00a@gmail.com
zengqigao@gmail.com

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: 高增琦 (#15)
Re: crash-safe visibility map, take four

On 31.03.2011 11:33, 高增琦 wrote:

Consider an example:
1. delete on two pages emits two log records: (1, page1, vm_clear_1), (2,
page2, vm_clear_2)
2. "vm_clear_1" and "vm_clear_2" are on the same vm page
3. checkpoint, and the vm page gets torn; vm_clear_2 is lost
4. delete on another page emits one log record (3, page1, vm_clear_3);
vm_clear_3 is still on that vm page
5. power down
6. startup; redo will replay all changes after the checkpoint, but the bit
for vm_clear_2 will never be cleared
Am I right?

No. A page can only be torn by a hard crash, i.e. at step 5. A
checkpoint flushes all changes to disk; once the checkpoint finishes,
all the changes before it are safe on disk.

If you crashed between step 2 and 3, the VM page might be torn so that
only one of the vm_clears has made it to disk but the other has not. But
the WAL records for both are on disk anyway, so that will be corrected
at replay.

Another question:

To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
, should we just clear the vm bit before logging the insert/update/delete?
This may reduce performance; is there another solution?

Yeah, that's a straightforward way to fix it. I don't think the performance
hit will be too bad. But we need to be careful not to hold locks while doing
I/O, which might require some rearrangement of the code. We might want to do
a similar dance that we do in vacuum, and call visibilitymap_pin first, then
lock and update the heap page, and then set the VM bit while holding the
lock on the heap page.

Do you mean we should lock the heap page first, then get the block number,
then release the heap page, then pin the vm page, and then lock both the
heap page and the vm page?
As Robert Haas said, when we lock the heap page again, there may not be
enough free space on it.

I think the sequence would have to be:

1. Pin the heap page.
2. Check if the all-visible flag is set on the heap page (without lock).
If it is, pin the vm page.
3. Lock the heap page and check that it has enough free space.
4. Check again if the all-visible flag is set. If it is but we didn't
pin the vm page yet, release the lock and loop back to step 2.
5. Update heap page
6. Update vm page
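Sketched in C-ish pseudocode, the sequence looks roughly like this. The
buffer-manager calls (ReadBuffer, LockBuffer, PageIsAllVisible,
visibilitymap_pin) are PostgreSQL's real names, but the free-space check
and all error handling are elided, so this is an illustration rather
than working code:

```c
Buffer vmbuffer = InvalidBuffer;

buffer = ReadBuffer(relation, targetBlock);     /* 1. pin heap page */
for (;;)
{
    /* 2. peek at the flag without the lock; pinning may do I/O */
    if (PageIsAllVisible(BufferGetPage(buffer)) &&
        vmbuffer == InvalidBuffer)
        visibilitymap_pin(relation, targetBlock, &vmbuffer);

    /* 3. lock the heap page (free-space check goes here) */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /* 4. recheck under the lock; if the bit appeared and we hold no
     * pin, we must drop the lock and loop, since pinning can do I/O */
    if (!PageIsAllVisible(BufferGetPage(buffer)) ||
        vmbuffer != InvalidBuffer)
        break;
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
/* 5. update the heap page and write WAL */
/* 6. clear the vm bit via the already-pinned vmbuffer,
 *    still holding the heap page lock */
```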

Is there a way just stop the checkpoint for a while?

Not at the moment. It wouldn't be hard to add, though. I was about to
add a mechanism for that last autumn to fix a similar issue with b-tree
parent pointer updates
(http://archives.postgresql.org/message-id/4CCFEE61.2090702@enterprisedb.com),
but in the end it was solved differently.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com