Moving more work outside WALInsertLock

Started by Heikki Linnakangas about 14 years ago. 83 messages.
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
1 attachment(s)

I've been looking at various ways to make WALInsertLock less of a
bottleneck on multi-CPU servers. The key is going to be to separate the
two things that are done while holding the WALInsertLock: a) allocating
the required space in the WAL, and b) calculating the CRC of the record
header and copying the data to the WAL page. a) needs to be serialized,
but b) could be done in parallel.

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.
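
To illustrate the shape this is heading towards, here is a minimal sketch
of the intended division of labor. ReserveXLogSpace(), CopyRecordData(),
and the length variables are stand-in names for exposition only; the real
code inlines this logic in XLogInsert():

    /* Compute the full size up front, outside any lock: rmgr data plus
     * the BkpBlock headers and page images (minus any "holes"). */
    total_len = SizeOfXLogRecord + rdata_len + bkp_blocks_len;

    /* (a) The part that must be serialized: advance the insert position. */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    StartPos = ReserveXLogSpace(total_len);     /* stand-in */
    LWLockRelease(WALInsertLock);

    /* (b) The part that could run in parallel across backends: CRC the
     * record header and copy the data into the WAL buffers. */
    CopyRecordData(StartPos, rdata);            /* stand-in */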

This patch doesn't seem to have any performance or scalability impact. I
must admit I expected it to give a tiny gain in scalability by
shortening the time WALInsertLock is held by a few instructions, but I
can't measure any. But IMO it makes the code more readable, so this is
worthwhile for that reason alone.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-bkp-refactor-1.patch (text/x-diff)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 690,695 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 690,698 ----
  	uint32		freespace;
  	int			curridx;
  	XLogRecData *rdt;
+ 	XLogRecData *rdt_cur;
+ 	XLogRecData	*rdt_bkp_first;
+ 	XLogRecData *rdt_bkp_last;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
***************
*** 704,709 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 707,713 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	uint8		info_final;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 727,738 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	}
  
  	/*
! 	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up, and compute the CRC values for the data.  Note that the record
! 	 * header isn't added into the CRC initially since we don't know the final
! 	 * length or info bits quite yet.  Thus, the CRC will represent the CRC of
! 	 * the whole record in the order "rdata, then backup blocks, then record
! 	 * header".
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
--- 731,738 ----
  	}
  
  	/*
! 	 * Here we scan the rdata chain to determine which buffers must be backed
! 	 * up.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
***************
*** 746,751 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 746,752 ----
  	 * over the chain later.
  	 */
  begin:;
+ 	info_final = info;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
***************
*** 760,766 **** begin:;
  	 */
  	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
- 	INIT_CRC32(rdata_crc);
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 761,766 ----
***************
*** 768,774 **** begin:;
  		{
  			/* Simple data, just include it */
  			len += rdt->len;
- 			COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  		}
  		else
  		{
--- 768,773 ----
***************
*** 779,790 **** begin:;
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
  						rdt->data = NULL;
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
--- 778,789 ----
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
+ 					{
  						rdt->data = NULL;
+ 						rdt->len = 0;
+ 					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
***************
*** 796,807 **** begin:;
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
  					}
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  			}
--- 795,804 ----
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
+ 						rdt->len = 0;
  					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  			}
***************
*** 816,854 **** begin:;
  	}
  
  	/*
! 	 * Now add the backup block headers and data into the CRC
  	 */
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		if (dtbuf_bkp[i])
  		{
! 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
! 			char	   *page;
! 
! 			COMP_CRC32(rdata_crc,
! 					   (char *) bkpb,
! 					   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb->hole_length == 0)
! 			{
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   BLCKSZ);
! 			}
! 			else
! 			{
! 				/* must skip the hole */
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   bkpb->hole_offset);
! 				COMP_CRC32(rdata_crc,
! 						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			}
  		}
  	}
  
  	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
  	 * error checking in ReadRecord.  This means that all callers of
  	 * XLogInsert must supply at least some not-in-a-buffer data.  However, we
--- 813,898 ----
  	}
  
  	/*
! 	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  They are kept in
! 	 * a separate rdata chain until we have the lock, and know that we won't
! 	 * need to restart from scratch anymore. The real rdata chain is kept
! 	 * unmodified until then.
! 	 *
! 	 * At the exit of this loop, write_len includes the backup block data.
! 	 *
! 	 * Also set the appropriate info bits to show which buffers were backed
! 	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
! 	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
  	 */
+ 	write_len = len;
+ 	rdt_bkp_first = rdt_bkp_last = NULL;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		BkpBlock   *bkpb;
! 		char	   *page;
! 
! 		if (!dtbuf_bkp[i])
! 			continue;
! 
! 		info_final |= XLR_SET_BKP_BLOCK(i);
! 
! 		bkpb = &(dtbuf_xlg[i]);
! 		page = (char *) BufferGetBlock(dtbuf[i]);
! 
! 		if (rdt_bkp_first == NULL)
! 			rdt_bkp_first = &(dtbuf_rdt1[i]);
! 		else
! 			rdt_bkp_last->next = &(dtbuf_rdt1[i]);
! 
! 		rdt_bkp_last = &(dtbuf_rdt1[i]);
! 
! 		rdt_bkp_last->data = (char *) bkpb;
! 		rdt_bkp_last->len = sizeof(BkpBlock);
! 		write_len += sizeof(BkpBlock);
! 
! 		rdt_bkp_last->next = &(dtbuf_rdt2[i]);
! 		rdt_bkp_last = rdt_bkp_last->next;
! 
! 		if (bkpb->hole_length == 0)
  		{
! 			rdt_bkp_last->data = page;
! 			rdt_bkp_last->len = BLCKSZ;
! 			write_len += BLCKSZ;
! 			rdt_bkp_last->next = NULL;
! 		}
! 		else
! 		{
! 			/* must skip the hole */
! 			rdt_bkp_last->data = page;
! 			rdt_bkp_last->len = bkpb->hole_offset;
! 			write_len += bkpb->hole_offset;
! 
! 			rdt_bkp_last->next = &(dtbuf_rdt3[i]);
! 			rdt_bkp_last = rdt_bkp_last->next;
! 
! 			rdt_bkp_last->data = page + (bkpb->hole_offset + bkpb->hole_length);
! 			rdt_bkp_last->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! 			write_len += rdt_bkp_last->len;
! 			rdt_bkp_last->next = NULL;
  		}
  	}
  
  	/*
+ 	 * Calculate CRC of the data, including all the backup blocks
+ 	 *
+ 	 * Note that the record header isn't added into the CRC initially since
+ 	 * we don't know the prev-link yet.  Thus, the CRC will represent the CRC
+ 	 * of the whole record in the order "rdata, then backup blocks, then
+ 	 * record header".
+ 	 */
+ 	INIT_CRC32(rdata_crc);
+ 	for (rdt_cur = rdt; rdt_cur != NULL; rdt_cur = rdt_cur->next)
+ 		COMP_CRC32(rdata_crc, rdt_cur->data, rdt_cur->len);
+ 	for (rdt_cur = rdt_bkp_first; rdt_cur != NULL; rdt_cur = rdt_cur->next)
+ 		COMP_CRC32(rdata_crc, rdt_cur->data, rdt_cur->len);
+ 
+ 	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
  	 * error checking in ReadRecord.  This means that all callers of
  	 * XLogInsert must supply at least some not-in-a-buffer data.  However, we
***************
*** 913,974 **** begin:;
  	}
  
  	/*
! 	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  Note that we have
! 	 * now irrevocably changed the input rdata chain.  At the exit of this
! 	 * loop, write_len includes the backup block data.
! 	 *
! 	 * Also set the appropriate info bits to show which buffers were backed
! 	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
! 	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
  	 */
! 	write_len = len;
! 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 	{
! 		BkpBlock   *bkpb;
! 		char	   *page;
! 
! 		if (!dtbuf_bkp[i])
! 			continue;
! 
! 		info |= XLR_SET_BKP_BLOCK(i);
! 
! 		bkpb = &(dtbuf_xlg[i]);
! 		page = (char *) BufferGetBlock(dtbuf[i]);
! 
! 		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = rdt->next;
! 
! 		rdt->data = (char *) bkpb;
! 		rdt->len = sizeof(BkpBlock);
! 		write_len += sizeof(BkpBlock);
! 
! 		rdt->next = &(dtbuf_rdt2[i]);
! 		rdt = rdt->next;
! 
! 		if (bkpb->hole_length == 0)
! 		{
! 			rdt->data = page;
! 			rdt->len = BLCKSZ;
! 			write_len += BLCKSZ;
! 			rdt->next = NULL;
! 		}
! 		else
! 		{
! 			/* must skip the hole */
! 			rdt->data = page;
! 			rdt->len = bkpb->hole_offset;
! 			write_len += bkpb->hole_offset;
! 
! 			rdt->next = &(dtbuf_rdt3[i]);
! 			rdt = rdt->next;
! 
! 			rdt->data = page + (bkpb->hole_offset + bkpb->hole_length);
! 			rdt->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! 			write_len += rdt->len;
! 			rdt->next = NULL;
! 		}
! 	}
  
  	/*
  	 * If there isn't enough space on the current XLOG page for a record
--- 957,966 ----
  	}
  
  	/*
! 	 * We can now link the backup block rdata entries into the chain. This
! 	 * changes the rdata chain irrevocably.
  	 */
! 	rdt->next = rdt_bkp_first;
  
  	/*
  	 * If there isn't enough space on the current XLOG page for a record
***************
*** 1031,1037 **** begin:;
  	record->xl_xid = GetCurrentTransactionIdIfAny();
  	record->xl_tot_len = SizeOfXLogRecord + write_len;
  	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
  	record->xl_rmid = rmid;
  
  	/* Now we can finish computing the record's CRC */
--- 1023,1029 ----
  	record->xl_xid = GetCurrentTransactionIdIfAny();
  	record->xl_tot_len = SizeOfXLogRecord + write_len;
  	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info_final;
  	record->xl_rmid = rmid;
  
  	/* Now we can finish computing the record's CRC */
#2 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#1)
Re: Moving more work outside WALInsertLock

On Thursday, December 15, 2011 02:51:33 PM Heikki Linnakangas wrote:

I've been looking at various ways to make WALInsertLock less of a
bottleneck on multi-CPU servers. The key is going to be to separate the
two things that are done while holding the WALInsertLock: a) allocating
the required space in the WAL, and b) calculating the CRC of the record
header and copying the data to the WAL page. a) needs to be serialized,
but b) could be done in parallel.

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.

This patch doesn't seem to have any performance or scalability impact. I
must admit I expected it to give a tiny gain in scalability by
shortening the time WALInsertLock is held by a few instructions, but I
can't measure any. But IMO it makes the code more readable, so this is
worthwhile for that reason alone.

That's great! I did (or at least tried) something similar when I was playing
around with another crc32 implementation (which I plan to finish sometime). My
changes were totally wacky, but I got rather big improvements when changing
the CRC computation from incremental to one big swoop.
I started to hack up an API which buffered xlog data in a statically sized
buffer in each backend and only submitted it every now and then. I never got
that to actually work correctly in more than the simplest cases, though ;). In
many cases we're taking the WAL insert lock way too often during a single
insert... (you obviously know that...)

Andres

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Moving more work outside WALInsertLock

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?

regards, tom lane

#4 Jeff Janes
jeff.janes@gmail.com
In reply to: Tom Lane (#3)
Re: Moving more work outside WALInsertLock

On Thu, Dec 15, 2011 at 7:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?

Currently the CRC of all the data minus the header is computed outside the
lock, but then the header's contribution is added and the CRC is finalized
inside the lock.
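
Condensed from xlog.c (the same macros appear in the patch above), the
division Jeff describes looks roughly like this:

    INIT_CRC32(rdata_crc);
    for (rdt = rdata; rdt != NULL; rdt = rdt->next)   /* outside the lock */
        COMP_CRC32(rdata_crc, rdt->data, rdt->len);

    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    record->xl_prev = Insert->PrevRecord;             /* only known here */
    /* fold in the header (everything after xl_crc, including xl_prev) */
    COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
               SizeOfXLogRecord - sizeof(pg_crc32));
    FIN_CRC32(rdata_crc);
    record->xl_crc = rdata_crc;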

Cheers,

Jeff

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#4)
Re: Moving more work outside WALInsertLock

Jeff Janes <jeff.janes@gmail.com> writes:

On Thu, Dec 15, 2011 at 7:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?

Currently the CRC of all the data minus the header is computed outside the
lock, but then the header's contribution is added and the CRC is finalized
inside the lock.

Quite. AFAICS that is not optional, unless you are proposing to remove
the prev_link from the scope of the CRC, which is not exactly a
penalty-free change.

regards, tom lane

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#5)
Re: Moving more work outside WALInsertLock

On 15.12.2011 18:48, Tom Lane wrote:

Jeff Janes<jeff.janes@gmail.com> writes:

On Thu, Dec 15, 2011 at 7:34 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?

Currently the CRC of all the data minus the header is computed outside the
lock, but then the header's contribution is added and the CRC is finalized
inside the lock.

Quite. AFAICS that is not optional,

Right, my patch did not change that.

unless you are proposing to remove
the prev_link from the scope of the CRC, which is not exactly a
penalty-free change.

We could CRC the rest of the record header before getting the lock,
though, and only include the prev-link while holding the lock. I
micro-benchmarked that a little bit, but didn't see much benefit from
doing just that. Once you make more drastic changes, so that the lock
doesn't need to be held while copying the data and calculating the CRC
of the record header, and those things can be done in parallel, it
matters even less.
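
A hedged sketch of that variant (not what any patch in this thread does):
it would require reordering the CRC so that xl_prev is folded in last,
which changes the on-disk CRC definition. rechdr stands for a local copy
of the record header, as in the WIP patch later in this thread:

    /* Outside the lock: rdata, then the header fields after xl_prev. */
    INIT_CRC32(rdata_crc);
    for (rdt = rdata; rdt != NULL; rdt = rdt->next)
        COMP_CRC32(rdata_crc, rdt->data, rdt->len);
    COMP_CRC32(rdata_crc,
               (char *) &rechdr + offsetof(XLogRecord, xl_xid),
               SizeOfXLogRecord - offsetof(XLogRecord, xl_xid));

    /* Inside the lock: only the prev-link remains to be folded in. */
    rechdr.xl_prev = Insert->PrevRecord;
    COMP_CRC32(rdata_crc, (char *) &rechdr.xl_prev, sizeof(XLogRecPtr));
    FIN_CRC32(rdata_crc);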

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: Moving more work outside WALInsertLock

On 15.12.2011 17:34, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.

Nope. Without this patch, the total length including the backup blocks,
write_len, is added up in the loop that creates the rdata entries for
backup blocks. That is done while holding the lock (see code beginning
with comment "Make additional rdata chain entries for the backup blocks").

Admittedly you could sum up the total length quite easily in the earlier
loop, before we grab the lock, where we calculate the CRC of the backup
blocks ("Now add the backup block headers and data into the CRC"). That
would be a smaller patch.

Please try again to explain what you're doing?

Ok: I'm moving the creation of rdata entries for backup blocks outside
the critical section, so that it's done before grabbing the lock. I'm
also moving the CRC calculation so that it's done after all the rdata
entries have been created, including the ones for backup blocks. It's
more readable to do it that way, as a separate step, instead of
sprinkling the COMP_CRC macros in many places.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#7)
Re: Moving more work outside WALInsertLock

On Thu, Dec 15, 2011 at 7:06 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Please try again to explain what you're doing?

Ok: I'm moving the creation of rdata entries for backup blocks outside the
critical section, so that it's done before grabbing the lock. I'm also
moving the CRC calculation so that it's done after all the rdata entries
have been created, including the ones for backup blocks. It's more readable
to do it that way, as a separate step, instead of sprinkling the COMP_CRC
macros in many places.

There's a comment that says we can't undo the linking of the rdata
chains, but it looks like a reversible process to me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#9 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Moving more work outside WALInsertLock

On Thu, Dec 15, 2011 at 6:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

unless you are proposing to remove
the prev_link from the scope of the CRC, which is not exactly a
penalty-free change.

We could CRC the rest of the record header before getting the lock, though,
and only include the prev-link while holding the lock. I micro-benchmarked
that a little bit, but didn't see much benefit from doing just that. Once
you do more drastic changes so that the lock doesn't need to be held while
copying the data and calculating the CRC of the record header, so that those
things can be done in parallel, it matters even less.

You missed your cue to discuss leaving the prev link out of the CRC altogether.

On its own that sounds dangerous, but it's not. When we need to confirm
the prev link we already know what we expect it to be, so CRC-ing it
is overkill. That isn't true of any other part of the WAL record, so
the prev link is the only thing we can relax, but that's OK because we
can CRC check everything else outside of the locked section.

That isn't my idea, but I'm happy to put it on the table since I'm not shy.
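
For context, a simplified sketch of the check Simon is alluding to;
expectedPrev is a stand-in for the reader-side state (ReadRecord() in
xlog.c tracks where the record it just read began):

    /* When following the WAL sequentially, the expected prev-link is
     * already known, so a bogus one is caught by direct comparison: */
    if (!XLByteEQ(record->xl_prev, expectedPrev))
    {
        /* report "record with incorrect prev-link" and stop replay */
    }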

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#9)
Re: Moving more work outside WALInsertLock

Simon Riggs <simon@2ndQuadrant.com> writes:

You missed your cue to discuss leaving the prev link out of the CRC altogether.

On its own that sounds dangerous, but it's not. When we need to confirm
the prev link we already know what we expect it to be, so CRC-ing it
is overkill. That isn't true of any other part of the WAL record, so
the prev link is the only thing we can relax, but that's OK because we
can CRC check everything else outside of the locked section.

That isn't my idea, but I'm happy to put it on the table since I'm not shy.

I'm glad it's not your idea, because it's a bad one. A large part of
the point of CRC'ing WAL records is to guard against torn-page problems
in the WAL files, and doing things like that would give up a significant
part of that protection, because there would no longer be any assurance
that the body of a WAL record had anything to do with its prev_link.

Consider a scenario like this:

* We write a WAL record that starts 8 bytes before a sector boundary,
so that the prev_link is in one sector and the rest of the record in
the next one(s).

* Time passes, and we recycle that WAL file.

* We write another WAL record that starts 8 bytes before the same sector
boundary, so that the prev_link is in one sector and the rest of the
record in the next one(s).

* System crashes, after having written out the earlier sector but not
the later one(s).

On restart, the replay code will see a prev_link that matches what it
expects. If the CRC for the remainder of the record is not dependent
on the prev_link, then the remainder of the old record will look good
too, and we'll attempt to replay it, n*16MB too late.

Including the prev_link in the CRC adds a significant amount of
protection against such problems. We should not remove this protection
in the name of shaving a few cycles.

regards, tom lane

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#10)
Re: Moving more work outside WALInsertLock

On 16.12.2011 05:27, Tom Lane wrote:

* We write a WAL record that starts 8 bytes before a sector boundary,
so that the prev_link is in one sector and the rest of the record in
the next one(s).

prev-link is not the first field in the header. The CRC is.

* Time passes, and we recycle that WAL file.

* We write another WAL record that starts 8 bytes before the same sector
boundary, so that the prev_link is in one sector and the rest of the
record in the next one(s).

* System crashes, after having written out the earlier sector but not
the later one(s).

On restart, the replay code will see a prev_link that matches what it
expects. If the CRC for the remainder of the record is not dependent
on the prev_link, then the remainder of the old record will look good
too, and we'll attempt to replay it, n*16MB too late.

The CRC would be in the previous sector with the prev-link, so the CRC
of the old record would have to match the CRC of the new record. I guess
that's not totally impossible, though - there could be some WAL-logged
operations where the payload of the WAL record is often exactly the
same. Like a heap clean record, when the same page is repeatedly pruned.

Including the prev_link in the CRC adds a significant amount of
protection against such problems. We should not remove this protection
in the name of shaving a few cycles.

Yeah. I did some quick testing with a patch to leave prev-link out of
the calculation, and move the record CRC calculation outside the lock,
too. I don't remember the numbers, but while it did make some
difference, it didn't seem worthwhile.

Anyway, I'm looking at ways to make the memcpy() of the payload happen
without the lock, in parallel, and once you do that the record header
CRC calculation can be done in parallel, too. That makes it irrelevant
from a performance point of view whether the prev-link is included in
the CRC or not.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#11)
Re: Moving more work outside WALInsertLock

On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Anyway, I'm looking at ways to make the memcpy() of the payload happen
without the lock, in parallel, and once you do that the record header CRC
calculation can be done in parallel, too. That makes it irrelevant from a
performance point of view whether the prev-link is included in the CRC or
not.

Better plan. So we keep the prev link in the CRC.

I already proposed a design for that using page-level share locks; any
reason not to go with that?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Moving more work outside WALInsertLock

On 16.12.2011 14:37, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Anyway, I'm looking at ways to make the memcpy() of the payload happen
without the lock, in parallel, and once you do that the record header CRC
calculation can be done in parallel, too. That makes it irrelevant from a
performance point of view whether the prev-link is included in the CRC or
not.

Better plan. So we keep the prev link in the CRC.

I already proposed a design for that using page-level share locks; any
reason not to go with that?

Sorry, I must've missed that. Got a link?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Moving more work outside WALInsertLock

On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 16.12.2011 14:37, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Anyway, I'm looking at ways to make the memcpy() of the payload happen
without the lock, in parallel, and once you do that the record header CRC
calculation can be done in parallel, too. That makes it irrelevant from a
performance point of view whether the prev-link is included in the CRC or
not.

Better plan. So we keep the prev link in the CRC.

I already proposed a design for that using page-level share locks; any
reason not to go with that?

Sorry, I must've missed that. Got a link?

From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#14)
Re: Moving more work outside WALInsertLock

On 16.12.2011 15:03, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 16.12.2011 14:37, Simon Riggs wrote:

I already proposed a design for that using page-level share locks; any
reason not to go with that?

Sorry, I must've missed that. Got a link?

From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey

Ah, thanks. That is similar to what I'm experimenting with, but a second
lwlock is still fairly heavy-weight. I think with many backends, you
will be beaten badly by contention on the spinlocks alone.

I'll polish up and post what I've been experimenting with, so we can
discuss that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#15)
Re: Moving more work outside WALInsertLock

On 16.12.2011 15:42, Heikki Linnakangas wrote:

On 16.12.2011 15:03, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 16.12.2011 14:37, Simon Riggs wrote:

I already proposed a design for that using page-level share locks; any
reason not to go with that?

Sorry, I must've missed that. Got a link?

From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey

Ah, thanks. That is similar to what I'm experimenting with, but a second
lwlock is still fairly heavy-weight. I think with many backends, you
will be beaten badly by contention on the spinlocks alone.

I'll polish up and post what I've been experimenting with, so we can
discuss that.

So, here's a WIP patch of what I've been working on. The WAL insertion
is split into two stages:

1. Reserve the space from the WAL stream. This is done while holding a
spinlock. The page holding the reserved space doesn't necessarily need to
be in cache yet; the reservation can run ahead of the WAL buffer cache.
(Quick testing suggests that an lwlock is too heavy-weight for this.)

2. Ensure the page is in the WAL buffer cache. If not, initialize it,
evicting old pages if needed. Then finish the CRC calculation of the
header and memcpy the record in place. (If the record spans multiple
pages, this operates on one page at a time, to avoid problems with
running out of WAL buffers.)

As long as wal_buffers is high enough, and the I/O can keep up, stage 2
can happen in parallel in many backends. The WAL writer process
pre-initializes new pages ahead of the insertions, so regular backends
rarely need to do that.

When a page is written out, with XLogWrite(), you need to wait for any
in-progress insertions to the pages you're about to write out to finish.
For that, every backend has a slot with an XLogRecPtr in shared memory.
It's set to the position that backend is currently inserting to. If
there's no insertion in progress, it's invalid, but when it's valid it
acts like a barrier, so that no one is allowed to XLogWrite() beyond that
position. That's very lightweight for the backends, but at the moment I'm
using busy-waiting to wait for an insertion to finish. That should be
replaced with something smarter; that's the biggest missing part of the
patch.
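
In rough outline, using the names from the attached patch (insertpos_lck,
CurrPos, the per-backend slot), with AddToXLogRecPtr() and
CopyRecordToReservedSpace() as stand-ins hiding the page-header
bookkeeping and the copy loop:

    /* Stage 1: reserve the space and advertise the in-progress insert. */
    SpinLockAcquire(&Insert->insertpos_lck);
    StartPos = Insert->CurrPos;
    Insert->CurrPos = AddToXLogRecPtr(StartPos, total_len); /* stand-in */
    myslot->CurrPos = StartPos;     /* no XLogWrite() may pass this point */
    SpinLockRelease(&Insert->insertpos_lck);

    /* Stage 2: can run concurrently in many backends. */
    CopyRecordToReservedSpace(StartPos, &rechdr, rdata);    /* stand-in */

    /* Done: clear the slot so writers may proceed past StartPos. */
    myslot->CurrPos = InvalidXLogRecPtr;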

One simple way to test the performance impact of this is:

psql -c "DROP TABLE IF EXISTS foo; CREATE TABLE foo (id int4);
CHECKPOINT" postgres
echo "BEGIN; INSERT INTO foo SELECT i FROM generate_series(1, 10000) i;
ROLLBACK" > parallel-insert-test.sql
pgbench -n -T 10 -c4 -f parallel-insert-test.sql postgres

On my dual-core laptop, this patch increases the tps on that from about
60 to 110.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#16)
1 attachment(s)
Re: Moving more work outside WALInsertLock

On 23.12.2011 10:13, Heikki Linnakangas wrote:

So, here's a WIP patch of what I've been working on.

And here's the patch I forgot to attach..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-1.patch (text/x-diff)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 272,277 **** static XLogRecPtr RedoRecPtr;
--- 273,290 ----
   */
  static XLogRecPtr RedoStartLSN = {0, 0};
  
+ /*
+  * We have one of these for each backend, plus one that is shared by all
+  * auxiliary processes. WALAuxSlotLock is used to coordinate access to the
+  * shared slot.
+  */
+ typedef struct
+ {
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ } BackendXLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	(MaxBackends + 1)
+ 
  /*----------
   * Shared-memory data structures for XLOG control
   *
***************
*** 282,292 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
   * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
   *		XLogCtl->LogwrtResult is protected by info_lck
   *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
-  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
   * One must hold the associated lock to read or write any of these, but
   * of course no lock is needed to read/write the unshared LogwrtResult.
   *
--- 295,304 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually two shared-memory copies of
   * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
   *		XLogCtl->LogwrtResult is protected by info_lck
   *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
   * One must hold the associated lock to read or write any of these, but
   * of course no lock is needed to read/write the unshared LogwrtResult.
   *
***************
*** 296,307 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * is that it can be examined/modified by code that already holds WALWriteLock
   * without needing to grab info_lck as well.
   *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 308,315 ----
   * is that it can be examined/modified by code that already holds WALWriteLock
   * without needing to grab info_lck as well.
   *
!  * The unshared LogwrtResult may lag behind either or both of these, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 311,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 319,333 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 326,331 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 339,392 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each backend has its own "XLog insertion
+  * slot", which is used to indicate the position the backend is writing to.
+  * The slot is marked as in-use in step 1, while holding insertpos_lck, by
+  * setting the position field in the slot. When the backend is finished with
+  * the insertion, it can clear its slot without a lock.
+  *
+  * Before 9.2, WAL insertion was serialized on one big lock, so that once
+  * you finished inserting your record you knew that all the previous records
+  * were inserted too. That is no longer true, there can be insertions to
+  * earlier positions still in-progress when your insertion finishes. To wait
+  * for them to finish, use WaitXLogInsertionsToFinish(). It polls (FIXME:
+  * busy-waiting is bad) the array of per-backend XLog insertion slots until
+  * it sees that all of the insertions to earlier locations have finished.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for a backend to finish
+  * (or at least advance to next uninitialized page), while you're holding
+  * WALWriteLock. That would be bad, because the backend you're waiting for
+  * might need to acquire WALWriteLock, too, to evict an old buffer, so you'd
+  * get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition. It can't get stuck, because an insertion to a
+  * WAL page that's already initialized in cache can always proceed without
+  * waiting on a lock. However, if the page has *just* been initialized, the
+  * insertion might still briefly acquire WALBufMappingLock to observe that
+  * fact.
+  *
   *----------
   */
  
***************
*** 346,356 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 407,431 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct. */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values.
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 381,389 **** typedef struct XLogCtlWrite
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 456,466 ----
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	BackendXLogInsertSlot *BackendXLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 397,405 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 474,492 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 468,497 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
! 
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
! 
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
! 
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
  #define NextBufIdx(idx)		\
  		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
   * See discussion above.
   */
--- 555,583 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
! #define INSERT_FREESPACE(endptr)	\
! 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
  #define NextBufIdx(idx)		\
  		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
+  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
+  * would hold if it was in cache, the page containing 'recptr'.
+  *
+  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
+  * page is taken to mean the previous page.
+  */
+ #define XLogRecPtrToBufIdx(recptr)	\
+ 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
+ 
+ #define XLogRecEndPtrToBufIdx(recptr)	\
+ 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
+ 
+ /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
   * See discussion above.
   */
***************
*** 614,620 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
  static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
--- 700,706 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(bool new_segment, XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
  static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
***************
*** 663,668 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 749,766 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len, XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool forcePageWrites);
+ static bool ReserveXLogInsertLocation(int reqsize, bool forcePageWrites,
+ 						  XLogRecPtr *PrevRecord, XLogRecPtr *StartPos,
+ 						  XLogRecPtr *EndPos,
+ 						  volatile BackendXLogInsertSlot *myslot);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 683,695 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
--- 781,789 ----
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	RecPtr;
  	XLogRecData *rdt;
+ 	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
***************
*** 701,709 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 795,805 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
+ 	bool		forcePageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	uint8		info_final;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 726,751 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  		return RecPtr;
  	}
  
  	/*
  	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up, and compute the CRC values for the data.  Note that the record
! 	 * header isn't added into the CRC initially since we don't know the final
! 	 * length or info bits quite yet.  Thus, the CRC will represent the CRC of
! 	 * the whole record in the order "rdata, then backup blocks, then record
! 	 * header".
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
  	 * insert lock, but it seems better to avoid doing CRC calculations while
! 	 * holding the lock.  This means we have to be careful about modifying the
! 	 * rdata chain until we know we aren't going to loop back again.  The only
! 	 * change we allow ourselves to make earlier is to set rdt->data = NULL in
! 	 * chain items we have decided we will have to back up the whole buffer
! 	 * for.  This is OK because we will certainly decide the same thing again
! 	 * for those items if we do it over; doing it here saves an extra pass
! 	 * over the chain later.
  	 */
  begin:;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
--- 822,850 ----
  		return RecPtr;
  	}
  
+ 	/* TODO */
+ 	if (isLogSwitch)
+ 	{
+ 		elog(LOG, "log switch not implemented");
+ 		return InvalidXLogRecPtr;
+ 	}
+ 
  	/*
  	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
  	 * insert lock, but it seems better to avoid doing CRC calculations while
! 	 * holding the lock.
! 	 *
! 	 * To avoid modifying the original rdata chain, we copy it into
! 	 * rdata_final. Later we will also add entries for the backup blocks into
! 	 * rdata_final, so that they don't need any special treatment in the
! 	 * critical section where the chunks are copied into the WAL buffers.
  	 */
  begin:;
+ 	info_final = info;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
***************
*** 758,766 **** begin:;
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
- 	INIT_CRC32(rdata_crc);
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 857,865 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	forcePageWrites = Insert->forcePageWrites;
! 	doPageWrites = fullPageWrites || forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
  	{
***************
*** 768,774 **** begin:;
  		{
  			/* Simple data, just include it */
  			len += rdt->len;
- 			COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  		}
  		else
  		{
--- 867,872 ----
***************
*** 779,790 **** begin:;
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
  						rdt->data = NULL;
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
--- 877,888 ----
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
+ 					{
  						rdt->data = NULL;
+ 						rdt->len = 0;
+ 					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
***************
*** 796,807 **** begin:;
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
  					}
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  			}
--- 894,903 ----
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
+ 						rdt->len = 0;
  					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  			}
***************
*** 814,852 **** begin:;
  			break;
  		rdt = rdt->next;
  	}
! 
! 	/*
! 	 * Now add the backup block headers and data into the CRC
! 	 */
! 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 	{
! 		if (dtbuf_bkp[i])
! 		{
! 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
! 			char	   *page;
! 
! 			COMP_CRC32(rdata_crc,
! 					   (char *) bkpb,
! 					   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb->hole_length == 0)
! 			{
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   BLCKSZ);
! 			}
! 			else
! 			{
! 				/* must skip the hole */
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   bkpb->hole_offset);
! 				COMP_CRC32(rdata_crc,
! 						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			}
! 		}
! 	}
  
  	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
--- 910,917 ----
  			break;
  		rdt = rdt->next;
  	}
! 	/* Remember that this was the last regular rdata entry */
! 	rdt_lastnormal = rdt;
  
  	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
***************
*** 858,922 **** begin:;
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
- 	/*
- 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
- 	 * back and recompute everything.  This can only happen just after a
- 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
- 	 *
- 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
- 	 * affect the contents of the XLOG record, so we'll update our local copy
- 	 * but not force a recomputation.
- 	 */
- 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
- 	{
- 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
- 		RedoRecPtr = Insert->RedoRecPtr;
- 
- 		if (doPageWrites)
- 		{
- 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- 			{
- 				if (dtbuf[i] == InvalidBuffer)
- 					continue;
- 				if (dtbuf_bkp[i] == false &&
- 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
- 				{
- 					/*
- 					 * Oops, this buffer now needs to be backed up, but we
- 					 * didn't think so above.  Start over.
- 					 */
- 					LWLockRelease(WALInsertLock);
- 					END_CRIT_SECTION();
- 					goto begin;
- 				}
- 			}
- 		}
- 	}
- 
- 	/*
- 	 * Also check to see if forcePageWrites was just turned on; if we weren't
- 	 * already doing full-page writes then go back and recompute. (If it was
- 	 * just turned off, we could recompute the record without full pages, but
- 	 * we choose not to bother.)
- 	 */
- 	if (Insert->forcePageWrites && !doPageWrites)
- 	{
- 		/* Oops, must redo it with full-page data */
- 		LWLockRelease(WALInsertLock);
- 		END_CRIT_SECTION();
- 		goto begin;
- 	}
- 
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  Note that we have
! 	 * now irrevocably changed the input rdata chain.  At the exit of this
! 	 * loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
--- 923,936 ----
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  We have now
! 	 * modified the original rdata chain, but we remembered the last regular
! 	 * entry in rdt_lastnormal, so we can undo this if we have to loop back
! 	 * to the beginning.
! 	 *
! 	 * At the exit of this loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
***************
*** 931,943 **** begin:;
  		if (!dtbuf_bkp[i])
  			continue;
  
! 		info |= XLR_SET_BKP_BLOCK(i);
  
  		bkpb = &(dtbuf_xlg[i]);
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = rdt->next;
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
--- 945,957 ----
  		if (!dtbuf_bkp[i])
  			continue;
  
! 		info_final |= XLR_SET_BKP_BLOCK(i);
  
  		bkpb = &(dtbuf_xlg[i]);
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = &(dtbuf_rdt1[i]);
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
***************
*** 971,1044 **** begin:;
  	}
  
  	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
  	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
! 	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
! 
! 		END_CRIT_SECTION();
! 
! 		return RecPtr;
! 	}
! 
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 985,1013 ----
  	}
  
  	/*
! 	 * Calculate CRC of the data, including all the backup blocks
! 	 *
! 	 * Note that the record header isn't added into the CRC initially since
! 	 * we don't know the prev-link yet.  Thus, the CRC will represent the CRC
! 	 * of the whole record in the order "rdata, then backup blocks, then
! 	 * record header".
  	 */
! 	INIT_CRC32(rdata_crc);
! 	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
! 		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	rechdr.xl_prev = InvalidXLogRecPtr; /* TO BE DETERMINED */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1059,1231 **** begin:;
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
! 
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
! 		{
! 			if (rdata->len > freespace)
! 			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
! 				rdata->data += freespace;
! 				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
! 			}
! 		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
- 	/* Ensure next record will be properly aligned */
- 	Insert->currpos = (char *) Insert->currpage +
- 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
- 	freespace = INSERT_FREESPACE(Insert);
- 
  	/*
  	 * The recptr I return is the beginning of the *next* record. This will be
  	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
! 		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
! 		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
! 		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
! 		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
! 	}
! 	else
! 	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
! 		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
! 		}
! 		else
! 		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
! 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
  		/* update local result copy while I have the chance */
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
- 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1028,1505 ----
  	}
  #endif
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, &rechdr, rdata, rdata_crc,
! 							forcePageWrites);
! 	END_CRIT_SECTION();
  
! 	if (XLogRecPtrIsInvalid(RecPtr))
! 	{
! 		/*
! 		 * Oops, have to retry. Unlink the backup blocks from the chain,
! 		 * restoring it to its original state.
! 		 */
! 		rdt_lastnormal->next = NULL;
  
! 		goto begin;
  	}
  
  	/*
  	 * The recptr I return is the beginning of the *next* record. This will be
  	 * stored as LSN for changed data pages...
  	 */
! 	return RecPtr;
! }
! 
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here,
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns pointer to end of inserted record like XLogInsert().
!  * If RedoRecPtr or forcePageWrites has changed, returns InvalidXLogRecPtr, and
!  * the caller must recalculate full-page-images and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool forcePageWrites)
! {
! 	volatile BackendXLogInsertSlot *myslot;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			tot_left;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
  
  	/*
! 	 * Our fast insertion method requires each process to have its own
! 	 * "slot", to tell others that it's still busy doing the insertion. Each
! 	 * regular backend has a dedicated slot, but auxiliary processes share
! 	 * one extra slot. Aux processes don't write a lot of WAL so they can
! 	 * one extra slot. Aux processes don't write a lot of WAL, so sharing
! 	 * is fine. WALAuxSlotLock is used to coordinate access to the slot
  	 */
! 	if (MyBackendId == InvalidBackendId)
  	{
! 		LWLockAcquire(WALAuxSlotLock, LW_EXCLUSIVE);
! 		myslot = &XLogCtl->BackendXLogInsertSlots[MaxBackends];
! 	}
! 	else
! 		myslot = &XLogCtl->BackendXLogInsertSlots[MyBackendId];
  
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, forcePageWrites,
! 								   &PrevRecord, &StartPos, &EndPos, myslot))
! 	{
! 		if (MyBackendId == InvalidBackendId)
! 			LWLockRelease(WALAuxSlotLock);
! 		return InvalidXLogRecPtr;
! 	}
  
! 	/*
! 	 * Got an insertion location! Now that we know the prev-link, we can
! 	 * finish computing the record's CRC
! 	 */
! 	rechdr->xl_prev = PrevRecord;
! 	COMP_CRC32(rdata_crc, ((char *) rechdr) + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	rechdr->xl_crc = rdata_crc;
  
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 	freespace = XLOG_BLCKSZ - CurrPos.xrecoff % XLOG_BLCKSZ;
  
! 	/* Copy the record header and data */
! 	record = (XLogRecord *) currpos;
  
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
! 
! 	tot_left = write_len;
! 	while (rdata != NULL)
! 	{
! 		while (rdata->len > freespace)
  		{
! 			/*
! 			 * Write what fits on this page, then write the continuation
! 			 * record, and continue.
! 			 */
! 			XLogContRecord *contrecord;
  
! 			memcpy(currpos, rdata->data, freespace);
! 			rdata->data += freespace;
! 			rdata->len -= freespace;
! 			tot_left -= freespace;
  
! 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
  
! 			/*
! 			 * Make sure the memcpy is visible to others before we claim
! 			 * it to be done. It's important to update CurrPos before calling
! 			 * GetXLogBuffer(), because GetXLogBuffer() might need to wait
! 			 * for some insertions to finish so that it can write out a
! 			 * buffer to make room for the new page. Updating CurrPos before
! 			 * waiting for a new buffer ensures that we don't deadlock with
! 			 * ourselves if we run out of clean buffers.
! 			 */
! 			pg_write_barrier();
! 			myslot->CurrPos = CurrPos;
  
! 			currpos = GetXLogBuffer(CurrPos);
  
! 			contrecord = (XLogContRecord *) currpos;
! 			contrecord->xl_rem_len = tot_left;
! 
! 			currpos += SizeOfXLogContRecord;
! 			CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 			freespace = XLOG_BLCKSZ - CurrPos.xrecoff % XLOG_BLCKSZ;
  		}
! 
! 		memcpy(currpos, rdata->data, rdata->len);
! 		currpos += rdata->len;
! 		CurrPos.xrecoff += rdata->len;
! 		freespace -= rdata->len;
! 		tot_left -= rdata->len;
! 
! 		rdata = rdata->next;
  	}
+ 	Assert(tot_left == 0);
+ 
+ 	CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
+ 	Assert(XLByteEQ(CurrPos, EndPos));
+ 
+ 	/*
+ 	 * Done! Clear CurrPos in our slot to let others know that we're finished,
+ 	 * but first make sure the changes we made to the WAL pages are visible
+ 	 * to everyone.
+ 	 */
+ 	pg_write_barrier();
+ 	myslot->CurrPos = InvalidXLogRecPtr;
+ 	if (MyBackendId == InvalidBackendId)
+ 		LWLockRelease(WALAuxSlotLock);
  
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
! 	/* update shared LogwrtRqst.Write, if we crossed page boundary */
! 	if (StartPos.xrecoff / XLOG_BLCKSZ != EndPos.xrecoff / XLOG_BLCKSZ)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
  		/* update local result copy while I have the chance */
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	return EndPos;
! }
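
To make the slot handshake easier to follow, here is a minimal sketch (not
part of the patch) of how the pg_write_barrier() in PerformXLogInsert pairs
with the pg_read_barrier() in WaitXLogInsertionsToFinish further down; all
the names come from the patch itself:

	/* writer side (PerformXLogInsert): publish the data, then the progress */
	memcpy(currpos, rdata->data, rdata->len);	/* fill shared WAL page */
	pg_write_barrier();				/* page bytes visible before CurrPos */
	myslot->CurrPos = CurrPos;		/* advertise how far we've gotten */

	/* reader side (WaitXLogInsertionsToFinish): the matching barrier */
	slotptr = slot->CurrPos;		/* see how far the writer has gotten */
	pg_read_barrier();				/* CurrPos read before page bytes */
	/* WAL pages below 'slotptr' are now safe to read and write out */
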
  
  
! /*
!  * Reserves the right amount of space for a record of the given size from
!  * the WAL. *StartPos_p is set to the beginning of the reserved section,
!  * *EndPos_p to its end, and *PrevRecord_p to the beginning of the
!  * previous record, for use as the prev-link in the record header.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position,
!  * to let others know that we're busy inserting to the reserved area. The
!  * caller must clear it when the insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be
!  * serialized across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in PerformXLogInsert,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool forcePageWrites,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  volatile BackendXLogInsertSlot *myslot)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	int			sizeleft;
! 
! 	sizeleft = size;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		Insert->forcePageWrites != forcePageWrites)
! 	{
! 		/*
! 		 * Oops, forcePageWrites was just turned on, or a checkpoint
! 		 * just happened. Loop back to the beginning, because we might have
! 		 * to include more full-page images in the record
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		return false;
! 	}
! 
! 	/*
! 	 * Now reserve the right amount of space from the WAL for our record.
! 	 */
! 	ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 	}
! 	StartPos = ptr;
! 
! 	/*
! 	 * Set our slot's CurrPos to the starting position, to let others know
! 	 * that we're busy inserting to this area.
! 	 */
! 	myslot->CurrPos = StartPos;
! 
! 	while (freespace < sizeleft)
! 	{
! 		/* fill this page, and continue on next page */
! 		sizeleft -= freespace;
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 
! 		/* account for continuation record header */
! 		ptr.xrecoff += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(ptr);
! 	}
! 	/* the rest fits on this page */
! 	ptr.xrecoff += sizeleft;
! 	sizeleft = 0;
! 
! 	/* Align the end position, so that the next record starts aligned */
! 	ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(DEBUG2, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 
! 	return true;
! }
! 
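
The space calculation in the loop above is easier to see in isolation. The
following is a sketch only, not part of the patch: it flattens XLogRecPtr to
a plain 32-bit offset, assumes a short page header on every page (the real
code switches to the long header at segment boundaries), and omits the
initial bump to the next page when less than a record header fits on the
current one:

	static uint32
	reservation_end(uint32 start, int size)
	{
		uint32		ptr = start;
		int			freespace = XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;

		while (size > freespace)
		{
			/* fill the rest of this page, then skip over the next
			 * page's header and the continuation-record header */
			size -= freespace;
			ptr += freespace + SizeOfXLogShortPHD + SizeOfXLogContRecord;
			freespace = XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
		}
		ptr += size;			/* the rest fits on this page */
		return MAXALIGN(ptr);	/* next record must start aligned */
	}
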
! /*
!  * Get a pointer to the right location in the WAL buffer corresponding to a
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might also
!  * require evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto a
!  * BackendXLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
!  * AdvanceXLInsertBuffer) and retry if we see anything other than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that *is* ahead of the page we're looking for. So
! 	 * don't PANIC on that, until we've verified the value while holding the
! 	 * lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(false, ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X", ptr.xlogid, ptr.xrecoff);
! 		}
! 	}
! 
! 	/*
! 	 * Found the buffer holding this page. Return a pointer to the right
! 	 * offset within the page.
! 	 */
! 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
! 		ptr.xrecoff % XLOG_BLCKSZ;
! }
! 
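
XLogRecPtrToBufIdx() itself is not visible in this excerpt. The property the
code above depends on is only that each WAL page has exactly one buffer it
may occupy, computable from the pointer alone. A hypothetical definition,
for illustration only, ignoring xlogid wraparound (so it is valid only if
XLogFileSize is a multiple of XLOGbuffers * XLOG_BLCKSZ):

	/* hypothetical; the real macro in the patch may well differ */
	static int
	XLogRecPtrToBufIdx(XLogRecPtr ptr)
	{
		return (ptr.xrecoff / XLOG_BLCKSZ) % XLOGbuffers;
	}

An end pointer, such as the one the Assert in AdvanceXLInsertBuffer feeds to
XLogRecEndPtrToBufIdx, would need the boundary-belongs-to-the-previous-page
adjustment (back up one byte) before the same calculation.
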
! /*
!  * Advance an XLogRecPtr to the first valid insertion location on the next
!  * page, right after the page header. An XLogRecPtr pointing to a boundary,
!  * ie. the first byte of a page, is taken to belong to the previous page.
!  */
! static XLogRecPtr
! AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
! {
! 	int			freespace;
! 
! 	freespace = INSERT_FREESPACE(ptr);
! 	XLByteAdvance(ptr, freespace);
! 	if (ptr.xrecoff % XLogSegSize == 0)
! 		ptr.xrecoff += SizeOfXLogLongPHD;
! 	else
! 		ptr.xrecoff += SizeOfXLogShortPHD;
! 
! 	return ptr;
! }
! 
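
A worked example, assuming XLOG_BLCKSZ = 8192 and INSERT_FREESPACE(ptr)
being XLOG_BLCKSZ - ptr.xrecoff % XLOG_BLCKSZ: a pointer at xrecoff = 20000
has freespace 8192 - 3616 = 4576, so XLByteAdvance lands it on 24576; that
is a page boundary but not a segment boundary, so the result is
24576 + SizeOfXLogShortPHD. A pointer sitting exactly on a page boundary has
a full page of "freespace", which is what makes it belong to the previous
page: it advances a whole page rather than zero bytes.
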
! /*
!  * Wait for any insertions < upto to finish.
!  *
!  * Returns a value >= upto, which indicates the oldest in-progress insertion
!  * that we saw in the array. All insertions up to that point are guaranteed to
!  * be finished. Note that it is just a conservative guess; there are race
!  * conditions where we return a bogus, too-low value. If you care about the
!  * return value, you must get the current Insert->CurrPos value *before*
!  * calling this function, and pass that as the CurrPos argument.
!  */
! static XLogRecPtr
! WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
! {
! 	int			i;
! 	int			nbusyslots = 0;
! 	/* FIXME: it's not safe to palloc here, we would PANIC on OOM */
! 	int		   *busyslots = palloc0(sizeof(int) * NumXLogInsertSlots);
! 	int			cycles = 0;
! 
! 	/*
! 	 * Get a list of backend slots that are still inserting to a point earlier
! 	 * than 'upto'.
! 	 *
! 	 * This is a bit sloppy because we don't do any locking here. A slot's
! 	 * CurrPos that we read might get split if 8-byte loads are not atomic,
! 	 * but that's harmless. All slots with values <= upto, which we really do
! 	 * have to wait for, must be in the array before this function is called.
! 	 * That's because the 'upto' value must've been obtained by reading the
! 	 * current insert position, either directly or indirectly. It can never
! 	 * be > Insert->CurrPos. So we shouldn't miss anything that we genuinely
! 	 * need to wait for. OTOH, if someone is just storing an XLogRecPtr in a
! 	 * slot while we read it, we might incorrectly think that we have to wait
! 	 * for it. But that's OK, because in the loop that follows, we'll retry
! 	 * and see that it's actually > upto.
! 	 *
! 	 * XXX: that's bogus. You might see a too-new value if a slot's CurrPos is
! 	 * advanced at the same instant.
! 	 */
! 	for (i = 0; i < NumXLogInsertSlots; i++)
! 	{
! 		volatile BackendXLogInsertSlot *slot = &XLogCtl->BackendXLogInsertSlots[i];
! 		XLogRecPtr slotptr = slot->CurrPos;
! 
! 		if (XLogRecPtrIsInvalid(slotptr))
! 			continue;
! 
! 		if (XLByteLT(slotptr, upto))
! 			busyslots[nbusyslots++] = i;
! 		else if (XLByteLT(slotptr, CurrPos))
! 			CurrPos = slotptr;
! 	}
! 
! 	/*
! 	 * Busy-wait until the insertion is done.
! 	 *
! 	 * TODO: This needs to be replaced with something smarter. I don't think
! 	 * it's possible that we'd have to wait for I/O here, though. As the code
! 	 * stands, the caller never passes an 'upto' pointer that points to an
! 	 * uninitialized page. It always points to an already inserted record, in
! 	 * which case the page must already be initialized in the WAL buffer
! 	 * cache. Nevertheless, busy-waiting is not good.
! 	 */
! 	while (nbusyslots > 0)
! 	{
! 		pg_read_barrier();
! 		for (i = 0; i < nbusyslots; i++)
! 		{
! 			volatile BackendXLogInsertSlot *slot = &XLogCtl->BackendXLogInsertSlots[busyslots[i]];
! 			XLogRecPtr slotptr = slot->CurrPos;
! 
! 			if (XLogRecPtrIsInvalid(slotptr) || !XLByteLT(slotptr, upto))
! 			{
! 				if (nbusyslots > 1)
! 				{
! 					busyslots[i] = busyslots[nbusyslots - 1];
! 					i--;
! 				}
! 				nbusyslots--;
! 			}
! 		}
! 
! 		/* a debugging aid */
! 		if (++cycles == 1000000)
! 			elog(LOG, "stuck waiting upto %X/%X", upto.xlogid, upto.xrecoff);
! 	}
! 	pfree(busyslots);
! 
! 	return CurrPos;
  }
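
For the return value to mean anything, the caller must sample the shared
insert position before scanning the slots, as noted above. Condensed from
the XLogFlush() changes later in this patch, the calling pattern is:

	/* declarations as in XLogFlush() */
	SpinLockAcquire(&Insert->insertpos_lck);
	insertpos = Insert->CurrPos;	/* sample *before* waiting */
	SpinLockRelease(&Insert->insertpos_lck);

	insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);

	/* everything below 'insertpos' is fully copied into the buffers */
	WriteRqst.Write = insertpos;
	WriteRqst.Flush = insertpos;
	XLogWrite(WriteRqst, false, false);
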
  
  /*
***************
*** 1458,1486 **** XLogArchiveCleanup(const char *xlog)
   * If new_segment is TRUE then we set up the next buffer page as the first
   * page of the next xlog segment file, possibly but not usually the next
   * consecutive file page.
-  *
-  * The global LogwrtRqst.Write pointer needs to be advanced to include the
-  * just-filled page.  If we can do this for free (without an extra lock),
-  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
-  * request update still needs to be done, FALSE if we did it internally.
-  *
-  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1732,1763 ----
   * If new_segment is TRUE then we set up the next buffer page as the first
   * page of the next xlog segment file, possibly but not usually the next
   * consecutive file page.
   */
! static void
! AdvanceXLInsertBuffer(bool new_segment, XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
+ 	XLogRecPtr	EvictedPtr;
+ 
+ 	Assert(!new_segment); /* FIXME: not implemented */
+ 
+ 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
  
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1488,1499 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
! 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  	{
  		/* nope, got work to do... */
  		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 1765,1779 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
! 
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
  	{
  		/* nope, got work to do... */
  		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1502,1534 **** AdvanceXLInsertBuffer(bool new_segment)
  
  			SpinLockAcquire(&xlogctl->info_lck);
  			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
  				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
! 		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
--- 1782,1824 ----
  
  			SpinLockAcquire(&xlogctl->info_lck);
  			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
+ 			{
+ 				Assert(XLByteLE(FinishedPageRqstPtr, XLogCtl->Insert.CurrPos));
  				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
+ 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * If we just want to pre-initialize as much as we can without
! 			 * flushing, give up now.
! 			 */
! 			if (opportunistic)
! 				break;
! 
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
***************
*** 1537,1560 **** AdvanceXLInsertBuffer(bool new_segment)
  				WriteRqst.Flush.xrecoff = 0;
  				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
  
  	if (new_segment)
  	{
  		/* force it to a segment start point */
  		NewPageEndPtr.xrecoff += XLogSegSize - 1;
  		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
  	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 1827,1856 ----
  				WriteRqst.Flush.xrecoff = 0;
  				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
+ 	EvictedPtr = OldPageRqstPtr;
+ 
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
+ #ifdef BROKEN
  	if (new_segment)
  	{
  		/* force it to a segment start point */
  		NewPageEndPtr.xrecoff += XLogSegSize - 1;
  		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
  	}
+ #endif
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1564,1577 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
- 	Insert->curridx = nextidx;
- 	Insert->currpage = NewPage;
- 
- 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
- 
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
--- 1860,1871 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
+ 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
+ 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
+ 
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
***************
*** 1614,1624 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 1908,1938 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * make sure the xlblocks update becomes visible to others before the
! 	 * curridx update.
! 	 */
! 	pg_write_barrier();
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 
! 	Assert(opportunistic || XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]));
! 	LWLockRelease(WALBufMappingLock);
! 
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(LOG, "initialized %d pages, upto %X/%X (evicted upto %X/%X) in slot %d (backend %d)",
! 			 npages,
! 			 NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff,
! 			 EvictedPtr.xlogid, EvictedPtr.xrecoff,
! 			 nextidx, MyBackendId);
! #endif
  }
  
  /*
***************
*** 1669,1675 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * only if caller specifies WriteRqst == page-end and flexible == false,
   * and there is some data to write.)
   *
!  * Must be called with WALWriteLock held.
   */
  static void
  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
--- 1983,1991 ----
   * only if caller specifies WriteRqst == page-end and flexible == false,
   * and there is some data to write.)
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) to make sure the data is ready to
!  * write.
   */
  static void
  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
***************
*** 1722,1731 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
--- 2038,2047 ----
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X (slot %d)",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff, curridx);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
***************
*** 2097,2129 **** XLogFlush(XLogRecPtr record)
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
- 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
- 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 				else
- 				{
- 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 					WriteRqstPtr.xrecoff -= freespace;
- 				}
- 				LWLockRelease(WALInsertLock);
- 				WriteRqst.Write = WriteRqstPtr;
- 				WriteRqst.Flush = WriteRqstPtr;
- 			}
- 			else
- 			{
- 				WriteRqst.Write = WriteRqstPtr;
- 				WriteRqst.Flush = record;
- 			}
  			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
--- 2413,2460 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
+ 		/* try to write/flush later additions to XLOG as well */
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
+ 
+ 		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to use LWLockConditionalAcquire, and fall back
+ 		 * to writing just up to 'record' if we couldn't get the lock. I
+ 		 * wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)		/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
  			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
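
The XXX above hints at a lock-free read. A hypothetical sketch of that idea,
with invented names, valid only on platforms where aligned 64-bit loads are
atomic:

	/* pack xlogid/xrecoff into one uint64 so a single load reads both */
	#define RecPtrToUint64(p)  (((uint64) (p).xlogid << 32) | (p).xrecoff)

	static XLogRecPtr
	ReadInsertPosUnlocked(volatile uint64 *pos)
	{
		uint64		v = *pos;	/* assumed-atomic 64-bit load */
		XLogRecPtr	result;

		result.xlogid = (uint32) (v >> 32);
		result.xrecoff = (uint32) v;
		return result;
	}
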
***************
*** 2234,2243 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
  	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
--- 2565,2576 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
  	/* now wait for the write lock */
+ 
+ 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
+ 
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
***************
*** 2248,2256 **** XLogBackgroundFlush(void)
  		WriteRqst.Flush = WriteRqstPtr;
  		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2581,2600 ----
  		WriteRqst.Flush = WriteRqstPtr;
  		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(false, InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5044,5049 **** XLOGShmemSize(void)
--- 5388,5396 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(BackendXLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5059,5064 **** XLOGShmemInit(void)
--- 5406,5412 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5084,5089 **** XLOGShmemInit(void)
--- 5432,5445 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize per-backend buffers */
+ 	XLogCtl->BackendXLogInsertSlots = (BackendXLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogCtl->BackendXLogInsertSlots[i].CurrPos = InvalidXLogRecPtr;
+ 	}
+ 	allocptr += sizeof(BackendXLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5098,5108 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5454,5465 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 5942,5947 **** StartupXLOG(void)
--- 6299,6305 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6697,6704 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7055,7066 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6706,6731 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7068,7091 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->Write.LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6737,6743 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7097,7103 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7231,7239 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
  XLogRecPtr
  GetInsertRecPtr(void)
--- 7591,7603 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
+  *
+  * XXX: now that there can be several insertions "in-flight", what should
+  * this return? The position a new insertion would go to? Or the oldest
+  * still in-progress insertion, perhaps?
   */
  XLogRecPtr
  GetInsertRecPtr(void)
***************
*** 7507,7512 **** CreateCheckPoint(int flags)
--- 7871,7877 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7574,7584 **** CreateCheckPoint(int flags)
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 7939,7950 ----
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
+ 
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7590,7596 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 7956,7962 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7599,7613 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 7965,7976 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7633,7646 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 7996,8005 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = XLOG_BLCKSZ - curInsert.xrecoff % XLOG_BLCKSZ;
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7666,7672 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8025,8031 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7686,7692 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8045,8051 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 7798,7804 **** CreateCheckPoint(int flags)
  	 */
  	if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
  		ereport(PANIC,
! 				(errmsg("concurrent transaction log activity while database system is shutting down")));
  
  	/*
  	 * Select point at which we can truncate the log, which we base on the
--- 8157,8165 ----
  	 */
  	if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
  		ereport(PANIC,
! 				(errmsg("concurrent transaction log activity while database system is shutting down (redo %X/%X, ProcLastRecPtr %X/%X)",
! 						checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
! 						ProcLastRecPtr.xlogid, ProcLastRecPtr.xrecoff)));
  
  	/*
  	 * Select point at which we can truncate the log, which we base on the
***************
*** 8053,8067 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
- 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8414,8425 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold info_lck to update it. There are no other processes
! 	 * updating Insert.RedoRecPtr, so we don't need a lock to protect that.
  	 */
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8816,8821 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9174,9180 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
***************
*** 8865,8890 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9224,9249 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 8946,8958 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9305,9317 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9034,9043 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9393,9403 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9054,9060 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9414,9420 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9067,9072 **** pg_start_backup_callback(int code, Datum arg)
--- 9427,9433 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
***************
*** 9108,9116 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 9469,9477 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9119,9134 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 9480,9495 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9330,9345 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9691,9708 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9391,9406 **** GetStandbyFlushRecPtr(void)
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(bool needlock)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	if (needlock)
! 		LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	if (needlock)
! 		LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 9754,9767 ----
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 200,206 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr(true);
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
--- 200,206 ----
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr();
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 288,294 **** extern bool XLogInsertAllowed(void);
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(bool needlock);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
--- 288,294 ----
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(void);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALAuxSlotLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#18Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#17)
Re: Moving more work outside WALInsertLock

On Fri, Dec 23, 2011 at 3:15 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 23.12.2011 10:13, Heikki Linnakangas wrote:

So, here's a WIP patch of what I've been working on.

And here's the patch I forgot to attach..

Fails regression tests for me. I found this in postmaster.log:

PANIC: could not find WAL buffer for 0/2ECA438
STATEMENT: ANALYZE onek2;
LOG: stuck waiting upto 0/3000000
LOG: server process (PID 34529) was terminated by signal 6: Aborted
DETAIL: Failed process was running: ANALYZE onek2;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#18)
1 attachment(s)
Re: Moving more work outside WALInsertLock

On 23.12.2011 15:23, Robert Haas wrote:

On Fri, Dec 23, 2011 at 3:15 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 23.12.2011 10:13, Heikki Linnakangas wrote:

So, here's a WIP patch of what I've been working on.

And here's the patch I forgot to attach..

Fails regression tests for me. I found this in postmaster.log:

PANIC: could not find WAL buffer for 0/2ECA438
STATEMENT: ANALYZE onek2;
LOG: stuck waiting upto 0/3000000
LOG: server process (PID 34529) was terminated by signal 6: Aborted
DETAIL: Failed process was running: ANALYZE onek2;

Sorry. Last-minute changes, didn't retest properly. Here's another attempt.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-2.patchtext/x-diff; name=xloginsert-scale-2.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 272,277 **** static XLogRecPtr RedoRecPtr;
--- 273,290 ----
   */
  static XLogRecPtr RedoStartLSN = {0, 0};
  
+ /*
+  * We have one of these for each backend, plus one that is shared by all
+  * auxiliary processes. WALAuxSlotLock is used to coordinate access to the
+  * shared slot.
+  */
+ typedef struct
+ {
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ } BackendXLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	(MaxBackends + 1)
+ 
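
/*
 * Illustration only (hypothetical helper, not part of this patch): the
 * slot array would be sized into shared memory with the existing shmem
 * arithmetic helpers, roughly like this:
 */
static Size
XLogInsertSlotShmemSize(void)
{
	/* one slot per backend, plus the shared auxiliary-process slot */
	return mul_size(NumXLogInsertSlots, sizeof(BackendXLogInsertSlot));
}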
  /*----------
   * Shared-memory data structures for XLOG control
   *
***************
*** 282,292 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
   * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
   *		XLogCtl->LogwrtResult is protected by info_lck
   *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
-  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
   * One must hold the associated lock to read or write any of these, but
   * of course no lock is needed to read/write the unshared LogwrtResult.
   *
--- 295,304 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually two shared-memory copies of
   * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
   *		XLogCtl->LogwrtResult is protected by info_lck
   *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
   * One must hold the associated lock to read or write any of these, but
   * of course no lock is needed to read/write the unshared LogwrtResult.
   *
***************
*** 296,307 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * is that it can be examined/modified by code that already holds WALWriteLock
   * without needing to grab info_lck as well.
   *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 308,315 ----
   * is that it can be examined/modified by code that already holds WALWriteLock
   * without needing to grab info_lck as well.
   *
!  * The unshared LogwrtResult may lag behind either or both of these, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 311,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 319,333 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 326,331 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 339,392 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each backend has its own "XLog insertion
+  * slot", which is used to indicate the position the backend is writing to.
+  * The slot is marked as in-use in step 1, while holding insertpos_lck, by
+  * setting the position field in the slot. When the backend is finished with
+  * the insertion, it can clear its slot without a lock.
+  *
+  * Before 9.2, WAL insertion was serialized on one big lock, so that once
+  * you finished inserting your record you knew that all the previous records
+  * were inserted too. That is no longer true: there can be insertions to
+  * earlier positions still in-progress when your insertion finishes. To wait
+  * for them to finish, use WaitXLogInsertionsToFinish(). It polls (FIXME:
+  * busy-waiting is bad) the array of per-backend XLog insertion slots until
+  * it sees that all of the insertions to earlier locations have finished.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for a backend to finish
+  * (or at least advance to next uninitialized page), while you're holding
+  * WALWriteLock. That would be bad, because the backend you're waiting for
+  * might need to acquire WALWriteLock, too, to evict an old buffer, so you'd
+  * get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition. It can't get stuck, because an insertion to a
+  * WAL page that's already initialized in cache can always proceed without
+  * waiting on a lock. However, if the page has *just* been initialized, the
+  * insertion might still briefly acquire WALBufMappingLock to observe that
+  * fact.
+  *
   *----------
   */
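
/*
 * In code form, the two steps described above look roughly like this
 * (a sketch using the helper names introduced later in this patch;
 * error handling and the actual copy loop are omitted):
 */
	/* Step 1: serialized, under insertpos_lck */
	if (!ReserveXLogInsertLocation(tot_len, forcePageWrites,
								   &PrevRecord, &StartPos, &EndPos, myslot))
		goto begin;				/* recompute full-page images and retry */

	/* Step 2: can run in parallel with other inserters */
	currpos = GetXLogBuffer(StartPos);
	/* ... copy the record into currpos, page by page ... */

	pg_write_barrier();			/* publish the copied data first... */
	myslot->CurrPos = InvalidXLogRecPtr;	/* ...then mark ourselves done */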
  
***************
*** 346,356 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 407,431 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct. */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values.
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 381,389 **** typedef struct XLogCtlWrite
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 456,466 ----
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	BackendXLogInsertSlot *BackendXLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 397,405 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 474,492 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 468,497 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
! 
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
! 
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
! 
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
  #define NextBufIdx(idx)		\
  		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
   * See discussion above.
   */
--- 555,583 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
! #define INSERT_FREESPACE(endptr)	\
! 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
  #define NextBufIdx(idx)		\
  		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
+  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
+  * would hold if it was in cache, the page containing 'recptr'.
+  *
+  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
+  * page is taken to mean the previous page.
+  */
+ #define XLogRecPtrToBufIdx(recptr)	\
+ 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
+ 
+ #define XLogRecEndPtrToBufIdx(recptr)	\
+ 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
+ 
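+ 
/*
 * "Beware multiple evaluation": INSERT_FREESPACE expands its argument
 * more than once, so the argument must not have side effects. For
 * example, a hypothetical caller written as
 *
 *		freespace = INSERT_FREESPACE(*recptr++);
 *
 * would advance recptr more than once; evaluate into a plain variable
 * first and pass that instead.
 */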
+ /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
   * See discussion above.
   */
***************
*** 614,620 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
  static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
--- 700,706 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(bool new_segment, XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
  static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
***************
*** 663,668 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 749,766 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len, XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool forcePageWrites);
+ static bool ReserveXLogInsertLocation(int reqsize, bool forcePageWrites,
+ 						  XLogRecPtr *PrevRecord, XLogRecPtr *StartPos,
+ 						  XLogRecPtr *EndPos,
+ 						  volatile BackendXLogInsertSlot *myslot);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 683,695 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
--- 781,789 ----
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	RecPtr;
  	XLogRecData *rdt;
+ 	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
***************
*** 701,709 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 795,805 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
+ 	bool		forcePageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	uint8		info_final;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 726,751 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  		return RecPtr;
  	}
  
  	/*
  	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up, and compute the CRC values for the data.  Note that the record
! 	 * header isn't added into the CRC initially since we don't know the final
! 	 * length or info bits quite yet.  Thus, the CRC will represent the CRC of
! 	 * the whole record in the order "rdata, then backup blocks, then record
! 	 * header".
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
  	 * insert lock, but it seems better to avoid doing CRC calculations while
! 	 * holding the lock.  This means we have to be careful about modifying the
! 	 * rdata chain until we know we aren't going to loop back again.  The only
! 	 * change we allow ourselves to make earlier is to set rdt->data = NULL in
! 	 * chain items we have decided we will have to back up the whole buffer
! 	 * for.  This is OK because we will certainly decide the same thing again
! 	 * for those items if we do it over; doing it here saves an extra pass
! 	 * over the chain later.
  	 */
  begin:;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
--- 822,850 ----
  		return RecPtr;
  	}
  
+ 	/* TODO */
+ 	if (isLogSwitch)
+ 	{
+ 		elog(LOG, "log switch not implemented");
+ 		return InvalidXLogRecPtr;
+ 	}
+ 
  	/*
  	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
  	 * insert lock, but it seems better to avoid doing CRC calculations while
! 	 * holding the lock.
! 	 *
! 	 * Later we will also add entries for the backup blocks to the rdata
! 	 * chain, so that they don't need any special treatment in the critical
! 	 * section where the chunks are copied into the WAL buffers. The last
! 	 * regular entry is remembered in rdt_lastnormal, so that the chain can
! 	 * be restored to its original state if we have to loop back.
  	 */
  begin:;
+ 	info_final = info;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
***************
*** 758,766 **** begin:;
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
- 	INIT_CRC32(rdata_crc);
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 857,865 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	forcePageWrites = Insert->forcePageWrites;
! 	doPageWrites = fullPageWrites || forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
  	{
***************
*** 768,774 **** begin:;
  		{
  			/* Simple data, just include it */
  			len += rdt->len;
- 			COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  		}
  		else
  		{
--- 867,872 ----
***************
*** 779,790 **** begin:;
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
  						rdt->data = NULL;
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
--- 877,888 ----
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
+ 					{
  						rdt->data = NULL;
+ 						rdt->len = 0;
+ 					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
***************
*** 796,807 **** begin:;
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
  					}
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  			}
--- 894,903 ----
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
+ 						rdt->len = 0;
  					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  			}
***************
*** 814,852 **** begin:;
  			break;
  		rdt = rdt->next;
  	}
! 
! 	/*
! 	 * Now add the backup block headers and data into the CRC
! 	 */
! 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 	{
! 		if (dtbuf_bkp[i])
! 		{
! 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
! 			char	   *page;
! 
! 			COMP_CRC32(rdata_crc,
! 					   (char *) bkpb,
! 					   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb->hole_length == 0)
! 			{
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   BLCKSZ);
! 			}
! 			else
! 			{
! 				/* must skip the hole */
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   bkpb->hole_offset);
! 				COMP_CRC32(rdata_crc,
! 						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			}
! 		}
! 	}
  
  	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
--- 910,917 ----
  			break;
  		rdt = rdt->next;
  	}
! 	/* Remember that this was the last regular rdata entry */
! 	rdt_lastnormal = rdt;
  
  	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
***************
*** 858,922 **** begin:;
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
- 	/*
- 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
- 	 * back and recompute everything.  This can only happen just after a
- 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
- 	 *
- 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
- 	 * affect the contents of the XLOG record, so we'll update our local copy
- 	 * but not force a recomputation.
- 	 */
- 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
- 	{
- 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
- 		RedoRecPtr = Insert->RedoRecPtr;
- 
- 		if (doPageWrites)
- 		{
- 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- 			{
- 				if (dtbuf[i] == InvalidBuffer)
- 					continue;
- 				if (dtbuf_bkp[i] == false &&
- 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
- 				{
- 					/*
- 					 * Oops, this buffer now needs to be backed up, but we
- 					 * didn't think so above.  Start over.
- 					 */
- 					LWLockRelease(WALInsertLock);
- 					END_CRIT_SECTION();
- 					goto begin;
- 				}
- 			}
- 		}
- 	}
- 
- 	/*
- 	 * Also check to see if forcePageWrites was just turned on; if we weren't
- 	 * already doing full-page writes then go back and recompute. (If it was
- 	 * just turned off, we could recompute the record without full pages, but
- 	 * we choose not to bother.)
- 	 */
- 	if (Insert->forcePageWrites && !doPageWrites)
- 	{
- 		/* Oops, must redo it with full-page data */
- 		LWLockRelease(WALInsertLock);
- 		END_CRIT_SECTION();
- 		goto begin;
- 	}
- 
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  Note that we have
! 	 * now irrevocably changed the input rdata chain.  At the exit of this
! 	 * loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
--- 923,936 ----
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  We have now
! 	 * modified the original rdata chain, but we remembered the last regular
! 	 * entry in rdt_lastnormal, so we can undo this if we have to loop back
! 	 * to the beginning.
! 	 *
! 	 * At the exit of this loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
***************
*** 931,943 **** begin:;
  		if (!dtbuf_bkp[i])
  			continue;
  
! 		info |= XLR_SET_BKP_BLOCK(i);
  
  		bkpb = &(dtbuf_xlg[i]);
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = rdt->next;
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
--- 945,957 ----
  		if (!dtbuf_bkp[i])
  			continue;
  
! 		info_final |= XLR_SET_BKP_BLOCK(i);
  
  		bkpb = &(dtbuf_xlg[i]);
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = &(dtbuf_rdt1[i]);
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
***************
*** 971,1044 **** begin:;
  	}
  
  	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
  	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
! 	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
! 
! 		END_CRIT_SECTION();
! 
! 		return RecPtr;
! 	}
! 
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 985,1013 ----
  	}
  
  	/*
! 	 * Calculate CRC of the data, including all the backup blocks
! 	 *
! 	 * Note that the record header isn't added into the CRC initially since
! 	 * we don't know the prev-link yet.  Thus, the CRC will represent the CRC
! 	 * of the whole record in the order "rdata, then backup blocks, then
! 	 * record header".
  	 */
! 	INIT_CRC32(rdata_crc);
! 	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
! 		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	rechdr.xl_prev = InvalidXLogRecPtr; /* TO BE DETERMINED */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info_final;
! 	rechdr.xl_rmid = rmid;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1059,1231 **** begin:;
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
! 
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
! 		{
! 			if (rdata->len > freespace)
! 			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
! 				rdata->data += freespace;
! 				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
! 			}
! 		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
- 	/* Ensure next record will be properly aligned */
- 	Insert->currpos = (char *) Insert->currpage +
- 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
- 	freespace = INSERT_FREESPACE(Insert);
- 
  	/*
  	 * The recptr I return is the beginning of the *next* record. This will be
  	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
! 		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
! 		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
! 		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
! 		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
! 	}
! 	else
! 	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
! 		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
! 		}
! 		else
! 		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
! 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
  		/* update local result copy while I have the chance */
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
- 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1028,1505 ----
  	}
  #endif
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, &rechdr, rdata, rdata_crc,
! 							forcePageWrites);
! 	END_CRIT_SECTION();
  
! 	if (XLogRecPtrIsInvalid(RecPtr))
! 	{
! 		/*
! 		 * Oops, have to retry. Unlink the backup blocks from the chain,
! 		 * restoring it to its original state.
! 		 */
! 		rdt_lastnormal->next = NULL;
  
! 		goto begin;
  	}
  
  	/*
  	 * The recptr I return is the beginning of the *next* record. This will be
  	 * stored as LSN for changed data pages...
  	 */
! 	return RecPtr;
! }
! 
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here,
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns pointer to end of inserted record like XLogInsert().
!  * If RedoRecPtr or forcePageWrites has changed, returns InvalidXLogRecPtr,
!  * and the caller must recalculate full-page images and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool forcePageWrites)
! {
! 	volatile BackendXLogInsertSlot *myslot;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			tot_left;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
  
  	/*
! 	 * Our fast insertion method requires each process to have its own
! 	 * "slot", to tell others that it's still busy doing the insertion. Each
! 	 * regular backend has a dedicated slot, but auxiliary processes share
! 	 * one extra slot. Aux processes don't write a lot of WAL, so sharing
! 	 * one is fine. WALAuxSlotLock is used to coordinate access to the slot
! 	 * between aux processes.
  	 */
! 	if (MyBackendId == InvalidBackendId)
  	{
! 		LWLockAcquire(WALAuxSlotLock, LW_EXCLUSIVE);
! 		myslot = &XLogCtl->BackendXLogInsertSlots[MaxBackends];
! 	}
! 	else
! 		myslot = &XLogCtl->BackendXLogInsertSlots[MyBackendId];
  
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, forcePageWrites,
! 								   &PrevRecord, &StartPos, &EndPos, myslot))
! 	{
! 		if (MyBackendId == InvalidBackendId)
! 			LWLockRelease(WALAuxSlotLock);
! 		return InvalidXLogRecPtr;
! 	}
  
! 	/*
! 	 * Got an insertion location! Now that we know the prev-link, we can
! 	 * finish computing the record's CRC
! 	 */
! 	rechdr->xl_prev = PrevRecord;
! 	COMP_CRC32(rdata_crc, ((char *) rechdr) + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	rechdr->xl_crc = rdata_crc;
  
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 	freespace = XLOG_BLCKSZ - CurrPos.xrecoff % XLOG_BLCKSZ;
  
! 	/* Copy the record header and data */
! 	record = (XLogRecord *) currpos;
  
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
! 
! 	tot_left = write_len;
! 	while (rdata != NULL)
! 	{
! 		while (rdata->len > freespace)
  		{
! 			/*
! 			 * Write what fits on this page, then write the continuation
! 			 * record, and continue.
! 			 */
! 			XLogContRecord *contrecord;
  
! 			memcpy(currpos, rdata->data, freespace);
! 			rdata->data += freespace;
! 			rdata->len -= freespace;
! 			tot_left -= freespace;
  
! 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
  
! 			/*
! 			 * Make sure the memcpy is visible to others before we claim
! 			 * it to be done. It's important to update CurrPos before calling
! 			 * GetXLogBuffer(), because GetXLogBuffer() might need to wait
! 			 * for some insertions to finish so that it can write out a
! 			 * buffer to make room for the new page. Updating CurrPos before
! 			 * waiting for a new buffer ensures that we don't deadlock with
! 			 * ourselves if we run out of clean buffers.
! 			 */
! 			pg_write_barrier();
! 			myslot->CurrPos = CurrPos;
  
! 			currpos = GetXLogBuffer(CurrPos);
  
! 			contrecord = (XLogContRecord *) currpos;
! 			contrecord->xl_rem_len = tot_len - tot_left;
! 
! 			currpos += SizeOfXLogContRecord;
! 			CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 			freespace = XLOG_BLCKSZ - CurrPos.xrecoff % XLOG_BLCKSZ;
  		}
! 
! 		memcpy(currpos, rdata->data, rdata->len);
! 		currpos += rdata->len;
! 		CurrPos.xrecoff += rdata->len;
! 		freespace -= rdata->len;
! 		tot_left -= rdata->len;
! 
! 		rdata = rdata->next;
  	}
+ 	Assert(tot_left == 0);
+ 
+ 	CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
+ 	Assert(XLByteEQ(CurrPos, EndPos));
+ 
+ 	/*
+ 	 * Done! Clear CurrPos in our slot to let others know that we're finished,
+ 	 * but first make sure the changes we made to the WAL pages are visible
+ 	 * to everyone.
+ 	 */
+ 	pg_write_barrier();
+ 	myslot->CurrPos = InvalidXLogRecPtr;
+ 	if (MyBackendId == InvalidBackendId)
+ 		LWLockRelease(WALAuxSlotLock);
  
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
! 	/* update shared LogwrtRqst.Write, if we crossed page boundary */
! 	if (StartPos.xrecoff / XLOG_BLCKSZ != EndPos.xrecoff / XLOG_BLCKSZ)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
  		/* update local result copy while I have the chance */
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	return EndPos;
! }
  
  
! /*
!  * Reserves the right amount of space for a record of the given size from
!  * the WAL. *StartPos_p is set to the beginning of the reserved section,
!  * *EndPos_p to its end, and *PrevRecord_p is set to the beginning of the
!  * previous record, for use as the prev-link in the record header.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position,
!  * to let others know that we're busy inserting to the reserved area. The
!  * caller must clear it when the insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be
!  * serialized across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in PerformXLogInsert,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool forcePageWrites,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  volatile BackendXLogInsertSlot *myslot)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	int			sizeleft;
! 
! 	sizeleft = size;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		Insert->forcePageWrites != forcePageWrites)
! 	{
! 		/*
! 		 * Oops, forcePageWrites was just turned on, or a checkpoint
! 		 * just happened. Loop back to the beginning, because we might have
! 		 * to include more full-page images in the record.
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		return false;
! 	}
! 
! 	/*
! 	 * Now reserve the right amount of space from the WAL for our record.
! 	 */
! 	ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 	}
! 	StartPos = ptr;
! 
! 	/*
! 	 * Set our slot's CurrPos to the starting position, to let others know
! 	 * that we're busy inserting to this area.
! 	 */
! 	myslot->CurrPos = StartPos;
! 
! 	while (freespace < sizeleft)
! 	{
! 		/* fill this page, and continue on next page */
! 		sizeleft -= freespace;
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 
! 		/* account for continuation record header */
! 		ptr.xrecoff += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(ptr);
! 	}
! 	/* the rest fits on this page */
! 	ptr.xrecoff += sizeleft;
! 	sizeleft = 0;
! 
! 	/* Align the end position, so that the next record starts aligned */
! 	ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(DEBUG2, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 
! 	return true;
! }
! 
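/*
 * Illustration only (hypothetical helper, not in the patch): the loop
 * above consumes WAL space according to the following arithmetic,
 * ignoring segment boundaries, which use the long page header instead:
 */
static uint32
SketchBytesConsumed(uint32 freespace, uint32 size)
{
	uint32		consumed = 0;

	while (size > freespace)
	{
		/* fill this page, then pay for the next page's headers */
		size -= freespace;
		consumed += freespace + SizeOfXLogShortPHD + SizeOfXLogContRecord;
		freespace = XLOG_BLCKSZ - SizeOfXLogShortPHD - SizeOfXLogContRecord;
	}
	/* the rest fits on this page; keep the next record MAXALIGNed */
	return MAXALIGN(consumed + size);
}
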
! /*
!  * Get a pointer to the right location in the WAL buffer corresponding to a
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might also
!  * require evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto a
!  * BackendXLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
!  * and read a bogus value. That's OK: we'll grab the mapping lock (in
!  * AdvanceXLInsertBuffer) and retry if we see anything other than the page
! 	 * might see a value that *is* ahead of the page we're looking for. So
! 	 * don't PANIC on that, until we've verified the value while holding the
! 	 * lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(false, ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X", ptr.xlogid, ptr.xrecoff);
! 		}
! 	}
! 
! 	/*
! 	 * Found the buffer holding this page. Return a pointer to the right
! 	 * offset within the page.
! 	 */
! 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
! 		ptr.xrecoff % XLOG_BLCKSZ;
! }
! 
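/*
 * The unlocked read followed by a locked recheck above is a standard
 * double-checked pattern. Stripped of the XLog specifics it reads
 * (hypothetical names, sketch only):
 *
 *	endptr = shared_value;			(unlocked, possibly torn read)
 *	if (!acceptable(endptr))
 *	{
 *		fix up shared_value under WALBufMappingLock;
 *		endptr = shared_value;		(now coherent)
 *		if (!acceptable(endptr))
 *			elog(PANIC, ...);		(genuinely inconsistent)
 *	}
 */
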
! /*
!  * Advance an XLogRecPtr to the first valid insertion location on the next
!  * page, right after the page header. An XLogRecPtr pointing to a boundary,
!  * ie. the first byte of a page, is taken to belong to the previous page.
!  */
! static XLogRecPtr
! AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
! {
! 	int			freespace;
! 
! 	freespace = INSERT_FREESPACE(ptr);
! 	XLByteAdvance(ptr, freespace);
! 	if (ptr.xrecoff % XLogSegSize == 0)
! 		ptr.xrecoff += SizeOfXLogLongPHD;
! 	else
! 		ptr.xrecoff += SizeOfXLogShortPHD;
! 
! 	return ptr;
! }
! 
! /*
!  * Wait for any insertions < upto to finish.
!  *
!  * Returns a value >= upto, which indicates the oldest in-progress insertion
!  * that we saw in the array. All insertions up to that point are guaranteed to
!  * be finished. Note that it is just a conservative guess; there are race
!  * conditions where we return a bogus, too-low value. If you care about the
!  * return value, you must get the current Insert->CurrPos value *before*
!  * calling this function, and pass that as the CurrPos argument.
!  */
! static XLogRecPtr
! WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
! {
! 	int			i;
! 	int			nbusyslots = 0;
! 	/* FIXME: it's not safe to palloc here, would PANIC on OOM */
! 	int		   *busyslots = palloc0(sizeof(int) * NumXLogInsertSlots);
! 	int			cycles = 0;
! 
! 	/*
! 	 * Get a list of backend slots that are still inserting to a point earlier
! 	 * than 'upto'.
! 	 *
! 	 * This is a bit sloppy because we don't do any locking here. A slot's
! 	 * CurrPos that we read might get split if 8-byte loads are not atomic,
! 	 * but that's harmless. All slots with values <= upto, which we really do
! 	 * have to wait for, must be in the array before this function is called.
! 	 * That's because the 'upto' value must've been obtained by reading the
! 	 * current insert position, either directly or indirectly. It can never
! 	 * be > Insert->CurrPos. So we shouldn't miss anything that we genuinely
! 	 * need to wait for. OTOH, if someone is just storing an XLogRecPtr in a
! 	 * slot while we read it, we might incorrectly think that we have to wait
! 	 * for it. But that's OK, because in the loop that follows, we'll retry
! 	 * and see that it's actually > upto.
! 	 *
! 	 * XXX: that's bogus. You might see a too-new value if a slot's CurrPos is
! 	 * advanced at the same instant.
! 	 */
! 	for (i = 0; i < NumXLogInsertSlots; i++)
! 	{
! 		volatile BackendXLogInsertSlot *slot = &XLogCtl->BackendXLogInsertSlots[i];
! 		XLogRecPtr slotptr = slot->CurrPos;
! 
! 		if (XLogRecPtrIsInvalid(slotptr))
! 			continue;
! 
! 		if (XLByteLT(slotptr, upto))
! 			busyslots[nbusyslots++] = i;
! 		else if (XLByteLT(slotptr, CurrPos))
! 			CurrPos = slotptr;
! 	}
! 
! 	/*
! 	 * Busy-wait until the insertion is done.
! 	 *
! 	 * TODO: This needs to be replaced with something smarter. I don't think
! 	 * it's possible that we'd have to wait for I/O here, though. As the code
! 	 * stands, the caller never passes an 'upto' pointer that points to an
! 	 * uninitialized page. It always points to an already inserted record, in
! 	 * which case the page must already be initialized in the WAL buffer
! 	 * cache. Nevertheless, busy-waiting is not good.
! 	 */
! 	while (nbusyslots > 0)
! 	{
! 		pg_read_barrier();
! 		for (i = 0; i < nbusyslots; i++)
! 		{
! 			volatile BackendXLogInsertSlot *slot = &XLogCtl->BackendXLogInsertSlots[busyslots[i]];
! 			XLogRecPtr slotptr = slot->CurrPos;
! 
! 			if (XLogRecPtrIsInvalid(slot->CurrPos) || !XLByteLT(slotptr, upto))
! 			{
! 				if (nbusyslots > 1)
! 				{
! 					busyslots[i] = busyslots[nbusyslots - 1];
! 					i--;
! 				}
! 				nbusyslots--;
! 			}
! 		}
! 
! 		/* a debugging aid */
! 		if (++cycles == 1000000)
! 			elog(LOG, "stuck waiting up to %X/%X", upto.xlogid, upto.xrecoff);
! 	}
! 	pfree(busyslots);
! 
! 	return CurrPos;
  }
  
  /*
***************
*** 1458,1486 **** XLogArchiveCleanup(const char *xlog)
   * If new_segment is TRUE then we set up the next buffer page as the first
   * page of the next xlog segment file, possibly but not usually the next
   * consecutive file page.
-  *
-  * The global LogwrtRqst.Write pointer needs to be advanced to include the
-  * just-filled page.  If we can do this for free (without an extra lock),
-  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
-  * request update still needs to be done, FALSE if we did it internally.
-  *
-  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1732,1763 ----
   * If new_segment is TRUE then we set up the next buffer page as the first
   * page of the next xlog segment file, possibly but not usually the next
   * consecutive file page.
   */
! static void
! AdvanceXLInsertBuffer(bool new_segment, XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
+ 	XLogRecPtr	EvictedPtr;
+ 
+ 	Assert(!new_segment); /* FIXME: not implemented */
+ 
+ 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
  
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1488,1499 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 1765,1781 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1501,1534 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
! 		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
--- 1783,1819 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
***************
*** 1537,1560 **** AdvanceXLInsertBuffer(bool new_segment)
  				WriteRqst.Flush.xrecoff = 0;
  				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
  
  	if (new_segment)
  	{
  		/* force it to a segment start point */
  		NewPageEndPtr.xrecoff += XLogSegSize - 1;
  		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
  	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 1822,1851 ----
  				WriteRqst.Flush.xrecoff = 0;
  				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
+ 	EvictedPtr = OldPageRqstPtr;
+ 
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
+ #ifdef BROKEN
  	if (new_segment)
  	{
  		/* force it to a segment start point */
  		NewPageEndPtr.xrecoff += XLogSegSize - 1;
  		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
  	}
+ #endif
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1564,1577 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
- 	Insert->curridx = nextidx;
- 	Insert->currpage = NewPage;
- 
- 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
- 
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
--- 1855,1866 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
+ 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
+ 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
+ 
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
***************
*** 1614,1624 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 1903,1932 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * make sure the xlblocks update becomes visible to others before the
! 	 * curridx update.
! 	 */
! 	pg_write_barrier();
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 
! 	Assert(opportunistic || XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]));
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(LOG, "initialized %d pages, upto %X/%X (evicted upto %X/%X) in slot %d (backend %d)",
! 			 npages,
! 			 NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff,
! 			 EvictedPtr.xlogid, EvictedPtr.xrecoff,
! 			 nextidx, MyBackendId);
! #endif
  }
  
  /*
***************
*** 1669,1675 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * only if caller specifies WriteRqst == page-end and flexible == false,
   * and there is some data to write.)
   *
!  * Must be called with WALWriteLock held.
   */
  static void
  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
--- 1977,1985 ----
   * only if caller specifies WriteRqst == page-end and flexible == false,
   * and there is some data to write.)
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) to make sure the data is ready to
!  * write.
   */
  static void
  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
***************
*** 1722,1731 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
--- 2032,2041 ----
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X (slot %d)",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff, curridx);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
***************
*** 2097,2129 **** XLogFlush(XLogRecPtr record)
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
- 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
- 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 				else
- 				{
- 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 					WriteRqstPtr.xrecoff -= freespace;
- 				}
- 				LWLockRelease(WALInsertLock);
- 				WriteRqst.Write = WriteRqstPtr;
- 				WriteRqst.Flush = WriteRqstPtr;
- 			}
- 			else
- 			{
- 				WriteRqst.Write = WriteRqstPtr;
- 				WriteRqst.Flush = record;
- 			}
  			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
--- 2407,2454 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
+ 		/* try to write/flush later additions to XLOG as well */
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
+ 
+ 		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to use LWLockConditionalAcquire, and fall back
+ 		 * to writing just up to 'record' if we couldn't get the lock. I
+ 		 * wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)		/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
  			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
***************
*** 2234,2243 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
  	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
--- 2559,2570 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
  	/* now wait for the write lock */
+ 
+ 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
+ 
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
***************
*** 2248,2256 **** XLogBackgroundFlush(void)
  		WriteRqst.Flush = WriteRqstPtr;
  		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2575,2594 ----
  		WriteRqst.Flush = WriteRqstPtr;
  		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 *
+ 	 * Before we release the write lock, calculate the location of the last
+ 	 * fully written page.
+ 	 */
+ 	AdvanceXLInsertBuffer(false, InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5044,5049 **** XLOGShmemSize(void)
--- 5382,5390 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(BackendXLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5059,5064 **** XLOGShmemInit(void)
--- 5400,5406 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5084,5089 **** XLOGShmemInit(void)
--- 5426,5439 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize per-backend buffers */
+ 	XLogCtl->BackendXLogInsertSlots = (BackendXLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogCtl->BackendXLogInsertSlots[i].CurrPos = InvalidXLogRecPtr;
+ 	}
+ 	allocptr += sizeof(BackendXLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5098,5108 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5448,5459 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 5942,5947 **** StartupXLOG(void)
--- 6293,6299 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6697,6704 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7049,7060 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6706,6731 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7062,7085 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->Write.LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6737,6743 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7091,7097 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7231,7239 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
  XLogRecPtr
  GetInsertRecPtr(void)
--- 7585,7597 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
+  *
+  * XXX: now that there can be several insertions "in-flight", what should
+  * this return? The position a new insertion would go to? Or the oldest
+  * still in-progress insertion, perhaps?
   */
  XLogRecPtr
  GetInsertRecPtr(void)
***************
*** 7507,7512 **** CreateCheckPoint(int flags)
--- 7865,7871 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7574,7584 **** CreateCheckPoint(int flags)
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 7933,7944 ----
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
+ 
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7590,7596 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 7950,7956 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7599,7613 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 7959,7970 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7633,7646 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 7990,7999 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = XLOG_BLCKSZ - curInsert.xrecoff % XLOG_BLCKSZ;
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7666,7672 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8019,8025 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7686,7692 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8039,8045 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 7798,7804 **** CreateCheckPoint(int flags)
  	 */
  	if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
  		ereport(PANIC,
! 				(errmsg("concurrent transaction log activity while database system is shutting down")));
  
  	/*
  	 * Select point at which we can truncate the log, which we base on the
--- 8151,8159 ----
  	 */
  	if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
  		ereport(PANIC,
! 				(errmsg("concurrent transaction log activity while database system is shutting down (redo %X/%X, ProcLastRecPtr %X/%X)",
! 						checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
! 						ProcLastRecPtr.xlogid, ProcLastRecPtr.xrecoff)));
  
  	/*
  	 * Select point at which we can truncate the log, which we base on the
***************
*** 8053,8067 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
- 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8408,8419 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold info_lck to update it. There are no other processes
! 	 * updating Insert.RedoRecPtr, so we don't need a lock to protect that.
  	 */
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8816,8821 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9168,9174 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
***************
*** 8865,8890 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9218,9243 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 8946,8958 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9299,9311 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9034,9043 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9387,9397 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9054,9060 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9408,9414 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9067,9072 **** pg_start_backup_callback(int code, Datum arg)
--- 9421,9427 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
***************
*** 9108,9116 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 9463,9471 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9119,9134 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 9474,9489 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9330,9345 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9685,9702 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9391,9406 **** GetStandbyFlushRecPtr(void)
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(bool needlock)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	if (needlock)
! 		LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	if (needlock)
! 		LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 9748,9761 ----
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 200,206 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr(true);
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
--- 200,206 ----
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr();
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 288,294 **** extern bool XLogInsertAllowed(void);
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(bool needlock);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
--- 288,294 ----
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(void);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALAuxSlotLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#20Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#19)
Re: Moving more work outside WALInsertLock

On Sat, Dec 24, 2011 at 4:54 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

When I tested the patch, initdb failed:

$ initdb -D data
....
initializing dependencies ... PANIC: could not locate a valid checkpoint record

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#21Greg Stark
stark@mit.edu
In reply to: Tom Lane (#10)
Re: Moving more work outside WALInsertLock

On Fri, Dec 16, 2011 at 3:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

On its own that sounds dangerous, but its not. When we need to confirm
the prev link we already know what we expect it to be, so CRC-ing it
is overkill. That isn't true of any other part of the WAL record, so
the prev link is the only thing we can relax, but thats OK because we
can CRC check everything else outside of the locked section.

That isn't my idea, but I'm happy to put it on the table since I'm not shy.

I'm glad it's not your idea, because it's a bad one.

I'll take the blame or credit here.

 A large part of
the point of CRC'ing WAL records is to guard against torn-page problems
in the WAL files, and doing things like that would give up a significant
part of that protection, because there would no longer be any assurance
that the body of a WAL record had anything to do with its prev_link.

Hm, I hadn't considered the possibility of a prev_link being the only
thing left over from a torn page. As Heikki pointed out, having the CRC
and the rest of the record on opposite sides of the prev_link does
seem like convincing protection, but it's a lot more fiddly and the
dependencies are harder to explain this way.
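
For reference, the comparison-based check is cheap because the reader
already knows where the previous record ended. A minimal sketch of what
it looks like, loosely following the reader code in xlog.c ('record',
'RecPtr' and 'emode' are assumed from the surrounding code, and the
error handling is simplified):

	/*
	 * Validate the back-link by comparison instead of relying on the
	 * CRC to cover it. ReadRecPtr is where the previous record started.
	 */
	if (!XLByteEQ(record->xl_prev, ReadRecPtr))
	{
		ereport(emode,
				(errmsg("record with incorrect prev-link %X/%X at %X/%X",
						record->xl_prev.xlogid, record->xl_prev.xrecoff,
						RecPtr->xlogid, RecPtr->xrecoff)));
		return NULL;
	}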

Another thought that was discussed at the same dinner was splitting
the CRC out into a separate record that would cover all the WAL since
the last CRC record. These would only need to be emitted when there's
a WAL sync, not on every record. I think someone showed benchmarks
claiming that a significant part of the CRC overhead was the startup
and finishing cost of processing lots of small chunks. If it processed
larger blocks, it might make more efficient use of the memory
bandwidth. I'm not entirely convinced of that myself, but it bears
some experimentation.
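
To make the small-chunks point concrete, here's a standalone toy (not
from any patch; it uses zlib's crc32() instead of the pg_crc.h macros)
contrasting one CRC per record with a single CRC accumulated across all
records and emitted only at sync time:

	#include <stdio.h>
	#include <string.h>
	#include <zlib.h>

	int
	main(void)
	{
		const char *recs[] = {"rec-one", "rec-two", "rec-three"};
		uLong		crc;
		int			i;

		/* Today: every record pays its own CRC setup and finish. */
		for (i = 0; i < 3; i++)
		{
			crc = crc32(0L, Z_NULL, 0);
			crc = crc32(crc, (const Bytef *) recs[i], strlen(recs[i]));
			printf("record %d crc %08lx\n", i, crc);
		}

		/*
		 * Proposed: accumulate one CRC across everything since the last
		 * CRC record, and emit a single CRC record only at WAL sync.
		 */
		crc = crc32(0L, Z_NULL, 0);
		for (i = 0; i < 3; i++)
			crc = crc32(crc, (const Bytef *) recs[i], strlen(recs[i]));
		printf("one crc covering all three: %08lx\n", crc);
		return 0;
	}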

--
greg

#22Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#19)
Re: Moving more work outside WALInsertLock

On Fri, Dec 23, 2011 at 2:54 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

I tried this one out on Nate Boley's system. Looks pretty good.

m = master, x = with xloginsert-scale-2 patch. shared_buffers = 8GB,
maintenance_work_mem = 1GB, synchronous_commit = off,
checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. pgbench,
scale factor 100, median of five five-minute runs.

Permanent tables:

m01 tps = 631.875547 (including connections establishing)
x01 tps = 611.443724 (including connections establishing)
m08 tps = 4573.701237 (including connections establishing)
x08 tps = 4576.242333 (including connections establishing)
m16 tps = 7697.783265 (including connections establishing)
x16 tps = 7837.028713 (including connections establishing)
m24 tps = 11613.690878 (including connections establishing)
x24 tps = 12924.027954 (including connections establishing)
m32 tps = 10684.931858 (including connections establishing)
x32 tps = 14168.419730 (including connections establishing)
m80 tps = 10259.628774 (including connections establishing)
x80 tps = 13864.651340 (including connections establishing)

And, on unlogged tables:

m01 tps = 681.805851 (including connections establishing)
x01 tps = 665.120212 (including connections establishing)
m08 tps = 4753.823067 (including connections establishing)
x08 tps = 4638.690397 (including connections establishing)
m16 tps = 8150.519673 (including connections establishing)
x16 tps = 8082.504658 (including connections establishing)
m24 tps = 14069.077657 (including connections establishing)
x24 tps = 13934.955205 (including connections establishing)
m32 tps = 18736.317650 (including connections establishing)
x32 tps = 18888.585420 (including connections establishing)
m80 tps = 17709.683344 (including connections establishing)
x80 tps = 18330.488958 (including connections establishing)

Unfortunately, it does look like there is some raw loss of performance
when WALInsertLock is NOT badly contended; hence the drop-off at a
single client on permanent tables, and up through 24 clients on
unlogged tables.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#22)
1 attachment(s)
Re: Moving more work outside WALInsertLock

On 25.12.2011 21:48, Robert Haas wrote:

On Fri, Dec 23, 2011 at 2:54 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

I tried this one out on Nate Boley's system. Looks pretty good.
[pgbench results]

Great, thanks for the testing!

Unfortunately, it does look like there is some raw loss of performance
when WALInsertLock is NOT badly contended; hence the drop-off at a
single client on permanent tables, and up through 24 clients on
unlogged tables.

Hmm, I haven't been able to put my finger on what's causing that.

Anyway, here's a new version of the patch. It no longer busy-waits for
in-progress insertions to finish, and handles xlog-switches. This is now
feature-complete. It's a pretty complicated patch, so I would appreciate
more eyeballs on it. And benchmarking again.
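
For anyone just joining the thread, the core mechanism is: (1) reserve
WAL space and advertise the insertion in a slot, under a short
spinlock, then (2) copy the record into the WAL buffers with no lock
held. Below is a deliberately simplified, self-contained model of that
two-step protocol; a pthread mutex stands in for insertpos_lck, and
none of these types are the real PostgreSQL structures (no page
headers, no buffer eviction, no wraparound handling):

	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define NSLOTS	4
	#define BUFSZ	(8 * 1024)

	typedef struct
	{
		uint64_t	CurrPos;	/* 0 = slot free, nonzero = in progress */
	} InsertSlot;

	static struct
	{
		pthread_mutex_t insertpos_lck;	/* protects CurrPos and nextslot */
		uint64_t	CurrPos;			/* head of reserved WAL space */
		int			nextslot;
		InsertSlot	slots[NSLOTS];
		char		buf[BUFSZ];
	}			Ctl = {PTHREAD_MUTEX_INITIALIZER};

	/* Step 1: reserve space and advertise the insertion. Keep it short. */
	static int
	reserve(uint32_t len, uint64_t *start)
	{
		int			slot;

		pthread_mutex_lock(&Ctl.insertpos_lck);
		*start = Ctl.CurrPos;
		Ctl.CurrPos += len;
		slot = Ctl.nextslot;
		Ctl.nextslot = (Ctl.nextslot + 1) % NSLOTS;
		Ctl.slots[slot].CurrPos = *start + 1;	/* nonzero = in progress */
		pthread_mutex_unlock(&Ctl.insertpos_lck);
		return slot;
	}

	/* Step 2: copy into the reserved space; many backends can do this
	 * concurrently, with no lock held. */
	static void
	copy_record(int slot, uint64_t start, const char *rec, uint32_t len)
	{
		memcpy(Ctl.buf + start % BUFSZ, rec, len);	/* toy: no wraparound */
		Ctl.slots[slot].CurrPos = 0;	/* done; a page writer waiting on
										 * this slot could now proceed */
	}

	int
	main(void)
	{
		uint64_t	start;
		int			slot;

		slot = reserve(5, &start);
		copy_record(slot, start, "hello", 5);
		printf("inserted 5 bytes at position %llu\n",
			   (unsigned long long) start);
		return 0;
	}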

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-4.patchtext/x-diff; name=xloginsert-scale-4.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 282,307 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 283,293 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 311,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 297,311 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 326,331 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 317,393 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It sets the 'waiter' field in the
+  * slot it needs to wait for, and when that insertion finishes (or proceeds
+  * to the next page, at least), the inserter wakes up the process waiting for
+  * it. There is only one waiter field in each slot, so
+  * WaitXLogInsertionsToFinish() uses an lwlock, WALInsertWaitLock, to ensure
+  * that only one process attempts to wait on any slot at a time.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertWaitLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree() function,
+  * which is similar to WaitXLogInsertionsToFinish(), but instead of waiting
+  * for all insertions up to a given point to finish, it just waits for the
+  * inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 346,356 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 408,437 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertWaitLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 371,389 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 452,484 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *waiter;
+ 	slock_t		lck;
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
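
As an aside, nextslot/lastslot behave like a classic ring buffer that
sacrifices one slot to distinguish "full" from "empty". A self-contained toy
model of that arithmetic (NUM_SLOTS and next_slot are illustrative stand-ins
for NumXLogInsertSlots and NextSlotNo, with the ring shrunk to 8 slots):

#include <stdio.h>

#define NUM_SLOTS 8		/* stand-in for NumXLogInsertSlots */

/* wrap at NUM_SLOTS - 1, so indexes stay within 0..NUM_SLOTS-1 */
static int next_slot(int idx) { return (idx == NUM_SLOTS - 1) ? 0 : idx + 1; }

int main(void)
{
	int nextslot = 0;	/* next slot to hand out */
	int lastslot = 0;	/* oldest slot still in use */

	/* Reserve until we'd catch our tail: one slot always stays unused,
	 * so that nextslot == lastslot means "empty", not "full". */
	while (next_slot(nextslot) != lastslot)
		nextslot = next_slot(nextslot);
	printf("ring full: nextslot=%d lastslot=%d (%d of %d usable)\n",
		   nextslot, lastslot, NUM_SLOTS - 1, NUM_SLOTS);

	/* The oldest insertion finishes: lastslot advances, freeing a slot */
	lastslot = next_slot(lastslot);
	printf("after oldest finished, can reserve again: %s\n",
		   next_slot(nextslot) != lastslot ? "yes" : "no");
	return 0;
}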
***************
*** 397,405 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 492,510 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 471,498 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
- /* Free space remaining in the current xlog page buffer */
- #define INSERT_FREESPACE(Insert)  \
- 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 576,606 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
+ 
+ #define NextBufIdx(idx)		\
+ 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
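
The arithmetic behind these macros can be checked in isolation. A standalone
sketch with made-up geometry (BLCKSZ, SEG_SIZE and N_BUFFERS here are
illustrative constants, not the server's real settings):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ			8192u				/* stand-in for XLOG_BLCKSZ */
#define SEG_SIZE		(16u * 1024 * 1024)	/* stand-in for XLogSegSize */
#define SEGS_PER_FILE	(0xFFFFFFFFu / SEG_SIZE)
#define N_BUFFERS		8u					/* XLogCacheBlck + 1 */

typedef struct { uint32_t xlogid, xrecoff; } RecPtr;

/* cf. INSERT_FREESPACE: a page-boundary pointer has no space left */
static uint32_t freespace(RecPtr p)
{
	return (p.xrecoff % BLCKSZ == 0) ? 0 : BLCKSZ - p.xrecoff % BLCKSZ;
}

/* cf. XLogRecPtrToBufIdx: fixed page -> buffer mapping */
static unsigned buf_idx(RecPtr p)
{
	uint64_t byte = (uint64_t) p.xlogid * SEGS_PER_FILE * SEG_SIZE + p.xrecoff;
	return (unsigned) ((byte / BLCKSZ) % N_BUFFERS);
}

/* cf. XLogRecEndPtrToBufIdx: a boundary pointer means the previous page */
static unsigned buf_idx_end(RecPtr p)
{
	uint64_t byte = (uint64_t) p.xlogid * SEGS_PER_FILE * SEG_SIZE + p.xrecoff - 1;
	return (unsigned) ((byte / BLCKSZ) % N_BUFFERS);
}

int main(void)
{
	RecPtr mid  = {0, 3 * BLCKSZ + 100};	/* 100 bytes into page 3 */
	RecPtr edge = {0, 4 * BLCKSZ};			/* exactly on a page boundary */

	printf("mid:  freespace=%u bufidx=%u\n", freespace(mid), buf_idx(mid));
	printf("edge: freespace=%u bufidx=%u endidx=%u\n",
		   freespace(edge), buf_idx(edge), buf_idx_end(edge));
	return 0;
}

The "edge" case prints bufidx=4 but endidx=3: the boundary byte is accounted
to the page that ends there, which is exactly the distinction the two macros
encode.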
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 618,626 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 726,734 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 667,672 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 775,798 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len,
+ 				  bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool forcePageWrites);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 687,699 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
--- 813,821 ----
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	RecPtr;
  	XLogRecData *rdt;
+ 	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
***************
*** 705,713 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 827,837 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
+ 	bool		forcePageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 731,753 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	}
  
  	/*
! 	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up, and compute the CRC values for the data.  Note that the record
! 	 * header isn't added into the CRC initially since we don't know the final
! 	 * length or info bits quite yet.  Thus, the CRC will represent the CRC of
! 	 * the whole record in the order "rdata, then backup blocks, then record
! 	 * header".
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
! 	 * We could prevent the race by doing all this work while holding the
! 	 * insert lock, but it seems better to avoid doing CRC calculations while
! 	 * holding the lock.  This means we have to be careful about modifying the
! 	 * rdata chain until we know we aren't going to loop back again.  The only
! 	 * change we allow ourselves to make earlier is to set rdt->data = NULL in
! 	 * chain items we have decided we will have to back up the whole buffer
! 	 * for.  This is OK because we will certainly decide the same thing again
! 	 * for those items if we do it over; doing it here saves an extra pass
! 	 * over the chain later.
  	 */
  begin:;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
--- 855,872 ----
  	}
  
  	/*
! 	 * Here we scan the rdata chain, to determine which buffers must be backed
! 	 * up.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
! 	 * We could prevent the race by doing all this work while holding a lock
! 	 * while doing the CRC calculation, but the race condition is so rare that
! 	 * it's better to take an optimistic approach.
! 	 *
! 	 * We add entries for backup blocks to the chain, so that they don't
! 	 * need any special treatment in the critical section where the chunks
! 	 * are copied into the WAL buffers. But we are also prepared to undo those
! 	 * changes if we have to loop back here.
  	 */
  begin:;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
***************
*** 762,770 **** begin:;
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
- 	INIT_CRC32(rdata_crc);
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 881,889 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	forcePageWrites = Insert->forcePageWrites;
! 	doPageWrites = fullPageWrites || forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
  	{
***************
*** 772,778 **** begin:;
  		{
  			/* Simple data, just include it */
  			len += rdt->len;
- 			COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  		}
  		else
  		{
--- 891,896 ----
***************
*** 783,794 **** begin:;
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
  						rdt->data = NULL;
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
--- 901,912 ----
  				{
  					/* Buffer already referenced by earlier chain item */
  					if (dtbuf_bkp[i])
+ 					{
  						rdt->data = NULL;
+ 						rdt->len = 0;
+ 					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  				if (dtbuf[i] == InvalidBuffer)
***************
*** 800,811 **** begin:;
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
  					}
  					else if (rdt->data)
- 					{
  						len += rdt->len;
- 						COMP_CRC32(rdata_crc, rdt->data, rdt->len);
- 					}
  					break;
  				}
  			}
--- 918,927 ----
  					{
  						dtbuf_bkp[i] = true;
  						rdt->data = NULL;
+ 						rdt->len = 0;
  					}
  					else if (rdt->data)
  						len += rdt->len;
  					break;
  				}
  			}
***************
*** 820,858 **** begin:;
  	}
  
  	/*
- 	 * Now add the backup block headers and data into the CRC
- 	 */
- 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- 	{
- 		if (dtbuf_bkp[i])
- 		{
- 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
- 			char	   *page;
- 
- 			COMP_CRC32(rdata_crc,
- 					   (char *) bkpb,
- 					   sizeof(BkpBlock));
- 			page = (char *) BufferGetBlock(dtbuf[i]);
- 			if (bkpb->hole_length == 0)
- 			{
- 				COMP_CRC32(rdata_crc,
- 						   page,
- 						   BLCKSZ);
- 			}
- 			else
- 			{
- 				/* must skip the hole */
- 				COMP_CRC32(rdata_crc,
- 						   page,
- 						   bkpb->hole_offset);
- 				COMP_CRC32(rdata_crc,
- 						   page + (bkpb->hole_offset + bkpb->hole_length),
- 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
- 			}
- 		}
- 	}
- 
- 	/*
  	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
  	 * error checking in ReadRecord.  This means that all callers of
  	 * XLogInsert must supply at least some not-in-a-buffer data.  However, we
--- 936,941 ----
***************
*** 862,931 **** begin:;
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
- 	/*
- 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
- 	 * back and recompute everything.  This can only happen just after a
- 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
- 	 *
- 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
- 	 * affect the contents of the XLOG record, so we'll update our local copy
- 	 * but not force a recomputation.
- 	 */
- 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
- 	{
- 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
- 		RedoRecPtr = Insert->RedoRecPtr;
- 
- 		if (doPageWrites)
- 		{
- 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- 			{
- 				if (dtbuf[i] == InvalidBuffer)
- 					continue;
- 				if (dtbuf_bkp[i] == false &&
- 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
- 				{
- 					/*
- 					 * Oops, this buffer now needs to be backed up, but we
- 					 * didn't think so above.  Start over.
- 					 */
- 					LWLockRelease(WALInsertLock);
- 					END_CRIT_SECTION();
- 					goto begin;
- 				}
- 			}
- 		}
- 	}
- 
- 	/*
- 	 * Also check to see if forcePageWrites was just turned on; if we weren't
- 	 * already doing full-page writes then go back and recompute. (If it was
- 	 * just turned off, we could recompute the record without full pages, but
- 	 * we choose not to bother.)
- 	 */
- 	if (Insert->forcePageWrites && !doPageWrites)
- 	{
- 		/* Oops, must redo it with full-page data */
- 		LWLockRelease(WALInsertLock);
- 		END_CRIT_SECTION();
- 		goto begin;
- 	}
- 
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  Note that we have
! 	 * now irrevocably changed the input rdata chain.  At the exit of this
! 	 * loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
  	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
  	 */
  	write_len = len;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
--- 945,964 ----
  	if (len == 0 && !isLogSwitch)
  		elog(PANIC, "invalid xlog record length %u", len);
  
  	/*
  	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  We have now
! 	 * modified the original rdata chain, but we remember the last regular
! 	 * entry in rdt_lastnormal, so we can undo this if we have to loop back
! 	 * to the beginning.
! 	 *
! 	 * At the exit of this loop, write_len includes the backup block data.
  	 *
  	 * Also set the appropriate info bits to show which buffers were backed
  	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
  	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
  	 */
+ 	rdt_lastnormal = rdt;
  	write_len = len;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
***************
*** 941,947 **** begin:;
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = rdt->next;
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
--- 974,980 ----
  		page = (char *) BufferGetBlock(dtbuf[i]);
  
  		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = &(dtbuf_rdt1[i]);
  
  		rdt->data = (char *) bkpb;
  		rdt->len = sizeof(BkpBlock);
***************
*** 975,1048 **** begin:;
  	}
  
  	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
  	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
! 	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
! 
! 		END_CRIT_SECTION();
! 
! 		return RecPtr;
! 	}
! 
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 1008,1036 ----
  	}
  
  	/*
! 	 * Calculate CRC of the data, including all the backup blocks
! 	 *
! 	 * Note that the record header isn't added into the CRC initially since
! 	 * we don't know the prev-link yet.  Thus, the CRC will represent the CRC
! 	 * of the whole record in the order: rdata, then backup blocks, then
! 	 * record header.
  	 */
! 	INIT_CRC32(rdata_crc);
! 	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
! 		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	rechdr.xl_prev = InvalidXLogRecPtr; /* TO BE DETERMINED */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1063,1235 **** begin:;
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
  
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
  	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
  		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
! 		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
! 		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
! 
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1051,1694 ----
  	}
  #endif
  
! 	START_CRIT_SECTION();
  
! 	/*
! 	 * Try to do the insertion.
! 	 */
! 	RecPtr = PerformXLogInsert(write_len, isLogSwitch, &rechdr,
! 							   rdata, rdata_crc, forcePageWrites);
! 	END_CRIT_SECTION();
! 
! 	if (XLogRecPtrIsInvalid(RecPtr))
! 	{
! 		/*
! 		 * Oops, have to retry. Unlink the backup blocks from the chain and
! 		 * reset info bitmask to undo the changes we've done.
! 		 */
! 		rdt_lastnormal->next = NULL;
! 		info = info_orig;
! 		goto begin;
! 	}
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	return RecPtr;
! }
! 
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here;
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns the end position of the inserted record, like
!  * XLogInsert(). If RedoRecPtr or forcePageWrites has changed, returns
!  * InvalidXLogRecPtr, and the caller must recalculate the full-page images
!  * and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool forcePageWrites)
! {
! 	volatile XLogInsertSlot *myslot = NULL;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
! 	bool		updrqst;
! 
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, forcePageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
  	{
! 		return EndPos;
! 	}
! 
! 	/*
! 	 * Got it! Now that we know the prev-link, we can finish computing the
! 	 * record's CRC.
! 	 */
! 	rechdr->xl_prev = PrevRecord;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 
! 	/* Copy the record header in place */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 
! 				/*
! 				 * Get the next page. It's important to update CurrPos before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
+ 
+ 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
+ 		Assert(XLByteEQ(CurrPos, EndPos));
  
! 		/*
! 		 * Done! Clear CurrPos in our slot to let others know that we're
! 		 * finished.
! 		 */
! 		UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
  
! 		/*
! 		 * An xlog-switch record consumes all the remaining space on the
! 		 * WAL segment. We have already reserved it for us, but we still need
! 		 * to make sure it's been allocated and zeroed in the WAL buffers so
! 		 * that when the caller does XLogWrite(), it can really write out all
! 		 * the zeros.
! 		 *
! 		 * Before we do that, update our CurrPos to the end of segment. We
! 		 * don't write anything to the remaining wasted space here, the
! 		 * zeroing is done in AdvanceXLInsertBuffer().
! 		 */
! 		UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
! 		AdvanceXLInsertBuffer(EndPos, false);
! 
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which
! 		 * is reflected in EndPos, we need to return a value that points just
! 		 * to the end of the xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
! 
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
! 
! 	return EndPos;
! }
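
The deferred-CRC order used above -- rdata first, then the header minus its
own crc field once the prev-link is known -- can be reproduced in miniature.
A self-contained sketch, where Hdr and crc32_comp are toy stand-ins for
XLogRecord and the COMP_CRC32/FIN_CRC32 machinery:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Minimal bitwise CRC-32 (reflected, polynomial 0xEDB88320) */
static uint32_t crc32_comp(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

typedef struct
{
	uint32_t	crc;	/* covers the rest of the header plus all data */
	uint64_t	prev;	/* like xl_prev: known only after reservation */
	uint32_t	len;
} Hdr;

int main(void)
{
	const char	body[] = "rdata chain bytes, incl. backup block chunks";
	Hdr			h;
	uint32_t	crc;

	memset(&h, 0, sizeof(h));			/* zero padding, like MemSet(&rechdr) */
	h.len = sizeof(body);

	/* 1. CRC the data before taking any lock... */
	crc = crc32_comp(0xFFFFFFFFu, body, sizeof(body));

	/* 2. ...then fold in the header, skipping its crc field, once the
	 *    reservation has handed us the prev-link. */
	h.prev = 0x123456789abcULL;
	crc = crc32_comp(crc, (const char *) &h + sizeof(h.crc),
					 sizeof(h) - sizeof(h.crc));
	h.crc = ~crc;						/* FIN_CRC32 equivalent */

	printf("record crc = %08X\n", (unsigned) h.crc);
	return 0;
}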
! 
! /*
!  * Reserves the right amount of space for a record of the given size from
!  * the WAL. *StartPos_p is set to the beginning of the reserved section,
!  * *EndPos_p to its end, and *PrevRecord_p to the beginning of the previous
!  * record, for use as the prev-link in the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one; in that case the caller should update the shared
!  * LogwrtRqst value after it's done inserting the record, so that the WAL
!  * page that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position,
!  * to let others know that we're busy inserting to the reserved area. The
!  * caller must clear it when the insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be
!  * serialized across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in PerformXLogInsert,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool forcePageWrites,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
! {
! 	volatile XLogInsertSlot *myslot;
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
! 
! retry:
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 
! 	if (!isLogSwitch &&
! 		(!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		 Insert->forcePageWrites != forcePageWrites))
! 	{
! 		/*
! 		 * Oops, forcePageWrites was just turned on, or a checkpoint
! 		 * just happened. Loop back to the beginning, because we might have
! 		 * to include more full-page images in the record.
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
  
  	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertWaitLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
  	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
  
! 	/*
! 	 * Now reserve the right amount of space from the WAL for our record.
! 	 */
! 	ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
  
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
+ 	/*
+ 	 * We are now at the starting position of our record. Now figure out how
+ 	 * the data will be split across the WAL pages, to calculate where the
+ 	 * record ends.
+ 	 */
+ 	StartPos = ptr;
+ 
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
! 		Assert(size == SizeOfXLogRecord);
! 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release the insert position lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
  
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
! 		}
! 		else
  		{
! 			if (ptr.xrecoff % XLogSegSize != 0)
! 			{
! 				int			segleft = XLogSegSize - (ptr.xrecoff % XLogSegSize);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
! 		}
! 	}
! 	else
! 	{
! 		/* A normal record, ie. not xlog-switch */
! 		int sizeleft = size;
! 		while (freespace < sizeleft)
! 		{
! 			/* fill this page, and continue on next page */
! 			sizeleft -= freespace;
! 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
! 
! 			updrqst = true;
  		}
+ 		/* the rest fits on this page */
+ 		ptr.xrecoff += sizeleft;
  
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
  
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = StartPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
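
Since the reservation loop is pure arithmetic, it can be exercised
standalone. A sketch with illustrative header sizes (SHORT_PHD and
CONTREC_HDR are made-up values, and long-header/segment-crossing cases are
ignored for brevity):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ			8192u	/* stand-in for XLOG_BLCKSZ */
#define SHORT_PHD		24u		/* stand-in for SizeOfXLogShortPHD */
#define CONTREC_HDR		8u		/* stand-in for SizeOfXLogContRecord */
#define MAXALIGN8(x)	(((x) + 7u) & ~(uint64_t) 7)

static uint32_t freespace(uint64_t off)
{
	return (off % BLCKSZ == 0) ? 0 : (uint32_t) (BLCKSZ - off % BLCKSZ);
}

/* first insertable byte of the next page, past its (short) page header */
static uint64_t next_page(uint64_t off)
{
	return off + freespace(off) + SHORT_PHD;
}

/* mirror of the loop above: split 'size' bytes across pages, charging one
 * continuation header per page crossed, then MAXALIGN the end position */
static uint64_t reserve_end(uint64_t start, uint32_t size)
{
	uint64_t	ptr = start;
	uint32_t	left = size;
	uint32_t	fs = freespace(ptr);

	while (fs < left)
	{
		left -= fs;
		ptr = next_page(ptr) + CONTREC_HDR;
		fs = freespace(ptr);
	}
	return MAXALIGN8(ptr + left);
}

int main(void)
{
	uint64_t	start = SHORT_PHD + 100;	/* 100 bytes into the first page */

	printf("small record ends at %llu\n",
		   (unsigned long long) reserve_end(start, 200));
	printf("page-crossing record ends at %llu\n",
		   (unsigned long long) reserve_end(start, 3 * BLCKSZ));
	return 0;
}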
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *waiter;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	waiter = myslot->waiter;
! 	myslot->waiter = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	if (waiter != NULL)
! 		PGSemaphoreUnlock(&waiter->sem);
! }
! 
! /*
!  * Get a pointer to the right location in the WAL buffer corresponding to a
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might also
!  * require evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
  
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
!  * A given page is always loaded into the same buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
! 	 * AdvanceXLInsertBuffer) and retry if we see anything else than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that *is* ahead of the page we're looking for. So
! 	 * don't PANIC on that, until we've verified the value while holding the
! 	 * lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
! 		}
  	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
+ 
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto: the position of the oldest in-progress insertion
+  * we saw in the array, or CurrPos if no insertions were in progress at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertWaitLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			slot->waiter = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			ProcWaitForSignal();
  		}
  	}
  
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertWaitLock);
  
! 	return LastPos;
! }
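
Setting aside the actual sleeping and waking, the scan reduces to a walk over
the slot array. A single-threaded toy model (0 stands in for
InvalidXLogRecPtr; in the real function each in-progress slot below 'upto' is
waited on via the slot owner's semaphore, and lastslot advances under
WALInsertWaitLock):

#include <stdio.h>
#include <stdint.h>

#define NUM_SLOTS 8
static int next_slot(int i) { return (i == NUM_SLOTS - 1) ? 0 : i + 1; }

int main(void)
{
	/* per-slot in-progress insert positions; 0 means finished/free */
	uint64_t	slots[NUM_SLOTS] = {0, 0, 5000, 0, 7000, 12000, 0, 0};
	int			lastslot = 0, nextslot = 6;
	uint64_t	upto = 10000;

	for (int i = lastslot; i != nextslot; i = next_slot(i))
	{
		if (slots[i] == 0)
			continue;			/* finished: lastslot may advance past it */
		if (slots[i] >= upto)
		{
			/* everything below 'upto' is fully copied into the buffers */
			printf("done: oldest in-progress slot %d is at %llu >= upto\n",
				   i, (unsigned long long) slots[i]);
			return 0;
		}
		printf("would wait on slot %d (inserting at %llu < %llu)\n",
			   i, (unsigned long long) slots[i], (unsigned long long) upto);
	}
	printf("no in-progress insertions below %llu\n",
		   (unsigned long long) upto);
	return 0;
}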
! 
! /*
!  * Wait for the next insertion slot to become vacant.
!  */
! static void
! WaitForXLogInsertionSlotToBecomeFree(void)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! 	LWLockAcquire(WALInsertWaitLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	if (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		while(!XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			slot->waiter = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			ProcWaitForSignal();
! 			SpinLockAcquire(&slot->lck);
! 		}
! 		SpinLockRelease(&slot->lck);
  	}
+ 	/*
+ 	 * Ok, there is at least one empty slot now. That's enough for our
+ 	 * insertion, but while we're at it, advance lastslot as much as we
+ 	 * can, so that we don't need to come back here on the next call.
+ 	 */
+ 	while (lastslot != nextslot)
+ 	{
+ 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
+ 		/*
+ 		 * Don't need to grab the slot's spinlock here, because we're not
+ 		 * interested in the exact value of CurrPos, only whether it's
+ 		 * valid or not.
+ 		 */
+ 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
+ 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertWaitLock);
  }
  
  /*
***************
*** 1457,1490 **** XLogArchiveCleanup(const char *xlog)
  
  /*
   * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1916,1946 ----
  
  /*
   * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data, until the page containing
!  * 'upto' is initialized. If 'opportunistic' is true, instead initialize
!  * as many pages as possible without writing out any unwritten data. Any
!  * new pages are initialized to zeros, with page headers set up properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
  	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
+ 	XLogRecPtr	EvictedPtr;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1492,1503 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 1948,1964 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1505,1564 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
! 		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 1966,2025 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
+ 	EvictedPtr = OldPageRqstPtr;
+ 
  	/*
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1568,1581 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
- 	Insert->curridx = nextidx;
- 	Insert->currpage = NewPage;
- 
- 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
- 
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
--- 2029,2040 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
+ 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
+ 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
+ 
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
  	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
  	 * will look like zeroes and not valid XLOG records...
***************
*** 1618,1628 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2077,2106 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * make sure the xlblocks update becomes visible to others before the
! 	 * curridx update.
! 	 */
! 	pg_write_barrier();
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 
! 	Assert(opportunistic || XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]));
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, up to %X/%X (evicted up to %X/%X) in slot %d (backend %d)",
! 			 npages,
! 			 NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff,
! 			 EvictedPtr.xlogid, EvictedPtr.xrecoff,
! 			 nextidx, MyBackendId);
! #endif
  }
  
  /*
***************
*** 1667,1682 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2145,2156 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1694,1700 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2168,2174 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1726,1735 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
--- 2200,2209 ----
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
  		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
! 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X (slot %d)",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff, curridx);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
  		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
***************
*** 1829,1844 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2303,2315 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1928,1935 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2399,2404 ----
***************
*** 2101,2134 **** XLogFlush(XLogRecPtr record)
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
--- 2570,2618 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
+ 		/* try to write/flush later additions to XLOG as well */
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
+ 
+ 		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)		/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
***************
*** 2238,2260 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2722,2752 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5048,5053 **** XLOGShmemSize(void)
--- 5540,5548 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5063,5068 **** XLOGShmemInit(void)
--- 5558,5564 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5088,5093 **** XLOGShmemInit(void)
--- 5584,5601 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogCtl->XLogInsertSlots[i].CurrPos = InvalidXLogRecPtr;
+ 		XLogCtl->XLogInsertSlots[i].waiter = NULL;
+ 		SpinLockInit(&XLogCtl->XLogInsertSlots[i].lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5102,5112 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5610,5621 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 5981,5986 **** StartupXLOG(void)
--- 6490,6496 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6737,6744 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7247,7258 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6746,6771 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7260,7282 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6777,6783 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7288,7294 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7271,7279 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
  XLogRecPtr
  GetInsertRecPtr(void)
--- 7782,7794 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
+  *
+  * XXX: now that there can be several insertions "in-flight", what should
+  * this return? The position a new insertion would go to? Or the oldest
+  * still in-progress insertion, perhaps?
   */
  XLogRecPtr
  GetInsertRecPtr(void)
***************
*** 7547,7552 **** CreateCheckPoint(int flags)
--- 8062,8068 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7615,7624 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8131,8140 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7630,7636 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8146,8152 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7639,7653 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8155,8166 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7673,7686 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8186,8195 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7706,7712 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8215,8221 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7726,7732 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8235,8241 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8093,8107 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8602,8616 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8294,8299 **** RequestXLogSwitch(void)
--- 8803,8809 ----
  {
  	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
+ 	XLogwrtRqst FlushRqst;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
  	rdata.buffer = InvalidBuffer;
***************
*** 8303,8308 **** RequestXLogSwitch(void)
--- 8813,8839 ----
  
  	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
+ 	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ 	 * if the xlog switch had no work to do, ie. if we were already at the
+ 	 * beginning of a new XLOG segment. You can check whether RecPtr points
+ 	 * to the beginning of a segment if you want to keep the distinction.
+ 	 */
+ 	TRACE_POSTGRESQL_XLOG_SWITCH();
+ 
+ 	/*
+ 	 * Flush through the end of the page containing XLOG_SWITCH, and
+ 	 * perform end-of-segment actions (eg, notifying archiver).
+ 	 */
+ 	WaitXLogInsertionsToFinish(RecPtr, InvalidXLogRecPtr);
+ 
+ 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ 	FlushRqst.Write = RecPtr;
+ 	FlushRqst.Flush = RecPtr;
+ 	START_CRIT_SECTION();
+ 	XLogWrite(FlushRqst, false);
+ 	END_CRIT_SECTION();
+ 	LWLockRelease(WALWriteLock);
+ 
  	return RecPtr;
  }
  
***************
*** 8856,8861 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9387,9393 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
***************
*** 8905,8930 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9437,9462 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 8986,8998 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9518,9530 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9074,9083 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9606,9616 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9094,9100 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9627,9633 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9107,9112 **** pg_start_backup_callback(int code, Datum arg)
--- 9640,9646 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
***************
*** 9148,9156 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 9682,9690 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9159,9174 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 9693,9708 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9370,9385 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9904,9921 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9431,9446 **** GetStandbyFlushRecPtr(void)
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(bool needlock)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	if (needlock)
! 		LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	if (needlock)
! 		LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 9967,9980 ----
   * Get latest WAL insert pointer
   */
  XLogRecPtr
! GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 200,206 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr(true);
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
--- 200,206 ----
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	current_recptr = GetXLogInsertRecPtr();
  
  	snprintf(location, sizeof(location), "%X/%X",
  			 current_recptr.xlogid, current_recptr.xrecoff);
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 288,294 **** extern bool XLogInsertAllowed(void);
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(bool needlock);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
--- 288,294 ----
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
  extern XLogRecPtr GetStandbyFlushRecPtr(void);
! extern XLogRecPtr GetXLogInsertRecPtr(void);
  extern XLogRecPtr GetXLogWriteRecPtr(void);
  extern bool RecoveryIsPaused(void);
  extern void SetRecoveryPause(bool recoveryPause);
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertWaitLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#22)
Re: Moving more work outside WALInsertLock

On Sun, Dec 25, 2011 at 7:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

m01 tps = 631.875547 (including connections establishing)
x01 tps = 611.443724 (including connections establishing)
m08 tps = 4573.701237 (including connections establishing)
x08 tps = 4576.242333 (including connections establishing)
m16 tps = 7697.783265 (including connections establishing)
x16 tps = 7837.028713 (including connections establishing)
m24 tps = 11613.690878 (including connections establishing)
x24 tps = 12924.027954 (including connections establishing)
m32 tps = 10684.931858 (including connections establishing)
x32 tps = 14168.419730 (including connections establishing)
m80 tps = 10259.628774 (including connections establishing)
x80 tps = 13864.651340 (including connections establishing)

I think a 5% loss on 1 session is worth a 40% gain on a fully loaded system.

Well done Heikki.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#23)
Re: Moving more work outside WALInsertLock

On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Anyway, here's a new version of the patch. It no longer busy-waits for
in-progress insertions to finish, and handles xlog-switches. This is now
feature-complete. It's a pretty complicated patch, so I would appreciate
more eyeballs on it. And benchmarking again.

Took me a while to understand why the data structure for the insertion
slots is so complex. Why not have slots per buffer? That would be
easier to understand, and slots are very small. Not sure if it's a good
idea, but we should explain the design options around that choice.

Can we avoid having spinlocks on the slots altogether? If we have a
page number (int) and an LSN, inserters would set the LSN and then set
the page number. Anybody waiting on a slot would stop if the page number
is zero, since that means it's not complete yet. So readers look at the
page number first and aren't allowed to look at the LSN without a valid
page number.
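
For illustration, a minimal sketch of that ordering protocol could look
like the code below. The field names and helpers are illustrative, not
from the patch; pg_write_barrier()/pg_read_barrier() from
storage/barrier.h, which the patch already pulls in, are assumed to be
available:

/*
 * Hypothetical lock-free slot: the inserter publishes the LSN first,
 * then the page number; readers check in the opposite order.
 */
typedef struct
{
	XLogRecPtr	lsn;			/* position being inserted to */
	volatile int pageno;		/* 0 means "not valid yet" */
} LockFreeSlot;

static void
publish_slot(volatile LockFreeSlot *slot, XLogRecPtr lsn, int pageno)
{
	slot->lsn = lsn;
	pg_write_barrier();			/* make the LSN visible before pageno */
	slot->pageno = pageno;
}

static bool
read_slot(volatile LockFreeSlot *slot, XLogRecPtr *lsn)
{
	if (slot->pageno == 0)
		return false;			/* not complete yet; caller stops here */
	pg_read_barrier();			/* don't read lsn before checking pageno */
	*lsn = slot->lsn;
	return true;
}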

Page number would be useful in working out where to stop when doing
background flushes, which we need for Group Commit, which is arriving
soon for this release.

Can we also try aligning the actual insertions onto cache lines rather
than just MAXALIGNing them? The WAL header fills half a cache line as
it is, so many other records will fit nicely. I'd like to see what
that does to space consumption, but it might be a useful option at
least.

I'd like to see test results with FPWs turned off and CACHEALIGNed
inserts. Again, we're planning on avoiding FPWs in future, so it would
be sensible to check the tuning in that configuration also.

GetInsertRecPtr() should return the XLogRecPtr of the latest
allocation. IMHO that is what we need for checkpoints, and the
walsender doesn't really matter.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#25)
Re: Moving more work outside WALInsertLock

On 09.01.2012 15:44, Simon Riggs wrote:

On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Anyway, here's a new version of the patch. It no longer busy-waits for
in-progress insertions to finish, and handles xlog-switches. This is now
feature-complete. It's a pretty complicated patch, so I would appreciate
more eyeballs on it. And benchmarking again.

Took me a while to understand why the data structure for the insertion
slots is so complex. Why not have slots per buffer? That would be
easier to understand, and slots are very small.

Hmm, how would that work?

Can we avoid having spinlocks on the slots altogether? If we have a
page number (int) and an LSN, inserters would set the LSN and then set
the page number. Anybody waiting on a slot would stop if the page number
is zero, since that means it's not complete yet. So readers look at the
page number first and aren't allowed to look at the LSN without a valid
page number.

The LSN on a slot is set in ReserveXLogInsertLocation(), while holding
the insertpos_lck spinlock. The inserter doesn't acquire the per-slot
spinlock at that point, it relies on the fact that no-one will look at
the slot until the shared nextslot variable has been incremented. The
spinlock is only acquired when updating the pointer, which only happens
when crossing a WAL page, which isn't that performance-critical, and
when the insertion is finished. It would be nice to get rid of the
spinlock acquisition when the insertion is finished, but I don't see any
easy way around that. The spinlock is needed to make sure that when the
inserter clears its slot, it can atomically check the waiter field.

The theory is that contention on those per-slot spinlocks is very rare.
Profiling this patch with "perf", it looks like the bottleneck is the
insertpos_lck spinlock.
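
To make that concrete, here's a minimal sketch of the finish step, based
on the single-waiter slot layout in the earlier patch above (CurrPos,
waiter, lck); the semaphore wakeup is my assumption, not code taken from
the patch:

static void
finish_xlog_insert(volatile XLogInsertSlot *slot)
{
	PGPROC	   *waiter;

	/*
	 * Clearing CurrPos and fetching the waiter must be one atomic step:
	 * without the spinlock, a backend could register as waiter just
	 * after we looked, and would never be woken up.
	 */
	SpinLockAcquire(&slot->lck);
	slot->CurrPos = InvalidXLogRecPtr;	/* insertion no longer in progress */
	waiter = slot->waiter;
	slot->waiter = NULL;
	SpinLockRelease(&slot->lck);

	if (waiter != NULL)
		PGSemaphoreUnlock(&waiter->sem);	/* wake the waiting backend */
}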

Page number would be useful in working out where to stop when doing
background flushes, which we need for Group Commit, which is arriving
soon for this release.

Ok, I'll have to just take your word on that :-). I don't see why Group
Commit needs to care about page boundaries, but the slot data structure
I used already allows you to check fairly cheaply how far you can write
out the WAL without having to wait for any in-progress insertions to
complete.

Can we also try aligning the actual insertions onto cache lines rather
than just MAXALIGNing them? The WAL header fills half a cache line as
it is, so many other records will fit nicely. I'd like to see what
that does to space consumption, but it might be a useful option at
least.

Hmm, that's an interesting thought. That would mean having gaps in the
in-memory WAL cache, so that when it's written out, you'd need to stitch
together the pieces to form the WAL that's actually written to disk. Or
just leave the gaps in the on-disk format, if we're willing to change
the WAL format for this, but I don't think we want to make our WAL any
larger than it already is.

I've written this patch avoiding WAL format changes, but if we're
willing to do that, there are a few things we could do that would help.
For one, the logic in ReserveXLogInsertLocation() that figures out where
in the WAL stream the record begins and where it ends could be made a
lot simpler. At the moment, we refuse to split a WAL record header
across WAL pages, and because of that, the number of bytes occupied by a
WAL record depends on where in the WAL it's written. If we changed that,
reserving space from the WAL for a record that's N bytes long could be
done essentially as "CurrPos += N". There are some complications, like
having to keep track of the prev-link too, but I believe it would be
possible to get rid of the spinlock and implement
ReserveXLogInsertLocation() as a single atomic fetch-and-add instruction.
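
As a sketch of what that could look like (the byte-position counter and
the C11 atomics here are assumptions for illustration, not part of the
patch, and tracking the prev-link would need more than this):

#include <stdatomic.h>
#include <stdint.h>

/* Imaginary insert position, kept as a plain byte count of WAL. */
static _Atomic uint64_t insert_bytepos;

/*
 * Reserve rec_len bytes of WAL; the return value is where the record
 * starts. This only works if a record's size doesn't depend on where
 * it lands, i.e. if record headers may cross page boundaries.
 */
static uint64_t
reserve_xlog_space(uint32_t rec_len)
{
	return atomic_fetch_add(&insert_bytepos, (uint64_t) rec_len);
}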

GetInsertRecPtr() should return the XLogRecPtr of the latest
allocation. IMHO that is what we need for checkpoints and the
walsender doesn't really matter.

Ok. Thanks for looking at the patch!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#26)
Re: Moving more work outside WALInsertLock

On Mon, Jan 9, 2012 at 2:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Can we also try aligning the actual insertions onto cache lines rather
than just MAXALIGNing them? The WAL header fills half a cache line as
it is, so many other records will fit nicely. I'd like to see what
that does to space consumption, but it might be a useful option at
least.

Hmm, that's an interesting thought. That would mean having gaps in the
in-memory WAL cache, so that when it's written out, you'd need to stitch
together the pieces to form the WAL that's actually written to disk. Or just
leave the gaps in the on-disk format, if we're willing to change the WAL
format for this, but I don't think we want to make our WAL any larger than
it already is.

I don't think that would require any format changes at all. We never
check that the total length of a WAL record matches the size of its
contents, do we? So if the record is just a little too big for its
contents, it will still all work fine. Recovery just calls rmgr
functions, so it doesn't know how big things should be. _redo routines
don't do local validation anywhere that I'm aware of. Sure, they cast
the record to a specific type, but that doesn't prevent the record
from being longer than _redo wants it to be. Try it.

You could probably do it for individual record types with an
additional rdata item.
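
A rough sketch of that idea, appending one extra zero-filled rdata item
to the chain; the 64-byte line size and the helper itself are
illustrative assumptions, not from the patch:

#define CACHE_LINE_SIZE 64

static char xlog_pad_bytes[CACHE_LINE_SIZE];	/* stays all-zeros */

/*
 * Append one extra rdata item so the record's total length rounds up to
 * a cache line. 'tail' is the last item of the existing chain; 'pad' is
 * caller-provided storage for the extra item.
 */
static void
pad_record_to_cacheline(XLogRecData *tail, XLogRecData *pad, uint32 total_len)
{
	uint32		remainder = total_len % CACHE_LINE_SIZE;

	if (remainder == 0)
		return;					/* already aligned */

	pad->data = xlog_pad_bytes;
	pad->len = CACHE_LINE_SIZE - remainder;
	pad->buffer = InvalidBuffer;	/* padding needs no backup block */
	pad->buffer_std = false;
	pad->next = NULL;
	tail->next = pad;
}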

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#26)
1 attachment(s)
Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Here's another version of the patch to make XLogInsert less of a
bottleneck on multi-CPU systems. The basic idea is the same as before,
but several bugs have been fixed, and lots of misc. clean up has been done.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-5.patchtext/x-diff; name=xloginsert-scale-5.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 282,307 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 283,293 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 311,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 297,311 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 326,331 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 317,391 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible,
+  *    insertpos_lck can be heavily contended on a busy system
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree() function,
+  * which is similar to WaitXLogInsertionsToFinish(), but instead of waiting
+  * for all insertions up to a given point to finish, it just waits for the
+  * inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 346,356 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 406,435 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 371,389 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 450,483 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 397,405 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 491,509 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 471,498 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
- /* Free space remaining in the current xlog page buffer */
- #define INSERT_FREESPACE(Insert)  \
- 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 575,605 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
+ 
+ #define NextBufIdx(idx)		\
+ 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 618,626 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 725,733 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 667,672 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 774,797 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len,
+ 				  bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool didPageWrites);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 687,698 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 812,818 ----
***************
*** 706,715 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 826,835 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 896,1029 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
! 	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
! 		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
! 		}
! 	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
- 	/*
- 	 * If there isn't enough space on the current XLOG page for a record
- 	 * header, advance to the next page (leaving the unused space as zeroes).
- 	 */
- 	updrqst = false;
- 	freespace = INSERT_FREESPACE(Insert);
- 	if (freespace < SizeOfXLogRecord)
- 	{
- 		updrqst = AdvanceXLInsertBuffer(false);
- 		freespace = INSERT_FREESPACE(Insert);
- 	}
- 
- 	/* Compute record's XLOG location */
- 	curridx = Insert->curridx;
- 	INSERT_RECPTR(RecPtr, Insert, curridx);
- 
- 	/*
- 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
- 	 * segment, we need not insert it (and don't want to because we'd like
- 	 * consecutive switch requests to be no-ops).  Instead, make sure
- 	 * everything is written and flushed through the end of the prior segment,
- 	 * and return the prior segment's end address.
- 	 */
- 	if (isLogSwitch &&
- 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
- 	{
- 		/* We can release insert lock immediately */
- 		LWLockRelease(WALInsertLock);
- 
- 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
- 		if (RecPtr.xrecoff == 0)
- 		{
- 			/* crossing a logid boundary */
- 			RecPtr.xlogid -= 1;
- 			RecPtr.xrecoff = XLogFileSize;
- 		}
- 
- 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
- 		LogwrtResult = XLogCtl->Write.LogwrtResult;
- 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
- 		{
- 			XLogwrtRqst FlushRqst;
- 
- 			FlushRqst.Write = RecPtr;
- 			FlushRqst.Flush = RecPtr;
- 			XLogWrite(FlushRqst, false, false);
- 		}
- 		LWLockRelease(WALWriteLock);
- 
- 		END_CRIT_SECTION();
- 
- 		return RecPtr;
- 	}
- 
- 	/* Insert record header */
- 
- 	record = (XLogRecord *) Insert->currpos;
- 	record->xl_prev = Insert->PrevRecord;
- 	record->xl_xid = GetCurrentTransactionIdIfAny();
- 	record->xl_tot_len = SizeOfXLogRecord + write_len;
- 	record->xl_len = len;		/* doesn't include backup blocks */
- 	record->xl_info = info;
- 	record->xl_rmid = rmid;
- 
- 	/* Now we can finish computing the record's CRC */
- 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
- 			   SizeOfXLogRecord - sizeof(pg_crc32));
- 	FIN_CRC32(rdata_crc);
- 	record->xl_crc = rdata_crc;
- 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
--- 1016,1055 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set in PerformXLogInsert() */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, isLogSwitch, &rechdr,
! 							   rdata, rdata_crc, doPageWrites);
! 	END_CRIT_SECTION();
! 
! 	if (XLogRecPtrIsInvalid(RecPtr))
  	{
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
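
The deferred CRC that the new code relies on is easy to model in
isolation: accumulate the CRC over the payload first, and only once the
reservation has supplied the prev-link, fold in the header (skipping its
own crc field) and finalize. A self-contained sketch, with a plain
bitwise CRC-32 standing in for the real COMP_CRC32/FIN_CRC32 macros and
a toy header in place of XLogRecord:

#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, standing in for the COMP_CRC32/FIN_CRC32 machinery */
static uint32_t
demo_crc32_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len-- > 0)
	{
		int			i;

		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Toy record header; like XLogRecord, the crc field comes first */
typedef struct DemoRecHdr
{
	uint32_t	xl_crc;
	uint64_t	xl_prev;
	uint32_t	xl_len;
} DemoRecHdr;

/*
 * Stage 1 (the payload CRC) can be computed before any lock is taken;
 * this stage 2 runs once the reservation has supplied the prev-link.
 */
static void
demo_finish_crc(DemoRecHdr *hdr, uint32_t payload_crc, uint64_t prev)
{
	hdr->xl_prev = prev;
	hdr->xl_crc = demo_crc32_update(payload_crc,
									(const char *) hdr + sizeof(uint32_t),
									sizeof(DemoRecHdr) - sizeof(uint32_t));
}
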
***************
*** 1032,1215 **** begin:;
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
  
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1058,1762 ----
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
! 	 */
! 	return RecPtr;
! }
  
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here,
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns pointer to end of inserted record like XLogInsert().
!  * If RedoRecPtr or forcePageWrites had changed, returns InvalidRecPtr, and
!  * the caller must recalculate full-page-images and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool didPageWrites)
! {
! 	volatile XLogInsertSlot *myslot = NULL;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
! 	bool		updrqst;
! 
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, didPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
! 	{
! 		return EndPos;
! 	}
  
  	/*
! 	 * Got it! Now that we know the prev-link, we can finish computing the
! 	 * record's CRC.
  	 */
! 	rechdr->xl_prev = PrevRecord;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
  
! 	/* Copy the record header in place */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				CurrPos.xrecoff += freespace;
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Update the slot's CurrPos with that before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page were not yet initialized.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/* Now skip page header */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
+ 
+ 		/*
+ 		 * An xlog-switch record consumes all the remaining space on the
+ 		 * WAL segment. We have already reserved it for us, but we still need
+ 		 * to make sure it's been allocated and zeroed in the WAL buffers so
+ 		 * that when the caller (or someone else) does XLogWrite(), it can
+ 		 * really write out all the zeros.
+ 		 *
+ 		 * We do this one page at a time, to make sure we don't deadlock
+ 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
+ 		 */
+ 		while (XLByteLT(CurrPos, EndPos))
+ 		{
+ 			/* use up all the remaining space in this page */
+ 			freespace = INSERT_FREESPACE(CurrPos);
+ 			XLByteAdvance(CurrPos, freespace);
+ 			/*
+ 			 * Like in the non-xlog-switch codepath, let others know that
+ 			 * we're done writing up to the end of this page.
+ 			 */
+ 			UpdateSlotCurrPos(myslot, CurrPos);
+ 			/*
+ 			 * Let GetXLogBuffer initialize the next page if necessary.
+ 			 */
+ 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
+ 			(void) GetXLogBuffer(CurrPos);
+ 		}
  
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which
! 		 * is reflected in EndPos, we need to return a value that points just
! 		 * to the end of the xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
! 	return EndPos;
! }
  
+ /*
+  * Reserves the right amount of space for a record of given size from the WAL.
+  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
+  * its end, and *PrevRecord_p to the beginning of the previous record, for
+  * use as the prev-link in the record header.
+  *
+  * A log-switch record is handled slightly differently. The rest of the
+  * segment will be reserved for this insertion, as indicated by the returned
+  * *EndPos_p value. However, if we are already at the beginning of the current
+  * segment, *EndPos_p is set to the current location without reserving
+  * any space, and the function returns false.
+  *
+  * *updrqst_p is set to true if this record ends on a different page than
+  * the previous one. In that case the caller should update the shared
+  * LogwrtRqst value after it's done inserting the record, so that the WAL
+  * page that filled up gets written out at the next convenient moment.
+  *
+  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
+  * (or the end of the previous record, to be exact) to let others know that
+  * we're busy inserting into the reserved area. The caller must clear it when
+  * the insertion is finished.
+  *
+  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
+  * changed. On failure, the shared state is not modified.
+  *
+  * This is the performance critical part of XLogInsert that must be
+  * serialized across backends. The rest can happen mostly in parallel.
+  *
+  * NB: The space calculation here must match the code in PerformXLogInsert,
+  * where we actually copy the record to the reserved space.
+  */
+ static bool
+ ReserveXLogInsertLocation(int size, bool didPageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
+ {
+ 	volatile XLogInsertSlot *myslot;
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			freespace;
+ 	XLogRecPtr	ptr;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	LastEndPos;
+ 	int32		nextslot;
+ 	int32		lastslot;
+ 	bool		updrqst = false;
+ 
+ retry:
+ 	SpinLockAcquire(&Insert->insertpos_lck);
+ 
+ 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
+ 		(!didPageWrites && Insert->forcePageWrites))
+ 	{
  		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
  		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
! 
! 	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
! 	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
  
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
  
+ 	/*
+ 	 * If there isn't enough space on the current XLOG page for a record
+ 	 * header, advance to the next page (leaving the unused space as zeroes).
+ 	 */
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	if (freespace < SizeOfXLogRecord)
+ 	{
+ 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
+ 		freespace = INSERT_FREESPACE(ptr);
+ 		updrqst = true;
+ 	}
+ 
+ 	/*
+ 	 * We are now at the starting position of our record. Now figure out how
+ 	 * the data will be split across the WAL pages, to calculate where the
+ 	 * record ends.
+ 	 */
+ 	StartPos = ptr;
+ 
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
+ 		Assert(size == SizeOfXLogRecord);
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
! 
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
  		}
+ 		else
+ 		{
+ 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
+ 			{
+ 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
+ 				XLByteAdvance(ptr, segleft);
+ 			}
+ 			updrqst = true;
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* A normal record, ie. not xlog-switch */
+ 		int sizeleft = size;
+ 		while (freespace < sizeleft)
+ 		{
+ 			/* fill this page, and continue on next page */
+ 			sizeleft -= freespace;
+ 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
  
! 			updrqst = true;
! 		}
! 		/* the rest fits on this page */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
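
Stripped of slot management and the xlog-switch special case, the work
that must run under insertpos_lck is just pointer arithmetic: advance a
byte position across page headers until the record fits. A simplified
single-threaded model of that critical section (illustrative names and
constants; the real function also handles the slot ring, end-position
alignment, and the rule that a position at the start of a page belongs
to the previous page):

#include <stdint.h>

#define DEMO_BLCKSZ 8192u
#define DEMO_PHDRSZ 24u			/* stand-in for the short page header size */

typedef struct DemoInsert
{
	uint64_t	CurrPos;		/* next free byte in the WAL stream */
	uint64_t	PrevRecord;		/* start of the last reserved record */
} DemoInsert;

static uint64_t
demo_reserve(DemoInsert *ins, uint32_t size, uint64_t *prev)
{
	uint64_t	start = ins->CurrPos;
	uint64_t	end = start;
	uint32_t	left = size;
	uint32_t	freespace = DEMO_BLCKSZ - (uint32_t) (end % DEMO_BLCKSZ);

	while (left > freespace)
	{
		/* fill this page, then skip the next page's header */
		left -= freespace;
		end += freespace + DEMO_PHDRSZ;
		freespace = DEMO_BLCKSZ - (uint32_t) (end % DEMO_BLCKSZ);
	}
	end += left;

	*prev = ins->PrevRecord;	/* becomes the new record's prev-link */
	ins->PrevRecord = start;
	ins->CurrPos = end;			/* the data copy itself happens later,
								 * outside the lock */
	return start;
}
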
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
! 	}
! }
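
The wakeup half of this protocol follows the same pattern as lwlock.c:
detach the whole wait list while holding the spinlock, then clear each
waiter's flag and post its semaphore outside the lock. A rough
POSIX-semaphore model of just the handoff (illustrative types in place
of PGPROC and PGSemaphore):

#include <semaphore.h>
#include <stddef.h>

typedef struct DemoWaiter
{
	sem_t		sem;
	struct DemoWaiter *next;
	volatile int waiting;		/* cleared by the waker before posting */
} DemoWaiter;

/*
 * Called after the slot's spinlock has been released; 'head' is the
 * wait list that was detached while holding it.  Clearing 'waiting'
 * before the post lets a waiter tell a real wakeup apart from a
 * spurious semaphore increment.
 */
static void
demo_wake_all(DemoWaiter *head)
{
	while (head != NULL)
	{
		DemoWaiter *next = head->next;

		head->next = NULL;
		head->waiting = 0;
		sem_post(&head->sem);
		head = next;
	}
}
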
! 
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
! 	 * AdvanceXLInsertBuffer) and retry if we see anything else than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that appears to be ahead of the page we're looking
! 	 * for. So don't PANIC on that, until we've verified the value while
! 	 * holding the lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
  
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
! 		}
  	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
+ 
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto, which indicates the oldest in-progress insertion
+  * that we saw in the array, or CurrPos if there are no insertions in-progress
+  * at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			MyProc->lwWaiting = true;
! 			MyProc->lwExclusive = false;
! 			MyProc->lwWaitLink = NULL;
! 			if (slot->head == NULL)
! 				slot->head = MyProc;
! 			else
! 				slot->tail->lwWaitLink = MyProc;
! 			slot->tail = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			LWLockRelease(WALInsertTailLock);
! 			for (;;)
! 			{
! 				PGSemaphoreLock(&MyProc->sem, false);
! 				if (!MyProc->lwWaiting)
! 					break;
! 				extraWaits++;
! 			}
! 			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 			lastslot = Insert->lastslot;
! 			nextslot = Insert->nextslot;
  		}
  	}
  
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
  
! 	while (extraWaits-- > 0)
! 		PGSemaphoreUnlock(&MyProc->sem);
! 
! 	return LastPos;
! }
! 
! /*
!  * Wait for the next insertion slot to become vacant.
!  */
! static void
! WaitForXLogInsertionSlotToBecomeFree(void)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 	int			extraWaits = 0;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	while (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		if (XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			SpinLockRelease(&slot->lck);
! 			break;
! 		}
! 
! 		/* wait */
! 		MyProc->lwWaiting = true;
! 		MyProc->lwExclusive = false;
! 		MyProc->lwWaitLink = NULL;
! 		if (slot->head == NULL)
! 			slot->head = MyProc;
! 		else
! 			slot->tail->lwWaitLink = MyProc;
! 		slot->tail = MyProc;
! 		SpinLockRelease(&slot->lck);
! 		LWLockRelease(WALInsertTailLock);
! 		for (;;)
! 		{
! 			PGSemaphoreLock(&MyProc->sem, false);
! 			if (!MyProc->lwWaiting)
! 				break;
! 			extraWaits++;
! 		}
! 		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 		lastslot = Insert->lastslot;
! 		nextslot = Insert->nextslot;
  	}
  
! 	/*
! 	 * Ok, there is at least one empty slot now. That's enough for our
! 	 * insertion, but while we're at it, advance lastslot as much as we
! 	 * can. That way we don't need to come back here on the next call.
! 	 */
! 	while (lastslot != nextslot)
! 	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		/*
! 		 * Don't need to grab the slot's spinlock here, because we're not
! 		 * interested in the exact value of CurrPos, only whether it's
! 		 * valid or not.
! 		 */
! 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
! 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertTailLock);
  }
  
  /*
***************
*** 1436,1470 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1983,2016 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1472,1483 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2018,2034 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1485,1529 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
! 		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2036,2085 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1531,1544 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2087,2093 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1548,1560 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
! 
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2097,2106 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1598,1608 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2144,2171 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, upto %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
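
The barrier-then-publish step at the bottom of the loop is what makes
the lock-free xlblocks probe in GetXLogBuffer() safe: a page's contents
must become visible before its new end pointer does. In C11-atomics
terms (an analogy only; the patch uses PostgreSQL's own barrier
primitives) it is a release store paired with an acquire load:

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192

typedef struct DemoBuf
{
	char		page[DEMO_BLCKSZ];
	_Atomic uint64_t end_lsn;	/* 0 means "not initialized yet" */
} DemoBuf;

static void
demo_publish_page(DemoBuf *buf, uint64_t new_end_lsn)
{
	memset(buf->page, 0, DEMO_BLCKSZ);	/* initialize contents first */
	/* release store: prior writes become visible before end_lsn does */
	atomic_store_explicit(&buf->end_lsn, new_end_lsn, memory_order_release);
}

static char *
demo_try_get_page(DemoBuf *buf, uint64_t lsn)
{
	/* acquire load: pairs with the release store above */
	uint64_t	end = atomic_load_explicit(&buf->end_lsn,
										   memory_order_acquire);

	if (lsn < end && lsn >= end - DEMO_BLCKSZ)
		return buf->page + lsn % DEMO_BLCKSZ;
	return NULL;				/* caller must (re)initialize the page */
}
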
  
  /*
***************
*** 1647,1662 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2210,2221 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. The caller must also have called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock, to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1674,1680 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2233,2239 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1705,1718 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2264,2277 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1809,1824 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2368,2380 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1908,1915 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2464,2469 ----
***************
*** 2081,2114 **** XLogFlush(XLogRecPtr record)
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
--- 2635,2683 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
+ 		/* try to write/flush later additions to XLOG as well */
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
+ 
+ 		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)		/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
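
On the XXX above: on a platform with atomic 64-bit loads, keeping the
insert position in a single 64-bit word would indeed let readers skip
the spinlock entirely. A hypothetical sketch of that lock-free read, in
C11 atomics (not part of the patch):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t demo_insert_pos;	/* hypothetical one-word CurrPos */

static uint64_t
demo_read_insert_pos(void)
{
	/* a plain atomic load; no spinlock acquire/release round trip */
	return atomic_load_explicit(&demo_insert_pos, memory_order_relaxed);
}
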
***************
*** 2218,2240 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2787,2817 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5028,5033 **** XLOGShmemSize(void)
--- 5605,5613 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5043,5048 **** XLOGShmemInit(void)
--- 5623,5629 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5068,5073 **** XLOGShmemInit(void)
--- 5649,5667 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
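
The shared-memory bookkeeping follows the usual pattern: XLOGShmemSize()
and XLOGShmemInit() must account for the same regions, in the same
order, out of one allocation. A toy model of that carving with
illustrative types:

#include <stddef.h>
#include <stdint.h>

#define DEMO_NSLOTS 64			/* stands in for NumXLogInsertSlots */

typedef struct DemoSlot
{
	uint64_t	CurrPos;
} DemoSlot;

/* Both functions must walk the regions in the same order. */
static size_t
demo_shmem_size(size_t nbuffers, size_t blcksz)
{
	size_t		size = 0;

	size += sizeof(uint64_t) * nbuffers;	/* xlblocks array */
	size += sizeof(DemoSlot) * DEMO_NSLOTS; /* insertion slots */
	size += blcksz * nbuffers;				/* the WAL buffers themselves */
	return size;
}

static void
demo_shmem_init(char *allocptr, size_t nbuffers, size_t blcksz,
				uint64_t **xlblocks, DemoSlot **slots, char **pages)
{
	*xlblocks = (uint64_t *) allocptr;
	allocptr += sizeof(uint64_t) * nbuffers;

	*slots = (DemoSlot *) allocptr;
	allocptr += sizeof(DemoSlot) * DEMO_NSLOTS;

	*pages = allocptr;
	(void) blcksz;				/* buffer region runs to the end */
}
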
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5082,5092 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5676,5687 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 5961,5966 **** StartupXLOG(void)
--- 6556,6562 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6717,6724 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7313,7324 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6726,6751 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7326,7348 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6757,6763 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7354,7360 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7251,7257 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7848,7854 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7527,7532 **** CreateCheckPoint(int flags)
--- 8124,8130 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7595,7604 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8193,8202 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7610,7616 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8208,8214 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7619,7633 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8217,8228 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7653,7666 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8248,8257 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7686,7692 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8277,8283 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7706,7712 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8297,8303 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8073,8087 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8664,8678 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
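
Note the nesting in the hunk above: insertpos_lck is taken before info_lck and released in the opposite order. As long as every code path that needs both locks uses the same acquisition order, the two spinlocks cannot deadlock against each other. A minimal standalone illustration of that discipline, with pthread spinlocks standing in for slock_t (everything here is illustrative, not PostgreSQL code):

/* Fixed lock-ordering illustration; not PostgreSQL code. */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t insertpos_lck;	/* always taken first */
static pthread_spinlock_t info_lck;		/* always taken second */
static unsigned redo_ptr;			/* guarded by both locks */

static void
update_redo(unsigned new_redo)
{
	pthread_spin_lock(&insertpos_lck);
	pthread_spin_lock(&info_lck);
	redo_ptr = new_redo;
	pthread_spin_unlock(&info_lck);		/* release in reverse order */
	pthread_spin_unlock(&insertpos_lck);
}

int main(void)
{
	pthread_spin_init(&insertpos_lck, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&info_lck, PTHREAD_PROCESS_PRIVATE);
	update_redo(42);
	printf("redo = %u\n", redo_ptr);
	return 0;
}
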
***************
*** 8274,8279 **** RequestXLogSwitch(void)
--- 8865,8871 ----
  {
  	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
+ 	XLogwrtRqst FlushRqst;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
  	rdata.buffer = InvalidBuffer;
***************
*** 8283,8288 **** RequestXLogSwitch(void)
--- 8875,8901 ----
  
  	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
+ 	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ 	 * if the xlog switch had no work to do, ie. if we were already at the
+ 	 * beginning of a new XLOG segment. You can check whether RecPtr points
+ 	 * to the beginning of a segment if you want to keep the distinction.
+ 	 */
+ 	TRACE_POSTGRESQL_XLOG_SWITCH();
+ 
+ 	/*
+ 	 * Flush through the end of the page containing XLOG_SWITCH, and
+ 	 * perform end-of-segment actions (eg, notifying archiver).
+ 	 */
+ 	WaitXLogInsertionsToFinish(RecPtr, InvalidXLogRecPtr);
+ 
+ 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ 	FlushRqst.Write = RecPtr;
+ 	FlushRqst.Flush = RecPtr;
+ 	START_CRIT_SECTION();
+ 	XLogWrite(FlushRqst, false);
+ 	END_CRIT_SECTION();
+ 	LWLockRelease(WALWriteLock);
+ 
  	return RecPtr;
  }
  
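To illustrate the ordering rule this hunk relies on - WaitXLogInsertionsToFinish() must run before WALWriteLock is taken, or an inserter that itself needs the write lock to evict a dirty buffer could deadlock with us - here is a minimal standalone C sketch. All types and functions below are illustrative stubs, not the patch's code:

/* Standalone sketch of "wait, then lock, then write"; stubs only. */
#include <stdio.h>

typedef struct { unsigned xlogid, xrecoff; } XLogRecPtr;
typedef struct { XLogRecPtr Write, Flush; } XLogwrtRqst;

static void WaitInsertionsToFinish(XLogRecPtr upto)
{
	/* the real function blocks until insertions before 'upto' are done */
	printf("wait for insertions up to %X/%X\n", upto.xlogid, upto.xrecoff);
}
static void AcquireWriteLock(void) { printf("take WALWriteLock\n"); }
static void ReleaseWriteLock(void) { printf("release WALWriteLock\n"); }
static void WriteAndFlush(XLogwrtRqst rqst)
{
	printf("write+flush through %X/%X\n",
		   rqst.Flush.xlogid, rqst.Flush.xrecoff);
}

static void FlushThrough(XLogRecPtr recptr)
{
	XLogwrtRqst rqst;

	WaitInsertionsToFinish(recptr);		/* must happen before the lock */
	AcquireWriteLock();
	rqst.Write = rqst.Flush = recptr;
	WriteAndFlush(rqst);
	ReleaseWriteLock();
}

int main(void)
{
	XLogRecPtr p = { 1, 0x2000 };

	FlushThrough(p);
	return 0;
}
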
***************
*** 8836,8841 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9449,9455 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
***************
*** 8885,8910 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9499,9524 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 8966,8978 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9580,9592 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9054,9063 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9668,9678 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9074,9080 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9689,9695 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9087,9092 **** pg_start_backup_callback(int code, Datum arg)
--- 9702,9708 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
***************
*** 9128,9136 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 9744,9752 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9139,9154 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 9755,9770 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9350,9365 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9966,9983 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9413,9424 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10031,10042 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
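
The hunk above is the read-side pattern this patch uses throughout: copy a multi-word shared value out while holding the spinlock, accessing it through a volatile-qualified pointer so the compiler cannot hoist the reads out of the locked region. A rough standalone analogue, with a pthread spinlock standing in for slock_t (names and stubs are illustrative only):

/* Illustrative only: pthread_spinlock_t stands in for slock_t. */
#include <pthread.h>
#include <stdio.h>

typedef struct { unsigned xlogid, xrecoff; } XLogRecPtr;

typedef struct
{
	pthread_spinlock_t insertpos_lck;
	XLogRecPtr	CurrPos;		/* protected by insertpos_lck */
} InsertState;

static InsertState Shared;

/* Copy the current insert position out under the spinlock.  The volatile
 * qualifier mirrors the patch's "volatile XLogCtlInsert *" usage: it keeps
 * the compiler from moving the CurrPos reads around the lock calls. */
static XLogRecPtr
GetInsertRecPtr(void)
{
	volatile InsertState *ins = &Shared;
	XLogRecPtr	result;

	pthread_spin_lock(&Shared.insertpos_lck);
	result = ins->CurrPos;
	pthread_spin_unlock(&Shared.insertpos_lck);

	return result;
}

int main(void)
{
	XLogRecPtr	p;

	pthread_spin_init(&Shared.insertpos_lck, PTHREAD_PROCESS_PRIVATE);
	Shared.CurrPos.xlogid = 1;
	Shared.CurrPos.xrecoff = 0x1234;

	p = GetInsertRecPtr();
	printf("%X/%X\n", p.xlogid, p.xrecoff);
	return 0;
}
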
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#29Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#28)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here's another version of the patch to make XLogInsert less of a bottleneck
on multi-CPU systems. The basic idea is the same as before, but several bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#29)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here's another version of the patch to make XLogInsert less of a bottleneck
on multi-CPU systems. The basic idea is the same as before, but several bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-6.patchtext/x-diff; name=xloginsert-scale-6.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 282,307 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 283,293 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 311,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 297,311 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 326,331 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 317,391 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and grab the next
+  *    insertion slot to advertise that the insertion is in progress. The
+  *    current head of reserved space is kept in Insert->CurrPos, and is
+  *    protected by insertpos_lck. Keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the oldest slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree() function,
+  * which is similar to WaitXLogInsertionsToFinish(), but instead of waiting
+  * for all insertions up to a given point to finish, it just waits for the
+  * inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
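A heavily simplified standalone sketch of the two-step protocol described above: the reservation advances a head position under a spinlock, and the copy then runs outside the lock so multiple backends can copy concurrently. Insertion slots, page headers, and buffer eviction are all omitted; every name below is illustrative, not the patch's code:

/* Simplified reserve-then-copy sketch; not the patch's actual code. */
#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define WAL_BUF_SIZE	(1024 * 1024)

static pthread_spinlock_t insertpos_lck;
static size_t reserved_head;		/* protected by insertpos_lck */
static char wal_buf[WAL_BUF_SIZE];	/* stand-in for the WAL buffers */

/* Step 1: serialize only the space reservation. */
static size_t
reserve_space(size_t len)
{
	size_t		start;

	pthread_spin_lock(&insertpos_lck);
	start = reserved_head;
	reserved_head += len;		/* the real code also aligns, skips
					 * page headers, and claims a slot */
	pthread_spin_unlock(&insertpos_lck);
	return start;
}

/* Steps 1+2: reserve, then copy outside the lock, in parallel with
 * other inserters writing to their own reserved ranges. */
static size_t
insert_record(const char *data, size_t len)
{
	size_t		start = reserve_space(len);

	memcpy(wal_buf + start, data, len);	/* no lock held here */
	return start + len;			/* end position, like XLogInsert */
}

int main(void)
{
	size_t		end;

	pthread_spin_init(&insertpos_lck, PTHREAD_PROCESS_PRIVATE);
	end = insert_record("hello", 5);
	printf("record ends at offset %zu\n", end);
	return 0;
}
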
***************
*** 346,356 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 406,435 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* oldest in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 371,389 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 450,483 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
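
The slot ring's behaviour can be modelled in a few lines. The single-threaded toy below - invented names, with all locking, waiting, and wakeup elided - shows the claim/release discipline and why one slot always stays unused as the full/empty sentinel:

/* Toy model of the insertion-slot ring; names are illustrative. */
#include <assert.h>
#include <stdio.h>

#define NUM_SLOTS 4

typedef struct
{
	int			in_use;
	size_t		curr_pos;	/* position the inserter is writing at */
} Slot;

static Slot slots[NUM_SLOTS];
static int	nextslot;		/* next slot to hand out */
static int	lastslot;		/* oldest slot possibly still in use */

static int next_slot_no(int i) { return (i == NUM_SLOTS - 1) ? 0 : i + 1; }

/* Claim a slot; returns -1 when the ring is full ("caught our tail"),
 * in which case the real code waits for the oldest inserter. */
static int
claim_slot(size_t pos)
{
	int			got;

	if (next_slot_no(nextslot) == lastslot)
		return -1;
	slots[nextslot].in_use = 1;
	slots[nextslot].curr_pos = pos;
	got = nextslot;
	nextslot = next_slot_no(nextslot);
	return got;
}

/* Release a slot and advance lastslot past finished inserters. */
static void
release_slot(int i)
{
	slots[i].in_use = 0;
	while (lastslot != nextslot && !slots[lastslot].in_use)
		lastslot = next_slot_no(lastslot);
}

int main(void)
{
	int a = claim_slot(100), b = claim_slot(200), c = claim_slot(300);

	assert(a >= 0 && b >= 0 && c >= 0);
	assert(claim_slot(400) == -1);	/* ring full: only 3 of 4 usable */
	release_slot(a);
	assert(claim_slot(400) >= 0);	/* tail advanced, room again */
	printf("slot ring ok\n");
	return 0;
}
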
***************
*** 397,405 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 491,509 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there are no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 471,498 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
- /* Free space remaining in the current xlog page buffer */
- #define INSERT_FREESPACE(Insert)  \
- 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 575,605 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
+ 
+ #define NextBufIdx(idx)		\
+ 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
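
The mapping macros above are pure arithmetic and easy to sanity-check in isolation. The constants below - 16MB segments, hence 255 segments per 4GB logid, 8kB pages, 8 buffers - are illustrative assumptions, not necessarily any real build's values:

/* Standalone check of the XLogRecPtrToBufIdx arithmetic. */
#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ		8192u
#define XLogSegSize		(16u * 1024 * 1024)
#define XLogSegsPerFile	(0xffffffffu / XLogSegSize)	/* 255 */
#define NBuffers		8u

typedef struct { uint32_t xlogid, xrecoff; } XLogRecPtr;

/* Same formula as the patch's XLogRecPtrToBufIdx macro, with NBuffers
 * standing in for XLogCtl->XLogCacheBlck + 1. */
static unsigned
buf_idx(XLogRecPtr p)
{
	uint64_t	byteno = (uint64_t) p.xlogid * XLogSegsPerFile * XLogSegSize
						 + p.xrecoff;

	return (unsigned) ((byteno / XLOG_BLCKSZ) % NBuffers);
}

int main(void)
{
	XLogRecPtr a = { 0, 0 };			/* very first page */
	XLogRecPtr b = { 0, XLOG_BLCKSZ };		/* next page */
	XLogRecPtr c = { 0, NBuffers * XLOG_BLCKSZ };	/* wraps to buffer 0 */

	printf("%u %u %u\n", buf_idx(a), buf_idx(b), buf_idx(c));	/* 0 1 0 */
	return 0;
}
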
***************
*** 618,626 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 725,733 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 667,672 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 774,797 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len,
+ 				  bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool didPageWrites);
+ static bool ReserveXLogInsertLocation(int size, bool didPageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 687,698 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 812,818 ----
***************
*** 706,715 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 826,835 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 896,1029 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
! 	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
! 		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
! 		}
! 	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
- 	/*
- 	 * If there isn't enough space on the current XLOG page for a record
- 	 * header, advance to the next page (leaving the unused space as zeroes).
- 	 */
- 	updrqst = false;
- 	freespace = INSERT_FREESPACE(Insert);
- 	if (freespace < SizeOfXLogRecord)
- 	{
- 		updrqst = AdvanceXLInsertBuffer(false);
- 		freespace = INSERT_FREESPACE(Insert);
- 	}
- 
- 	/* Compute record's XLOG location */
- 	curridx = Insert->curridx;
- 	INSERT_RECPTR(RecPtr, Insert, curridx);
- 
- 	/*
- 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
- 	 * segment, we need not insert it (and don't want to because we'd like
- 	 * consecutive switch requests to be no-ops).  Instead, make sure
- 	 * everything is written and flushed through the end of the prior segment,
- 	 * and return the prior segment's end address.
- 	 */
- 	if (isLogSwitch &&
- 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
- 	{
- 		/* We can release insert lock immediately */
- 		LWLockRelease(WALInsertLock);
- 
- 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
- 		if (RecPtr.xrecoff == 0)
- 		{
- 			/* crossing a logid boundary */
- 			RecPtr.xlogid -= 1;
- 			RecPtr.xrecoff = XLogFileSize;
- 		}
- 
- 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
- 		LogwrtResult = XLogCtl->Write.LogwrtResult;
- 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
- 		{
- 			XLogwrtRqst FlushRqst;
- 
- 			FlushRqst.Write = RecPtr;
- 			FlushRqst.Flush = RecPtr;
- 			XLogWrite(FlushRqst, false, false);
- 		}
- 		LWLockRelease(WALWriteLock);
- 
- 		END_CRIT_SECTION();
- 
- 		return RecPtr;
- 	}
- 
- 	/* Insert record header */
- 
- 	record = (XLogRecord *) Insert->currpos;
- 	record->xl_prev = Insert->PrevRecord;
- 	record->xl_xid = GetCurrentTransactionIdIfAny();
- 	record->xl_tot_len = SizeOfXLogRecord + write_len;
- 	record->xl_len = len;		/* doesn't include backup blocks */
- 	record->xl_info = info;
- 	record->xl_rmid = rmid;
- 
- 	/* Now we can finish computing the record's CRC */
- 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
- 			   SizeOfXLogRecord - sizeof(pg_crc32));
- 	FIN_CRC32(rdata_crc);
- 	record->xl_crc = rdata_crc;
- 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
--- 1016,1055 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * must be covered by the CRC and we don't know it yet. We will finish
! 	 * computing the CRC once the prev-link is known.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set in PerformXLogInsert() */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, isLogSwitch, &rechdr,
! 							   rdata, rdata_crc, doPageWrites);
! 	END_CRIT_SECTION();
! 
! 	if (XLogRecPtrIsInvalid(RecPtr))
  	{
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
***************
*** 1032,1215 **** begin:;
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
  
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1058,1762 ----
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
! 	 */
! 	return RecPtr;
! }
  
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here,
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns pointer to end of inserted record like XLogInsert().
!  * If RedoRecPtr or forcePageWrites had changed, returns InvalidRecPtr, and
!  * the caller must recalculate full-page-images and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool didPageWrites)
! {
! 	volatile XLogInsertSlot *myslot = NULL;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
! 	bool		updrqst;
! 
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, didPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
! 	{
! 		return EndPos;
! 	}
  
  	/*
! 	 * Got it! Now that we know the prev-link, we can finish computing the
! 	 * record's CRC.
  	 */
! 	rechdr->xl_prev = PrevRecord;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
  
! 	/* Copy the record header in place */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				CurrPos.xrecoff += freespace;
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Update CurrPos with that before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/* Now skip page header */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
+ 
+ 		/*
+ 		 * An xlog-switch record consumes all the remaining space on the
+ 		 * WAL segment. We have already reserved it for us, but we still need
+ 		 * to make sure it's been allocated and zeroed in the WAL buffers so
+ 		 * that when the caller (or someone else) does XLogWrite(), it can
+ 		 * really write out all the zeros.
+ 		 *
+ 		 * We do this one page at a time, to make sure we don't deadlock
+ 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
+ 		 */
+ 		while (XLByteLT(CurrPos, EndPos))
+ 		{
+ 			/* use up all the remaining space in this page */
+ 			freespace = INSERT_FREESPACE(CurrPos);
+ 			XLByteAdvance(CurrPos, freespace);
+ 			/*
+ 			 * like in the non-xlog-switch codepath, let others know that
+ 			 * we're done writing up to the end of this page
+ 			 */
+ 			UpdateSlotCurrPos(myslot, CurrPos);
+ 			/*
+ 			 * let GetXLogBuffer initialize next page if necessary.
+ 			 */
+ 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
+ 			(void) GetXLogBuffer(CurrPos);
+ 		}
  
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which
! 		 * is reflected in EndPos, we need to return a value that points just
! 		 * to the end of the xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
! 	return EndPos;
! }
  
+ /*
+  * Reserves the right amount of space for a record of given size from the WAL.
+  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
+  * its end, and *Prev_record_p points to the beginning of the previous record
+  * to set to the prev-link of the record header.
+  *
+  * A log-switch record is handled slightly differently. The rest of the
+  * segment will be reserved for this insertion, as indicated by the returned
+  * *EndPos_p value. However, if we are already at the beginning of the current
+  * segment, the *EndPos_p is set to the current location without reserving
+  * any space, and the function returns false.
+  *
+  * *updrqst_p is set to true if this record ends on a different page than
+  * the previous one - the caller should update the shared LogwrtRqst value
+  * after it's done inserting the record in that case, so that the WAL page
+  * that filled up gets written out at the next convenient moment.
+  *
+  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
+  * (or the end of the previous record, to be exact) to let others know we're
+  * busy inserting to the reserved area. The caller must clear it when the
+  * insertion is finished.
+  *
+  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
+  * changed. On failure, the shared state is not modified.
+  *
+  * This is the performance-critical part of XLogInsert that must be
+  * serialized across backends. The rest can happen mostly in parallel.
+  *
+  * NB: The space calculation here must match the code in PerformXLogInsert,
+  * where we actually copy the record to the reserved space.
+  */
+ static bool
+ ReserveXLogInsertLocation(int size, bool didPageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
+ {
+ 	volatile XLogInsertSlot *myslot;
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			freespace;
+ 	XLogRecPtr	ptr;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	LastEndPos;
+ 	int32		nextslot;
+ 	int32		lastslot;
+ 	bool		updrqst = false;
+ 
+ retry:
+ 	SpinLockAcquire(&Insert->insertpos_lck);
+ 
+ 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
+ 		(!didPageWrites && Insert->forcePageWrites))
+ 	{
  		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
  		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
! 
! 	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
! 	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
  
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
  
+ 	/*
+ 	 * If there isn't enough space on the current XLOG page for a record
+ 	 * header, advance to the next page (leaving the unused space as zeroes).
+ 	 */
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	if (freespace < SizeOfXLogRecord)
+ 	{
+ 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
+ 		freespace = INSERT_FREESPACE(ptr);
+ 		updrqst = true;
+ 	}
+ 
+ 	/*
+ 	 * We are now at the starting position of our record. Next, figure out
+ 	 * how the data will be split across the WAL pages, to calculate where
+ 	 * the record ends.
+ 	 */
+ 	StartPos = ptr;
+ 
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
+ 		Assert(size == SizeOfXLogRecord);
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
! 
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
  		}
+ 		else
+ 		{
+ 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
+ 			{
+ 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
+ 				XLByteAdvance(ptr, segleft);
+ 			}
+ 			updrqst = true;
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* A normal record, ie. not xlog-switch */
+ 		int sizeleft = size;
+ 		while (freespace < sizeleft)
+ 		{
+ 			/* fill this page, and continue on next page */
+ 			sizeleft -= freespace;
+ 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
  
! 			updrqst = true;
! 		}
! 		/* the rest fits on this page */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
! 
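/*
 * A minimal sketch of the intended calling pattern, assuming a hypothetical
 * CopyRecordToReservedSpace() helper standing in for the copy step (which
 * the patch performs inside XLogInsert itself); not a verbatim excerpt:
 *
 *	retry:
 *		if (!ReserveXLogInsertLocation(size, doPageWrites, isLogSwitch,
 *					&prev, &start, &end, &slot, &updrqst))
 *		{
 *			if (XLogRecPtrIsInvalid(end))
 *				goto retry;	-- RedoRecPtr/forcePageWrites changed
 *			return end;	-- no-op xlog switch
 *		}
 *		CopyRecordToReservedSpace(start, end, prev);	-- parallel part
 *		UpdateSlotCurrPos(slot, InvalidXLogRecPtr);	-- release our slot
 */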
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
! 	}
! }
! 
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
! 	 * AdvanceXLInsertBuffer) and retry if we see anything else than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that appears to be ahead of the page we're looking
! 	 * for. So don't PANIC on that, until we've verified the value while
! 	 * holding the lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
  
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
! 		}
  	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
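/*
 * A rough sketch of the direct page-to-buffer mapping assumed above,
 * ignoring the xlogid component for brevity (the real XLogRecPtrToBufIdx
 * macro must fold that in as well):
 *
 *	idx = (ptr.xrecoff / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1);
 *
 * Since every WAL page has exactly one buffer it can occupy, locating it
 * requires no search and no mapping table.
 */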
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
+ 
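/*
 * Worked example with illustrative numbers, assuming XLOG_BLCKSZ = 8192:
 * a pointer at xrecoff 16380 has 4 bytes of free space left on its page,
 * so AdvanceXLogRecPtrToNextPage() advances it to 16384 and then past the
 * (short) page header, to the first byte where a record can go.
 */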
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto, indicating the oldest in-progress insertion
+  * that we saw in the slot array, or CurrPos if there are no insertions
+  * in progress at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			MyProc->lwWaiting = true;
! 			MyProc->lwExclusive = false;
! 			MyProc->lwWaitLink = NULL;
! 			if (slot->head == NULL)
! 				slot->head = MyProc;
! 			else
! 				slot->tail->lwWaitLink = MyProc;
! 			slot->tail = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			LWLockRelease(WALInsertTailLock);
! 			for (;;)
! 			{
! 				PGSemaphoreLock(&MyProc->sem, false);
! 				if (!MyProc->lwWaiting)
! 					break;
! 				extraWaits++;
! 			}
! 			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 			lastslot = Insert->lastslot;
! 			nextslot = Insert->nextslot;
  		}
  	}
  
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
  
! 	while (extraWaits-- > 0)
! 		PGSemaphoreUnlock(&MyProc->sem);
! 
! 	return LastPos;
! }
! 
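/*
 * In sketch form, the pattern that callers (RequestXLogSwitch,
 * XLogBackgroundFlush, etc. below) follow before writing out WAL:
 *
 *	WaitXLogInsertionsToFinish(WriteRqst.Write, InvalidXLogRecPtr);
 *	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 *	XLogWrite(WriteRqst, false);
 *	LWLockRelease(WALWriteLock);
 */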
! /*
!  * Wait for the next insertion slot to become vacant.
!  */
! static void
! WaitForXLogInsertionSlotToBecomeFree(void)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 	int			extraWaits = 0;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	while (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		if (XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			SpinLockRelease(&slot->lck);
! 			break;
! 		}
! 
! 		/* wait */
! 		MyProc->lwWaiting = true;
! 		MyProc->lwExclusive = false;
! 		MyProc->lwWaitLink = NULL;
! 		if (slot->head == NULL)
! 			slot->head = MyProc;
! 		else
! 			slot->tail->lwWaitLink = MyProc;
! 		slot->tail = MyProc;
! 		SpinLockRelease(&slot->lck);
! 		LWLockRelease(WALInsertTailLock);
! 		for (;;)
! 		{
! 			PGSemaphoreLock(&MyProc->sem, false);
! 			if (!MyProc->lwWaiting)
! 				break;
! 			extraWaits++;
! 		}
! 		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 		lastslot = Insert->lastslot;
! 		nextslot = Insert->nextslot;
  	}
  
! 	/*
! 	 * Ok, there is at least one empty slot now. That's enough for our
! 	 * insertion, but while we're at it, advance lastslot as much as we
! 	 * can. That way we don't need to come back here on the next call.
! 	 */
! 	while (lastslot != nextslot)
! 	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		/*
! 		 * Don't need to grab the slot's spinlock here, because we're not
! 		 * interested in the exact value of CurrPos, only whether it's
! 		 * valid or not.
! 		 */
! 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
! 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertTailLock);
  }
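/*
 * NextSlotNo() is presumably just a wrap-around increment over the slot
 * array, something like (a sketch, assuming NumXLogInsertSlots entries):
 *
 *	#define NextSlotNo(slotno) (((slotno) + 1) % NumXLogInsertSlots)
 *
 * making nextslot/lastslot a standard ring-buffer head and tail: the ring
 * is full when NextSlotNo(nextslot) == lastslot.
 */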
  
  /*
***************
*** 1436,1470 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1983,2016 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic'
!  * is true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with their page
!  * headers initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1472,1483 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2018,2034 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1485,1529 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
! 		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2036,2085 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1531,1544 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2087,2093 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1548,1560 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
! 
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2097,2106 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1598,1608 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2144,2171 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, upto %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
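/*
 * The two calling modes of the reworked AdvanceXLInsertBuffer(), as used
 * elsewhere in this patch:
 *
 *	AdvanceXLInsertBuffer(ptr, false);
 *		initialize pages up to the one containing 'ptr', first writing
 *		out old dirty buffers if necessary (may perform I/O)
 *
 *	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 *		opportunistic: initialize only as many pages as possible
 *		without writing anything out
 */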
  
  /*
***************
*** 1647,1662 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2210,2221 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. You must also have called
!  * WaitXLogInsertionsToFinish(WriteRqst.Write) before grabbing the lock, to
!  * make sure the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1674,1680 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2233,2239 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1705,1718 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2264,2277 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1809,1824 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2368,2380 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1908,1915 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2464,2469 ----
***************
*** 2081,2114 **** XLogFlush(XLogRecPtr record)
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
--- 2635,2683 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
+ 		/* try to write/flush later additions to XLOG as well */
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
+ 
+ 		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)		/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  	}
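/*
 * Regarding the XXX note above: on a platform with atomic 64-bit loads,
 * and with the insert position stored as a single 64-bit integer rather
 * than the two-field XLogRecPtr struct, the spinlocked read could in
 * principle shrink to just (a sketch, not something this patch does):
 *
 *	insertpos = *((volatile uint64 *) &Insert->CurrPos);
 */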
***************
*** 2218,2240 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2787,2817 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5028,5033 **** XLOGShmemSize(void)
--- 5605,5613 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5043,5048 **** XLOGShmemInit(void)
--- 5623,5629 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5068,5073 **** XLOGShmemInit(void)
--- 5649,5667 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5082,5092 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5676,5687 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 5961,5966 **** StartupXLOG(void)
--- 6556,6562 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6717,6724 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7313,7324 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6726,6751 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7326,7348 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6757,6763 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7354,7360 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7251,7257 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7848,7854 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7527,7532 **** CreateCheckPoint(int flags)
--- 8124,8130 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7595,7604 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8193,8202 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7610,7616 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8208,8214 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7619,7633 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8217,8228 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7653,7666 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8248,8257 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7686,7692 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8277,8283 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7706,7712 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8297,8303 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8073,8087 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8664,8678 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8274,8279 **** RequestXLogSwitch(void)
--- 8865,8871 ----
  {
  	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
+ 	XLogwrtRqst FlushRqst;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
  	rdata.buffer = InvalidBuffer;
***************
*** 8283,8288 **** RequestXLogSwitch(void)
--- 8875,8901 ----
  
  	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
+ 	/*
+ 	 * XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ 	 * if the xlog switch had no work to do, ie. if we were already at the
+ 	 * beginning of a new XLOG segment. You can check whether RecPtr points
+ 	 * to the beginning of a segment if you want to keep the distinction.
+ 	 */
+ 	TRACE_POSTGRESQL_XLOG_SWITCH();
+ 
+ 	/*
+ 	 * Flush through the end of the page containing XLOG_SWITCH, and
+ 	 * perform end-of-segment actions (eg, notifying archiver).
+ 	 */
+ 	WaitXLogInsertionsToFinish(RecPtr, InvalidXLogRecPtr);
+ 
+ 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ 	FlushRqst.Write = RecPtr;
+ 	FlushRqst.Flush = RecPtr;
+ 	START_CRIT_SECTION();
+ 	XLogWrite(FlushRqst, false);
+ 	END_CRIT_SECTION();
+ 	LWLockRelease(WALWriteLock);
+ 
  	return RecPtr;
  }
  
***************
*** 8836,8841 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9449,9455 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
***************
*** 8885,8910 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9499,9524 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 8966,8978 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9580,9592 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9054,9063 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9668,9678 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9074,9080 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9689,9695 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9087,9092 **** pg_start_backup_callback(int code, Datum arg)
--- 9702,9708 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
***************
*** 9128,9136 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 9744,9752 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9139,9154 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 9755,9770 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9350,9365 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9966,9983 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9413,9424 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10031,10042 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#30)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Fri, Jan 20, 2012 at 2:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

I put myself down as reviewer for this. I'm planning to review this
early next week, once I've finished Fujii-san's patches.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#32Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#30)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

The patch seems to need a rebase again.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#32)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 31.01.2012 17:35, Fujii Masao wrote:

On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

The patch seems to need a rebase again.

Here you go again. It conflicted with the group commit patch, and with the
patch to WAL-log and track changes to the full_page_writes setting.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-7.patchtext/x-diff; name=xloginsert-scale-7.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 290,315 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 291,301 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
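
To make the single-copy rule concrete, here is a minimal standalone sketch
(my own illustration, not code from the patch), with pthread primitives
standing in for info_lck and WALWriteLock; the helper names are invented:

    #include <pthread.h>

    typedef struct { unsigned long Write, Flush; } XLogwrtResult;

    static pthread_spinlock_t info_lck;          /* stand-in for info_lck */
    static pthread_mutex_t walwrite_lock = PTHREAD_MUTEX_INITIALIZER;
                                                 /* stand-in for WALWriteLock */
    static XLogwrtResult SharedLogwrtResult;     /* the one shared copy */
    static __thread XLogwrtResult LogwrtResult;  /* unshared per-backend copy */

    /* Readers may hold either lock; the spinlock is the cheap path. */
    static void
    RefreshLogwrtResult(void)
    {
        pthread_spin_lock(&info_lck);
        LogwrtResult = SharedLogwrtResult;       /* update the local cache */
        pthread_spin_unlock(&info_lck);
    }

    /* Writers must hold *both* locks, so that a reader holding either
     * one always sees a consistent value. */
    static void
    PublishLogwrtResult(XLogwrtResult newval)
    {
        pthread_mutex_lock(&walwrite_lock);
        pthread_spin_lock(&info_lck);
        SharedLogwrtResult = newval;
        pthread_spin_unlock(&info_lck);
        pthread_mutex_unlock(&walwrite_lock);
    }

    int
    main(void)
    {
        XLogwrtResult v = { 100, 50 };

        pthread_spin_init(&info_lck, PTHREAD_PROCESS_PRIVATE);
        PublishLogwrtResult(v);
        RefreshLogwrtResult();
        return (LogwrtResult.Write == 100) ? 0 : 1;
    }
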
***************
*** 319,328 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 305,319 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
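
The WALBufMappingLock protocol above (release while writing, reacquire and
re-check) is the subtle part. Here is a compile-checked skeleton of that
control flow, with no-op stubs standing in for the real locking and I/O;
all names are hypothetical, not the patch's actual routines:

    #include <stdbool.h>

    /* No-op stand-ins for the real locks and I/O. */
    static void AcquireMappingLock(void) {}
    static void ReleaseMappingLock(void) {}
    static void AcquireWriteLock(void) {}
    static void ReleaseWriteLock(void) {}
    static bool VictimStillDirty(void) { return false; }
    static void WriteOutVictim(void) {}
    static void InstallNewPage(void) {}

    static void
    ReplaceWalBuffer(void)
    {
        AcquireMappingLock();
        while (VictimStillDirty())
        {
            /* Drop the mapping lock for the duration of the write, then
             * reacquire it and re-check: someone else may have written
             * the buffer, or even replaced it, in the meantime. */
            ReleaseMappingLock();
            AcquireWriteLock();
            WriteOutVictim();
            ReleaseWriteLock();
            AcquireMappingLock();
        }
        InstallNewPage();       /* mapping changes only under the lock */
        ReleaseMappingLock();
    }

    int
    main(void)
    {
        ReplaceWalBuffer();
        return 0;
    }
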
***************
*** 334,339 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 325,399 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree() function,
+  * which is similar to WaitXLogInsertionsToFinish(), but instead of waiting
+  * for all insertions up to a given point to finish, it just waits for the
+  * inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
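
The two-step scheme is simple enough to prototype outside the server. A
self-contained toy of my own (not patch code): step 1 reserves a byte range
under a spinlock, step 2 copies the payload into the reserved range with no
lock held, which is what allows step 2 to run in parallel:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define LOGSIZE 4096

    static char walbuf[LOGSIZE];
    static size_t insertpos;                 /* head of reserved space */
    static pthread_spinlock_t insertpos_lck; /* stand-in for the real one */

    /* Step 1: only the space reservation is serialized. */
    static size_t
    reserve(size_t len)
    {
        size_t start;

        pthread_spin_lock(&insertpos_lck);
        start = insertpos;
        insertpos += len;
        pthread_spin_unlock(&insertpos_lck);
        return start;
    }

    /* Step 2: copy into the reserved range with no lock held. Concurrent
     * callers never overlap, because step 1 hands out disjoint ranges. */
    static void
    insert(const char *data, size_t len)
    {
        size_t start = reserve(len);

        memcpy(walbuf + start, data, len);
    }

    int
    main(void)
    {
        pthread_spin_init(&insertpos_lck, PTHREAD_PROCESS_PRIVATE);
        insert("rec1", 4);
        insert("rec2", 4);
        printf("%.8s\n", walbuf);            /* prints rec1rec2 */
        return 0;
    }
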
  
***************
*** 354,364 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 414,443 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 388,406 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 467,500 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
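
A deliberately tiny standalone toy (mine, not patch code) of the ring
arithmetic described earlier, with an 8-entry ring standing in for the
1000 slots; it shows the wraparound and why one slot is sacrificed so a
full ring can be told apart from an empty one:

    #include <stdio.h>

    #define NUM_SLOTS 8     /* tiny stand-in for NumXLogInsertSlots */

    /* Wrap at the last valid index, like NextBufIdx() does for buffers. */
    static int
    next_slot(int idx)
    {
        return (idx == NUM_SLOTS - 1) ? 0 : idx + 1;
    }

    int
    main(void)
    {
        int lastslot = 0;   /* oldest slot possibly still in use */
        int nextslot = 0;   /* next slot to hand out */
        int handed_out = 0;

        /* Reserve until the ring would "catch its tail". */
        while (next_slot(nextslot) != lastslot)
        {
            nextslot = next_slot(nextslot);
            handed_out++;
        }
        printf("usable before waiting: %d of %d\n", handed_out, NUM_SLOTS);
        /* prints 7 of 8: one slot is lost to tell full from empty */
        return 0;
    }
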
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 414,422 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 508,526 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 494,521 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
- /* Free space remaining in the current xlog page buffer */
- #define INSERT_FREESPACE(Insert)  \
- 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 598,628 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
+ 
+ #define NextBufIdx(idx)		\
+ 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
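
A standalone mirror (my own sketch) of this arithmetic, assuming the default
geometry of that era: 8 kB pages, 16 MB segments, 255 segments per 'xlogid'
file, and an example cache of 64 buffers. Writing the expressions as
functions also avoids the multiple-evaluation hazard noted for
INSERT_FREESPACE above:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed geometry, for illustration only. */
    #define XLOG_BLCKSZ        8192
    #define XLOG_SEG_SIZE      (16 * 1024 * 1024)
    #define XLOG_SEGS_PER_FILE (0xffffffffUL / XLOG_SEG_SIZE)   /* = 255 */
    #define CACHE_BUFFERS      64

    /* Flatten an (xlogid, xrecoff) pair into a linear byte position. */
    static uint64_t
    flatten(uint32_t xlogid, uint32_t xrecoff)
    {
        return (uint64_t) xlogid * XLOG_SEGS_PER_FILE * XLOG_SEG_SIZE + xrecoff;
    }

    /* Every page has exactly one buffer it can live in. */
    static int
    buf_idx(uint32_t xlogid, uint32_t xrecoff)
    {
        return (int) ((flatten(xlogid, xrecoff) / XLOG_BLCKSZ) % CACHE_BUFFERS);
    }

    /* Space left on the page after 'xrecoff'; a page boundary belongs
     * to the previous page, hence the special case for 0. */
    static int
    freespace(uint32_t xrecoff)
    {
        return (xrecoff % XLOG_BLCKSZ == 0) ? 0 : XLOG_BLCKSZ - xrecoff % XLOG_BLCKSZ;
    }

    int
    main(void)
    {
        printf("%d %d\n", buf_idx(0, 0), buf_idx(0, XLOG_BLCKSZ));  /* 0 1 */
        printf("%d\n", freespace(100));          /* 8092 */
        printf("%d\n", freespace(XLOG_BLCKSZ));  /* 0: boundary = prev page */
        return 0;
    }
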
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 641,649 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 748,756 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 690,695 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 797,820 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len,
+ 				  bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool didPageWrites);
+ static bool ReserveXLogInsertLocation(int size, bool didPageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 710,721 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 835,841 ----
***************
*** 729,739 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
! 	bool		isLogSwitch = false;
! 	bool		fpwChange = false;
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 849,858 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
! 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 746,775 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * Handle special cases/records.
  	 */
! 	if (rmid == RM_XLOG_ID)
  	{
- 		switch (info)
- 		{
- 			case XLOG_SWITCH:
- 				isLogSwitch = true;
- 				break;
- 
- 			case XLOG_FPW_CHANGE:
- 				fpwChange = true;
- 				break;
- 
- 			default:
- 				break;
- 		}
- 	}
- 	else if (IsBootstrapProcessingMode())
- 	{
- 		/*
- 		 * In bootstrap mode, we don't actually log anything but XLOG resources;
- 		 * return a phony record pointer.
- 		 */
  		RecPtr.xlogid = 0;
  		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
  		return RecPtr;
--- 865,875 ----
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 	 * return a phony record pointer.
  	 */
! 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
  		RecPtr.xlogid = 0;
  		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
  		return RecPtr;
***************
*** 939,1072 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
! 	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
! 		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
! 		}
! 	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
- 	/*
- 	 * If there isn't enough space on the current XLOG page for a record
- 	 * header, advance to the next page (leaving the unused space as zeroes).
- 	 */
- 	updrqst = false;
- 	freespace = INSERT_FREESPACE(Insert);
- 	if (freespace < SizeOfXLogRecord)
- 	{
- 		updrqst = AdvanceXLInsertBuffer(false);
- 		freespace = INSERT_FREESPACE(Insert);
- 	}
- 
- 	/* Compute record's XLOG location */
- 	curridx = Insert->curridx;
- 	INSERT_RECPTR(RecPtr, Insert, curridx);
- 
- 	/*
- 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
- 	 * segment, we need not insert it (and don't want to because we'd like
- 	 * consecutive switch requests to be no-ops).  Instead, make sure
- 	 * everything is written and flushed through the end of the prior segment,
- 	 * and return the prior segment's end address.
- 	 */
- 	if (isLogSwitch &&
- 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
- 	{
- 		/* We can release insert lock immediately */
- 		LWLockRelease(WALInsertLock);
- 
- 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
- 		if (RecPtr.xrecoff == 0)
- 		{
- 			/* crossing a logid boundary */
- 			RecPtr.xlogid -= 1;
- 			RecPtr.xrecoff = XLogFileSize;
- 		}
- 
- 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
- 		LogwrtResult = XLogCtl->Write.LogwrtResult;
- 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
- 		{
- 			XLogwrtRqst FlushRqst;
- 
- 			FlushRqst.Write = RecPtr;
- 			FlushRqst.Flush = RecPtr;
- 			XLogWrite(FlushRqst, false, false);
- 		}
- 		LWLockRelease(WALWriteLock);
- 
- 		END_CRIT_SECTION();
- 
- 		return RecPtr;
- 	}
- 
- 	/* Insert record header */
- 
- 	record = (XLogRecord *) Insert->currpos;
- 	record->xl_prev = Insert->PrevRecord;
- 	record->xl_xid = GetCurrentTransactionIdIfAny();
- 	record->xl_tot_len = SizeOfXLogRecord + write_len;
- 	record->xl_len = len;		/* doesn't include backup blocks */
- 	record->xl_info = info;
- 	record->xl_rmid = rmid;
- 
- 	/* Now we can finish computing the record's CRC */
- 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
- 			   SizeOfXLogRecord - sizeof(pg_crc32));
- 	FIN_CRC32(rdata_crc);
- 	record->xl_crc = rdata_crc;
- 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
--- 1039,1078 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct the record header. We can't CRC it yet, because the
! 	 * prev-link must be covered by the CRC and we don't know it yet; we
! 	 * finish computing the CRC once the prev-link is known.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set in PerformXLogInsert() */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, isLogSwitch, &rechdr,
! 							   rdata, rdata_crc, doPageWrites);
! 	END_CRIT_SECTION();
! 
! 	if (XLogRecPtrIsInvalid(RecPtr))
  	{
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
***************
*** 1075,1267 **** begin:;
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
  
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
  	/*
! 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
! 	 * in shared memory before releasing WALInsertLock. This ensures that
! 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
! 	 * by this change of full_page_writes.
  	 */
! 	if (fpwChange)
! 		Insert->fullPageWrites = fullPageWrites;
! 
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1081,1785 ----
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
! 	 */
! 	return RecPtr;
! }
! 
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here;
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns a pointer to the end of the inserted record, like
!  * XLogInsert(). If RedoRecPtr or forcePageWrites has changed, returns
!  * InvalidXLogRecPtr, and the caller must recalculate full-page images and
!  * retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool didPageWrites)
! {
! 	volatile XLogInsertSlot *myslot = NULL;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
! 	bool		updrqst;
  
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, didPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
! 	{
! 		return EndPos;
! 	}
  
  	/*
! 	 * Got it! Now that we know the prev-link, we can finish computing the
! 	 * record's CRC.
  	 */
! 	rechdr->xl_prev = PrevRecord;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
  
! 	/* Copy the record header in place */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				CurrPos.xrecoff += freespace;
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Update CurrPos with that before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/* Now skip page header */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
+ 
+ 		/*
+ 		 * An xlog-switch record consumes all the remaining space on the
+ 		 * WAL segment. We have already reserved it for us, but we still need
+ 		 * to make sure it's been allocated and zeroed in the WAL buffers so
+ 		 * that when the caller (or someone else) does XLogWrite(), it can
+ 		 * really write out all the zeros.
+ 		 *
+ 		 * We do this one page at a time, to make sure we don't deadlock
+ 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
+ 		 */
+ 		while (XLByteLT(CurrPos, EndPos))
+ 		{
+ 			/* use up all the remaining space in this page */
+ 			freespace = INSERT_FREESPACE(CurrPos);
+ 			XLByteAdvance(CurrPos, freespace);
+ 			/*
+ 			 * like in the non-xlog-switch codepath, let others know that
+ 			 * we're done writing up to the end of this page
+ 			 */
+ 			UpdateSlotCurrPos(myslot, CurrPos);
+ 			/*
+ 			 * let GetXLogBuffer initialize next page if necessary.
+ 			 */
+ 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
+ 			(void) GetXLogBuffer(CurrPos);
+ 		}
  
! 		/*
! 		 * Even though we reserved the rest of the segment, which is
! 		 * reflected in EndPos, we need to return a value that points to
! 		 * the end of the xlog-switch record itself.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
! 
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
! 
! 	return EndPos;
! }
! 
! /*
!  * Reserves the right amount of space for a record of given size from the WAL.
!  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
!  * its end, and *PrevRecord_p to the beginning of the previous record, for
!  * use as the prev-link in the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, the *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one; the caller should update the shared LogwrtRqst value
!  * after it's done inserting the record in that case, so that the WAL page
!  * that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position,
!  * (or the end of previous record, to be exact) to let others know that we're
!  * busy inserting to the reserved area. The caller must clear it when the
!  * insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be
!  * serialized across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in PerformXLogInsert,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool didPageWrites,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
! {
! 	volatile XLogInsertSlot *myslot;
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	LastEndPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
  
! retry:
! 	SpinLockAcquire(&Insert->insertpos_lck);
  
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		(!didPageWrites && (Insert->forcePageWrites || Insert->fullPageWrites)))
! 	{
! 		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
  
+ 	/*
+ 	 * Reserve the next insertion slot for us.
+ 	 *
+ 	 * First check that the slot is not still in use. Modifications to
+ 	 * lastslot are protected by WALInsertTailLock, but here we assume that
+ 	 * reading an int32 is atomic. Another process might advance lastslot at
+ 	 * the same time, but not past nextslot.
+ 	 */
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 	if (NextSlotNo(nextslot) == lastslot)
+ 	{
  		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
  		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
! 
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
! 	/*
! 	 * We are now at the starting position of our record. Next, figure out how
! 	 * the data will be split across the WAL pages, to calculate where the
! 	 * record ends.
! 	 */
! 	StartPos = ptr;
  
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
+ 		Assert(size == SizeOfXLogRecord);
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
! 
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
! 		}
! 		else
! 		{
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
  		}
+ 	}
+ 	else
+ 	{
+ 		/* A normal record, ie. not xlog-switch */
+ 		int sizeleft = size;
+ 		while (freespace < sizeleft)
+ 		{
+ 			/* fill this page, and continue on next page */
+ 			sizeleft -= freespace;
+ 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
  
! 			updrqst = true;
! 		}
! 		/* the rest fits on this page */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
! 	}
! }
! 
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
  
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
! 	 * AdvanceXLInsertBuffer) and retry if we see anything else than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that appears to be ahead of the page we're looking
! 	 * for. So don't PANIC on that, until we've verified the value while
! 	 * holding the lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
! 		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
! 		}
  	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
+ 
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto: the position of the oldest insertion that was
+  * still in progress when we stopped waiting, or CurrPos if no insertions
+  * were in progress at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			MyProc->lwWaiting = true;
! 			MyProc->lwWaitMode = 0; /* doesn't matter */
! 			MyProc->lwWaitLink = NULL;
! 			if (slot->head == NULL)
! 				slot->head = MyProc;
! 			else
! 				slot->tail->lwWaitLink = MyProc;
! 			slot->tail = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			LWLockRelease(WALInsertTailLock);
! 			for (;;)
! 			{
! 				PGSemaphoreLock(&MyProc->sem, false);
! 				if (!MyProc->lwWaiting)
! 					break;
! 				extraWaits++;
! 			}
! 			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 			lastslot = Insert->lastslot;
! 			nextslot = Insert->nextslot;
  		}
  	}
  
+ 	Insert->lastslot = lastslot;
+ 	LWLockRelease(WALInsertTailLock);
+ 
+ 	while (extraWaits-- > 0)
+ 		PGSemaphoreUnlock(&MyProc->sem);
+ 
+ 	return LastPos;
+ }
+ 
+ /*
+  * Wait for the next insertion slot to become vacant.
+  */
+ static void
+ WaitForXLogInsertionSlotToBecomeFree(void)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
  	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
  
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	while (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		if (XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			SpinLockRelease(&slot->lck);
! 			break;
! 		}
! 
! 		/* wait */
! 		MyProc->lwWaiting = true;
! 		MyProc->lwWaitMode = 0; /* doesn't matter */
! 		MyProc->lwWaitLink = NULL;
! 		if (slot->head == NULL)
! 			slot->head = MyProc;
! 		else
! 			slot->tail->lwWaitLink = MyProc;
! 		slot->tail = MyProc;
! 		SpinLockRelease(&slot->lck);
! 		LWLockRelease(WALInsertTailLock);
! 		for (;;)
! 		{
! 			PGSemaphoreLock(&MyProc->sem, false);
! 			if (!MyProc->lwWaiting)
! 				break;
! 			extraWaits++;
! 		}
! 		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 		lastslot = Insert->lastslot;
! 		nextslot = Insert->nextslot;
  	}
  
! 	/*
! 	 * Ok, there is at least one empty slot now. That's enough for our
! 	 * insertion, but while we're at it, advance lastslot as much as we
! 	 * can. That way we don't need to come back here again on the next
! 	 * call.
! 	 */
! 	while (lastslot != nextslot)
! 	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		/*
! 		 * Don't need to grab the slot's spinlock here, because we're not
! 		 * interested in the exact value of CurrPos, only whether it's
! 		 * valid or not.
! 		 */
! 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
! 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertTailLock);
  }
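
The comment on ReserveXLogInsertLocation() stresses that its space
calculation must match the copy loop in PerformXLogInsert(). A standalone
mirror of that calculation can be handy when reviewing the two against each
other; this is my own sketch with made-up sizes standing in for
SizeOfXLogRecord, SizeOfXLogContRecord and the page headers, and it uses one
header size for all pages, whereas the real code uses a longer header at
segment boundaries:

    #include <stdio.h>
    #include <stdint.h>

    /* Made-up sizes, standing in for the real constants. */
    #define BLCKSZ      8192
    #define REC_HDR     32          /* stand-in for SizeOfXLogRecord */
    #define CONT_HDR    8           /* stand-in for SizeOfXLogContRecord */
    #define PAGE_HDR    24          /* stand-in for SizeOfXLogShortPHD */
    #define MAXALIGN(x) (((x) + 7) & ~(uint64_t) 7)

    static uint64_t
    freespace(uint64_t pos)
    {
        return (pos % BLCKSZ == 0) ? 0 : BLCKSZ - pos % BLCKSZ;
    }

    /* First usable byte of the next page, past its header. */
    static uint64_t
    next_page(uint64_t pos)
    {
        return pos + freespace(pos) + PAGE_HDR;
    }

    /* Where does a record of 'size' bytes (header included) end, if its
     * header starts at 'pos'? Mirrors the reservation loop. */
    static uint64_t
    record_end(uint64_t pos, uint64_t size)
    {
        uint64_t fs = freespace(pos);

        if (fs < REC_HDR)           /* record header must not cross pages */
        {
            pos = next_page(pos);
            fs = freespace(pos);
        }
        while (fs < size)           /* spill onto continuation pages */
        {
            size -= fs;
            pos = next_page(pos) + CONT_HDR;
            fs = freespace(pos);
        }
        return MAXALIGN(pos + size);    /* next record starts aligned */
    }

    int
    main(void)
    {
        printf("%llu\n", (unsigned long long) record_end(24, 100));   /* 128 */
        printf("%llu\n", (unsigned long long) record_end(24, 20000)); /* 20088 */
        return 0;
    }
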
  
  /*
***************
*** 1488,1522 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 2006,2039 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with pages headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1524,1535 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2041,2057 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1537,1581 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
! 		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2059,2108 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1583,1596 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2110,2116 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1600,1612 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
! 
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
  
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2120,2129 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1650,1660 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2167,2194 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, upto %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
  
  /*
***************
*** 1699,1714 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2233,2244 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1726,1732 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2256,2262 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1757,1770 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2287,2300 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1861,1876 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2391,2403 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1960,1967 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2487,2492 ----
***************
*** 2124,2131 **** XLogFlush(XLogRecPtr record)
  	 */
  	for (;;)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
--- 2649,2659 ----
  	 */
  	for (;;)
  	{
! 		/* use volatile pointers to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2139,2144 **** XLogFlush(XLogRecPtr record)
--- 2667,2701 ----
  			break;
  
  		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)               /* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2155,2186 **** XLogFlush(XLogRecPtr record)
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2712,2724 ----
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2292,2314 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2830,2860 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5102,5107 **** XLOGShmemSize(void)
--- 5648,5656 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5117,5122 **** XLOGShmemInit(void)
--- 5666,5672 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5142,5147 **** XLOGShmemInit(void)
--- 5692,5710 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5156,5166 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5719,5730 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 6038,6043 **** StartupXLOG(void)
--- 6602,6608 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6835,6842 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7400,7411 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6844,6869 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7413,7435 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6875,6881 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7441,7447 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7379,7385 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7945,7951 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7655,7660 **** CreateCheckPoint(int flags)
--- 8221,8227 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7723,7732 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8290,8299 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7738,7744 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8305,8311 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7747,7761 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8314,8325 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7782,7795 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8346,8355 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7815,7821 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8375,8381 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7835,7841 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8395,8401 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8202,8216 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8762,8776 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8403,8408 **** RequestXLogSwitch(void)
--- 8963,8969 ----
  {
  	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
+ 	XLogwrtRqst FlushRqst;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
  	rdata.buffer = InvalidBuffer;
***************
*** 8412,8417 **** RequestXLogSwitch(void)
--- 8973,8999 ----
  
  	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
+ 	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ 	 * if the xlog switch had no work to do, ie. if we were already at the
+ 	 * beginning of a new XLOG segment. You can check if RecPtr points to the
+ 	 * beginning of a segment if you want to keep the distinction.
+ 	 */
+ 	TRACE_POSTGRESQL_XLOG_SWITCH();
+ 
+ 	/*
+ 	 * Flush through the end of the page containing XLOG_SWITCH, and
+ 	 * perform end-of-segment actions (eg, notifying archiver).
+ 	 */
+ 	WaitXLogInsertionsToFinish(RecPtr, InvalidXLogRecPtr);
+ 
+ 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ 	FlushRqst.Write = RecPtr;
+ 	FlushRqst.Flush = RecPtr;
+ 	START_CRIT_SECTION();
+ 	XLogWrite(FlushRqst, false);
+ 	END_CRIT_SECTION();
+ 	LWLockRelease(WALWriteLock);
+ 
  	return RecPtr;
  }
  
***************
*** 8490,8511 **** XLogReportParameters(void)
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we can guarantee that there is no concurrently running
! 	 * process which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
--- 9072,9112 ----
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
+  *
+  * Note: this function assumes there is no other process running
+  * concurrently that could update it.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we assume that there is no concurrently running process
! 	 * which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
+ 	START_CRIT_SECTION();
+ 
+ 	/*
+ 	 * It's always safe to take full page images, even when not strictly
+ 	 * required, but not the other way round. So if we're setting full_page_writes
+ 	 * to true, first set it true and then write the WAL record. If we're
+ 	 * setting it to false, first write the WAL record and then set the
+ 	 * global flag.
+ 	 */
+ 	if (fullPageWrites)
+ 	{
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		Insert->fullPageWrites = true;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 	}
+ 
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
***************
*** 8521,8532 **** UpdateFullPageWrites(void)
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 	else
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		Insert->fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
  /*
--- 9122,9135 ----
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 
! 	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
! 		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
+ 	END_CRIT_SECTION();
  }
  
  /*
***************
*** 9040,9045 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9643,9649 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9102,9127 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9706,9731 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9234,9246 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9838,9850 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9324,9333 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9928,9938 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9344,9350 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9949,9955 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9357,9362 **** pg_start_backup_callback(int code, Datum arg)
--- 9962,9968 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9410,9418 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10016,10024 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9421,9436 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10027,10042 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9708,9723 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10314,10331 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9771,9782 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10379,10390 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
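
To see the shape of the scheme outside xlog.c, the reserve-then-copy idea
boils down to the following toy, self-contained analogue (illustration only,
none of it is PostgreSQL code; the real patch gives each slot its own
spinlock, where the toy makes do with a single mutex):

/*
 * Toy analogue of the reserve-then-copy scheme.  Writers reserve space
 * under a short critical section, advertise their in-progress position
 * in a slot, then copy their payload without holding the lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUFSZ		(1024 * 1024)
#define NWRITERS	4
#define NSLOTS		NWRITERS

static char		buf[BUFSZ];
static size_t	head;				/* next free byte, lock-protected */
static size_t	slots[NSLOTS];		/* 0 = idle, else reservation start+1 */
static pthread_mutex_t lck = PTHREAD_MUTEX_INITIALIZER;

static size_t
reserve(int slot, size_t len)
{
	size_t		start;

	pthread_mutex_lock(&lck);		/* step 1: short critical section */
	start = head;
	head += len;
	slots[slot] = start + 1;		/* advertise in-progress insertion */
	pthread_mutex_unlock(&lck);
	return start;
}

static void *
writer(void *arg)
{
	int			slot = (int) (long) arg;
	char		payload[64];
	int			i;

	memset(payload, 'a' + slot, sizeof(payload));
	for (i = 0; i < 1000; i++)
	{
		size_t		start = reserve(slot, sizeof(payload));

		/* step 2: copy in parallel, outside the lock */
		memcpy(buf + start, payload, sizeof(payload));
		pthread_mutex_lock(&lck);
		slots[slot] = 0;			/* insertion finished */
		pthread_mutex_unlock(&lck);
	}
	return NULL;
}

/* how far a flusher could safely write out right now */
static size_t
flush_limit(void)
{
	size_t		limit;
	int			i;

	pthread_mutex_lock(&lck);
	limit = head;
	for (i = 0; i < NSLOTS; i++)
		if (slots[i] != 0 && slots[i] - 1 < limit)
			limit = slots[i] - 1;
	pthread_mutex_unlock(&lck);
	return limit;
}

int
main(void)
{
	pthread_t	th[NWRITERS];
	int			i;

	for (i = 0; i < NWRITERS; i++)
		pthread_create(&th[i], NULL, writer, (void *) (long) i);
	printf("flushable now: %zu bytes\n", flush_limit());
	for (i = 0; i < NWRITERS; i++)
		pthread_join(th[i], NULL);
	printf("flushable at end: %zu bytes\n", flush_limit());
	return 0;
}

A flusher that respects flush_limit() never writes out bytes that some
writer is still copying, which is the job WaitXLogInsertionsToFinish()
does in the patch.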
#34Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#33)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 31.01.2012 17:35, Fujii Masao wrote:

On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>    wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

The patch seems to need a rebase again.

Here you go again. It conflicted with the group commit patch, and the patch
to WAL-log and track changes to full_page_writes setting.

After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This is not a subtle thing, it happens every time. I get counts of
between 1973 and 8827. Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.
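
For concreteness, the whole procedure fits in a short libpq harness along
these lines (a sketch, not the exact script used here; the table layout
matches the queries above, while the connection string, the $PGDATA path,
and the update count are made up):

/*
 * Crash-recovery harness (sketch).  Loads 10,000 rows, commits a stream
 * of single-row updates, crashes the server, restarts it, and recounts.
 * Assumes a scratch cluster at $PGDATA and pg_ctl on PATH.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	PQclear(PQexec(conn, "DROP TABLE IF EXISTS foo"));
	PQclear(PQexec(conn, "CREATE TABLE foo (foobar int PRIMARY KEY, count int)"));
	PQclear(PQexec(conn, "INSERT INTO foo SELECT g, 0 FROM generate_series(1, 10000) g"));

	for (i = 0; i < 100000; i++)
	{
		char		sql[80];

		snprintf(sql, sizeof(sql),
				 "UPDATE foo SET count = count + 1 WHERE foobar = %d",
				 (i % 10000) + 1);
		res = PQexec(conn, sql);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "update failed: %s", PQerrorMessage(conn));
		PQclear(res);
	}
	PQfinish(conn);

	/* simulated crash, then recovery */
	system("pg_ctl -D \"$PGDATA\" -m immediate stop");
	system("pg_ctl -D \"$PGDATA\" -w start");

	conn = PQconnectdb("dbname=postgres");
	res = PQexec(conn, "SELECT count(*) FROM foo");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("rows after recovery: %s (expected 10000)\n",
			   PQgetvalue(res, 0, 0));
	PQclear(res);
	PQfinish(conn);
	return 0;
}

Every statement runs in autocommit mode, so each update is reported
committed before the immediate shutdown; any shortfall from 10,000 in
the final count means committed work was lost.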

Cheers,

Jeff

#35Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Janes (#34)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 31.01.2012 17:35, Fujii Masao wrote:

On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>    wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

The patch seems to need a rebase again.

Here you go again. It conflicted with the group commit patch, and the patch
to WAL-log and track changes to full_page_writes setting.

After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This is not a subtle thing, it happens every time.  I get counts of
between 1973 and 8827.  Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.

A similar problem happened in my testing. When I executed CREATE TABLE and
shut down the server in immediate mode, after recovery I could not see the
created table. Here is the server log of recovery with wal_debug = on:

LOG: database system was interrupted; last known up at 2012-02-09 19:18:50 JST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/179CC90
LOG: REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
XLOG - nextOid: 24576
LOG: REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
Storage - file create: base/12277/16384
LOG: REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
LOG: there is no contrecord flag in log file 0, segment 1, offset 7987200
LOG: redo done at 0/179CCE8

According to the log "there is no contrecord flag", ISTM the patch treats the
contrecord of the backup block incorrectly, which causes the problem.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#36Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#35)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 31.01.2012 17:35, Fujii Masao wrote:

On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

On 20.01.2012 15:32, Robert Haas wrote:

On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>    wrote:

Here's another version of the patch to make XLogInsert less of a
bottleneck
on multi-CPU systems. The basic idea is the same as before, but several
bugs
have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

Here you go.

The patch seems to need a rebase again.

Here you go again. It conflicted with the group commit patch, and the patch
to WAL-log and track changes to full_page_writes setting.

After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This is not a subtle thing, it happens every time.  I get counts of
between 1973 and 8827.  Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.

A similar problem happened in my testing. When I executed CREATE TABLE and
shut down the server in immediate mode, after recovery I could not see the
created table. Here is the server log of recovery with wal_debug = on:

LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
XLOG - nextOid: 24576
LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
Storage - file create: base/12277/16384
LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
LOG:  redo done at 0/179CCE8

According to the log "there is no contrecord flag", ISTM the patch treats the
contrecord of the backup block incorrectly, which causes the problem.

Yep, as far as I read the patch, it seems to have forgotten to set the
XLP_FIRST_IS_CONTRECORD flag.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#37Jeff Janes
jeff.janes@gmail.com
In reply to: Fujii Masao (#36)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 9, 2012 at 3:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This is not a subtle thing, it happens every time.  I get counts of
between 1973 and 8827.  Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.

A similar problem happened in my testing. When I executed CREATE TABLE and
shut down the server in immediate mode, after recovery I could not see the
created table. Here is the server log of recovery with wal_debug = on:

LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
XLOG - nextOid: 24576
LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
Storage - file create: base/12277/16384
LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
LOG:  redo done at 0/179CCE8

According to the log "there is no contrecord flag", ISTM the patch treats the
contrecord of the backup block incorrectly, which causes the problem.

Yep, as far as I read the patch, it seems to have forgotten to set the
XLP_FIRST_IS_CONTRECORD flag.

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do. I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Cheers,

Jeff

Attachments:

xloginsert_fix.patchapplication/octet-stream; name=xloginsert_fix.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ba83123..a950518 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1168,6 +1168,7 @@ PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
 				 * record, and continue.
 				 */
 				XLogContRecord *contrecord;
+				int			freespace2;
 
 				memcpy(currpos, rdata->data, freespace);
 				rdata->data += freespace;
@@ -1191,7 +1192,18 @@ PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
 				UpdateSlotCurrPos(myslot, CurrPos);
 
 				/* Now skip page header */
-				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
+//				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
+
+				freespace2 = INSERT_FREESPACE(CurrPos);
+				XLByteAdvance(CurrPos, freespace2);
+
+				currpos = GetXLogBuffer(CurrPos);
+                                ((XLogPageHeader)currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
+
+				if (CurrPos.xrecoff % XLogSegSize == 0)
+					CurrPos.xrecoff += SizeOfXLogLongPHD;
+				else
+					CurrPos.xrecoff += SizeOfXLogShortPHD;
 
 				currpos = GetXLogBuffer(CurrPos);
 
#38Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Janes (#37)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 13.02.2012 01:04, Jeff Janes wrote:

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do. I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

Thanks. That's basically the right approach. The attached patch contains a
cleaned-up version of that.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Can you write a self-contained test case for that? I've been trying to
reproduce it by running the regression tests and pgbench with a
streaming replication standby, which should be pretty much the same as
crash recovery. No luck so far.
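
Something of roughly this shape is what I have in mind: crash the server
while the updates are still in flight, restart it, and look for lost
committed updates. A sketch only (the $PGDATA path, the ten-second window,
and the pre-existing 10,000-row "foo" table from upthread are assumptions):

/*
 * Crash the server mid-stream, so the WAL likely ends in a partially
 * written record, exercising the continuation-record paths.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn;
	PGresult   *res;
	pid_t		child;

	child = fork();
	if (child == 0)
	{
		/* child: issue committed updates until killed */
		int			i;

		conn = PQconnectdb("dbname=postgres");
		for (i = 0;; i++)
		{
			char		sql[80];

			snprintf(sql, sizeof(sql),
					 "UPDATE foo SET count = count + 1 WHERE foobar = %d",
					 (i % 10000) + 1);
			PQclear(PQexec(conn, sql));	/* autocommit: durable once OK */
		}
	}

	sleep(10);					/* let the updates run for a while */
	system("pg_ctl -D \"$PGDATA\" -m immediate stop");	/* simulated crash */
	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	system("pg_ctl -D \"$PGDATA\" -w start");

	conn = PQconnectdb("dbname=postgres");
	res = PQexec(conn, "SELECT count(*) FROM foo");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("rows after recovery: %s (expected 10000)\n",
			   PQgetvalue(res, 0, 0));
	PQclear(res);
	PQfinish(conn);
	return 0;
}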

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-8.patchtext/x-diff; name=xloginsert-scale-8.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 290,315 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 291,301 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 319,328 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 305,319 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 334,339 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 325,399 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible,
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the oldest slot that may still be in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree() function,
+  * which is similar to WaitXLogInsertionsToFinish(), but instead of waiting
+  * for all insertions up to a given point to finish, it just waits for the
+  * inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
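
To make the two-step flow concrete, here is a minimal sketch of the path a
single insertion takes. This is schematic only: the names match the functions
introduced below, but page crossings, the log-switch case, CRC computation
and error handling are all omitted.

	/* Step 1: serialize only the space reservation */
	SpinLockAcquire(&Insert->insertpos_lck);
	StartPos = Insert->CurrPos;					/* start of our record */
	XLByteAdvance(Insert->CurrPos, tot_len);	/* schematic; the real code
												 * also accounts for page
												 * headers and alignment */
	myslot = &XLogCtl->XLogInsertSlots[Insert->nextslot];
	myslot->CurrPos = StartPos;					/* advertise in-progress insert */
	Insert->nextslot = NextSlotNo(Insert->nextslot);
	SpinLockRelease(&Insert->insertpos_lck);

	/* Step 2: copy the record; this part runs in parallel across backends */
	memcpy(GetXLogBuffer(StartPos), rechdr, tot_len);	/* schematic; the real
														 * code splits the copy
														 * across pages */
	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);		/* done, wake waiters */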
  
***************
*** 354,364 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 414,443 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 388,406 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 467,500 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
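+ /*
+  * Rough sizing (assuming 64-bit pointers): a slot is a spinlock, an 8-byte
+  * XLogRecPtr and two pointers, ie. about 32 bytes with padding, so 1000
+  * slots cost roughly 32 kB of shared memory.
+  */
+ 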
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 414,422 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 508,526 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 494,521 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
- /* Construct XLogRecPtr value for current insertion point */
- #define INSERT_RECPTR(recptr,Insert,curridx)  \
- 	( \
- 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
- 	  (recptr).xrecoff = \
- 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
- 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 598,628 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
! 
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
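
As a worked example of the new macros, assume xlogid = 0, the default
XLOG_BLCKSZ of 8192, and 8 WAL buffers (XLogCacheBlck = 7). For a pointer
with xrecoff = 82020, ie. offset 100 into the page that starts at byte 81920:

	INSERT_FREESPACE:   82020 % 8192 = 100, so 8192 - 100 = 8092 bytes left
	XLogRecPtrToBufIdx: 82020 / 8192 = 10, and 10 % 8 = 2, so buffer 2

A pointer exactly on a page boundary gets 0 from INSERT_FREESPACE, and
XLogRecEndPtrToBufIdx maps it to the previous page's buffer, matching the
convention that a boundary pointer belongs to the page that ends there.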
***************
*** 641,649 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 748,756 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 690,695 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 797,820 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static XLogRecPtr PerformXLogInsert(int write_len,
+ 				  bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  bool didPageWrites);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 710,721 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
  	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 835,841 ----
***************
*** 729,739 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
! 	bool		isLogSwitch = false;
! 	bool		fpwChange = false;
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 849,858 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
! 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 746,775 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * Handle special cases/records.
  	 */
! 	if (rmid == RM_XLOG_ID)
  	{
- 		switch (info)
- 		{
- 			case XLOG_SWITCH:
- 				isLogSwitch = true;
- 				break;
- 
- 			case XLOG_FPW_CHANGE:
- 				fpwChange = true;
- 				break;
- 
- 			default:
- 				break;
- 		}
- 	}
- 	else if (IsBootstrapProcessingMode())
- 	{
- 		/*
- 		 * In bootstrap mode, we don't actually log anything but XLOG resources;
- 		 * return a phony record pointer.
- 		 */
  		RecPtr.xlogid = 0;
  		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
  		return RecPtr;
--- 865,875 ----
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 	 * return a phony record pointer.
  	 */
! 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
  		RecPtr.xlogid = 0;
  		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
  		return RecPtr;
***************
*** 939,1072 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
- 	START_CRIT_SECTION();
- 
- 	/* Now wait to get insert lock */
- 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
- 
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
! 	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
! 		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
! 		}
! 	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
- 	/*
- 	 * If there isn't enough space on the current XLOG page for a record
- 	 * header, advance to the next page (leaving the unused space as zeroes).
- 	 */
- 	updrqst = false;
- 	freespace = INSERT_FREESPACE(Insert);
- 	if (freespace < SizeOfXLogRecord)
- 	{
- 		updrqst = AdvanceXLInsertBuffer(false);
- 		freespace = INSERT_FREESPACE(Insert);
- 	}
- 
- 	/* Compute record's XLOG location */
- 	curridx = Insert->curridx;
- 	INSERT_RECPTR(RecPtr, Insert, curridx);
- 
- 	/*
- 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
- 	 * segment, we need not insert it (and don't want to because we'd like
- 	 * consecutive switch requests to be no-ops).  Instead, make sure
- 	 * everything is written and flushed through the end of the prior segment,
- 	 * and return the prior segment's end address.
- 	 */
- 	if (isLogSwitch &&
- 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
- 	{
- 		/* We can release insert lock immediately */
- 		LWLockRelease(WALInsertLock);
- 
- 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
- 		if (RecPtr.xrecoff == 0)
- 		{
- 			/* crossing a logid boundary */
- 			RecPtr.xlogid -= 1;
- 			RecPtr.xrecoff = XLogFileSize;
- 		}
- 
- 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
- 		LogwrtResult = XLogCtl->Write.LogwrtResult;
- 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
- 		{
- 			XLogwrtRqst FlushRqst;
- 
- 			FlushRqst.Write = RecPtr;
- 			FlushRqst.Flush = RecPtr;
- 			XLogWrite(FlushRqst, false, false);
- 		}
- 		LWLockRelease(WALWriteLock);
- 
- 		END_CRIT_SECTION();
- 
- 		return RecPtr;
- 	}
- 
- 	/* Insert record header */
- 
- 	record = (XLogRecord *) Insert->currpos;
- 	record->xl_prev = Insert->PrevRecord;
- 	record->xl_xid = GetCurrentTransactionIdIfAny();
- 	record->xl_tot_len = SizeOfXLogRecord + write_len;
- 	record->xl_len = len;		/* doesn't include backup blocks */
- 	record->xl_info = info;
- 	record->xl_rmid = rmid;
- 
- 	/* Now we can finish computing the record's CRC */
- 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
- 			   SizeOfXLogRecord - sizeof(pg_crc32));
- 	FIN_CRC32(rdata_crc);
- 	record->xl_crc = rdata_crc;
- 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
--- 1039,1078 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
  	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC, and we don't know it yet. We will
! 	 * finish computing the CRC once we do.
  	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set in PerformXLogInsert() */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to do the insertion.
  	 */
! 	RecPtr = PerformXLogInsert(write_len, isLogSwitch, &rechdr,
! 							   rdata, rdata_crc, doPageWrites);
! 	END_CRIT_SECTION();
! 
! 	if (XLogRecPtrIsInvalid(RecPtr))
  	{
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
***************
*** 1075,1267 **** begin:;
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/* Record begin of record in appropriate places */
! 	ProcLastRecPtr = RecPtr;
! 	Insert->PrevRecord = RecPtr;
  
! 	Insert->currpos += SizeOfXLogRecord;
! 	freespace -= SizeOfXLogRecord;
  
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
  	/*
! 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
! 	 * in shared memory before releasing WALInsertLock. This ensures that
! 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
! 	 * by this change of full_page_writes.
  	 */
! 	if (fpwChange)
! 		Insert->fullPageWrites = fullPageWrites;
  
! 	LWLockRelease(WALInsertLock);
! 
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1081,1796 ----
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
  						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
! 	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
! 	 */
! 	return RecPtr;
! }
! 
! /*
!  * Subroutine of XLogInsert. All the changes to shared state are done here,
!  * XLogInsert only prepares the record for insertion.
!  *
!  * On success, returns pointer to end of inserted record like XLogInsert().
!  * If RedoRecPtr or forcePageWrites has changed, returns InvalidXLogRecPtr, and
!  * the caller must recalculate the full-page images and retry.
!  */
! static XLogRecPtr
! PerformXLogInsert(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 				  XLogRecData *rdata, pg_crc32 rdata_crc,
! 				  bool didPageWrites)
! {
! 	volatile XLogInsertSlot *myslot = NULL;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			tot_len;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	PrevRecord;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	EndPos;
! 	XLogRecPtr	CurrPos;
! 	bool		updrqst;
  
! 	/* Get an insert location  */
! 	tot_len = SizeOfXLogRecord + write_len;
! 	if (!ReserveXLogInsertLocation(tot_len, didPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
! 	{
! 		return EndPos;
! 	}
  
  	/*
! 	 * Got it! Now that we know the prev-link, we can finish computing the
! 	 * record's CRC.
  	 */
! 	rechdr->xl_prev = PrevRecord;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 
! 	/* Copy the record header in place */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
  
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				CurrPos.xrecoff += freespace;
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Advertise that as our CurrPos before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/*
! 				 * Get pointer to beginning of next page, and set the
! 				 * XLP_FIRST_IS_CONTRECORD flag in the page header.
! 				 *
! 				 * It's safe to set the contrecord flag without a lock on the
! 				 * page. All the other flags are set in AdvanceXLInsertBuffer,
! 				 * and we're the only backend that needs to set the contrecord
! 				 * flag.
! 				 */
! 				currpos = GetXLogBuffer(CurrPos);
! 				((XLogPageHeader) currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 
! 				/* skip over the page header, and write continuation record */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
  
! 		/*
! 		 * An xlog-switch record consumes all the remaining space on the
! 		 * WAL segment. We have already reserved it for us, but we still need
! 		 * to make sure it's been allocated and zeroed in the WAL buffers so
! 		 * that when the caller (or someone else) does XLogWrite(), it can
! 		 * really write out all the zeros.
! 		 *
! 		 * We do this one page at a time, to make sure we don't deadlock
! 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
! 		 */
! 		while (XLByteLT(CurrPos, EndPos))
! 		{
! 			/* use up all the remaining space in this page */
! 			freespace = INSERT_FREESPACE(CurrPos);
! 			XLByteAdvance(CurrPos, freespace);
! 			/*
! 			 * As in the non-xlog-switch code path, let others know that
! 			 * we're done writing up to the end of this page.
! 			 */
! 			UpdateSlotCurrPos(myslot, CurrPos);
! 			/*
! 			 * Let GetXLogBuffer() initialize the next page if necessary.
! 			 */
! 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 			(void) GetXLogBuffer(CurrPos);
! 		}
! 
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which
! 		 * is reflected in EndPos, we need to return a value that points just
! 		 * to the end of the xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
! 	/* update our global variables */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
! 	return EndPos;
! }
  
+ /*
+  * Reserves the right amount of space for a record of given size from the WAL.
+  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
+  * its end, and *PrevRecord_p to the beginning of the previous record, to be
+  * stored in the prev-link of the record header.
+  *
+  * A log-switch record is handled slightly differently. The rest of the
+  * segment will be reserved for this insertion, as indicated by the returned
+  * *EndPos_p value. However, if we are already at the beginning of the current
+  * segment, *EndPos_p is set to the current location without reserving
+  * any space, and the function returns false.
+  *
+  * *updrqst_p is set to true if this record ends on a different page than
+  * the previous one - the caller should update the shared LogwrtRqst value
+  * after it's done inserting the record in that case, so that the WAL page
+  * that filled up gets written out at the next convenient moment.
+  *
+  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
+  * (or the end of the previous record, to be exact) to let others know that we're
+  * busy inserting to the reserved area. The caller must clear it when the
+  * insertion is finished.
+  *
+  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
+  * changed. On failure, the shared state is not modified.
+  *
+  * This is the performance-critical part of XLogInsert that must be
+  * serialized across backends. The rest can happen mostly in parallel.
+  *
+  * NB: The space calculation here must match the code in PerformXLogInsert,
+  * where we actually copy the record to the reserved space.
+  */
+ static bool
+ ReserveXLogInsertLocation(int size, bool didPageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
+ {
+ 	volatile XLogInsertSlot *myslot;
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			freespace;
+ 	XLogRecPtr	ptr;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	LastEndPos;
+ 	int32		nextslot;
+ 	int32		lastslot;
+ 	bool		updrqst = false;
+ 
+ retry:
+ 	SpinLockAcquire(&Insert->insertpos_lck);
+ 
+ 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
+ 		(!didPageWrites && (Insert->forcePageWrites || Insert->fullPageWrites)))
+ 	{
  		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
  		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
! 
! 	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
! 	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
! 
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
! 	/*
! 	 * We are now at the starting position of our record. Now figure out how
! 	 * the data will be split across the WAL pages, to calculate where the
! 	 * record ends.
! 	 */
! 	StartPos = ptr;
  
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
+ 		Assert(size == SizeOfXLogRecord);
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
! 
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
! 		}
! 		else
! 		{
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
  		}
+ 	}
+ 	else
+ 	{
+ 		/* A normal record, ie. not xlog-switch */
+ 		int sizeleft = size;
+ 		while (freespace < sizeleft)
+ 		{
+ 			/* fill this page, and continue on next page */
+ 			sizeleft -= freespace;
+ 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
  
! 			updrqst = true;
! 		}
! 		/* the rest fits on this page */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
! 
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: Is this really necessary? A function call is at most a compiler
! 	 * barrier, not a CPU memory barrier, so on weakly-ordered hardware an
! 	 * explicit barrier seems required.
! 	 */
! 	pg_write_barrier();
  
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
  	}
+ }
+ 
+ /*
+  * Get a pointer to the right location in the WAL buffer containing the
+  * given XLogRecPtr.
+  *
+  * If the page is not initialized yet, it is initialized. That might require
+  * evicting an old dirty buffer from the buffer cache, which means I/O.
+  *
+  * The caller must ensure that the page containing the requested location
+  * isn't evicted yet, and won't be evicted, by holding onto an
+  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
+  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
+  * if we have to evict a buffer, we might have to wait for someone else to
+  * finish a write. And that someone else might not be able to finish the write
+  * if our CurrPos points to a buffer that's still in the buffer cache.
+  */
+ static char *
+ GetXLogBuffer(XLogRecPtr ptr)
+ {
+ 	int			idx;
+ 	XLogRecPtr	endptr;
+ 
+ 	/*
+ 	 * The XLog buffer cache is organized so that we can easily calculate the
+ 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
+ 	 * A page must always be loaded to a particular buffer.
+ 	 */
+ 	idx = XLogRecPtrToBufIdx(ptr);
+ 
+ 	/*
+ 	 * See what page is loaded in the buffer at the moment. It could be the
+ 	 * page we're looking for, or something older. It can't be anything
+ 	 * newer - that would imply the page we're looking for has already
+ 	 * been written out to disk, which shouldn't happen as long as the caller
+ 	 * has set its slot's CurrPos correctly.
+ 	 *
+ 	 * However, we don't hold a lock while we read the value. If someone has
+ 	 * just initialized the page, it's possible that we get a "torn read",
+ 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
+ 	 * AdvanceXLInsertBuffer) and retry if we see anything other than the page
+ 	 * we're looking for. But it means that when we do this unlocked read, we
+ 	 * might see a value that appears to be ahead of the page we're looking
+ 	 * for. So don't PANIC on that, until we've verified the value while
+ 	 * holding the lock.
+ 	 */
+ 	endptr = XLogCtl->xlblocks[idx];
+ 	if (ptr.xlogid != endptr.xlogid ||
+ 		!(ptr.xrecoff < endptr.xrecoff &&
+ 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
+ 	{
+ 		AdvanceXLInsertBuffer(ptr, false);
+ 		endptr = XLogCtl->xlblocks[idx];
+ 
+ 		if (ptr.xlogid != endptr.xlogid ||
+ 			!(ptr.xrecoff < endptr.xrecoff &&
+ 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
+ 		{
+ 			elog(PANIC, "could not find WAL buffer for %X/%X",
+ 				 ptr.xlogid, ptr.xrecoff);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
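+ 
+ /*
+  * For example, with XLOG_BLCKSZ = 8192, a pointer at xrecoff = 82020 (offset
+  * 100 into the page starting at 81920) has freespace 8192 - 100 = 8092, so
+  * it first advances to 90112, the first byte of the next page. 90112 is not
+  * a segment boundary, so SizeOfXLogShortPHD is added, yielding the first
+  * insertable byte after the short page header. Only a page that starts a
+  * new segment gets the long header instead.
+  */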
+ 
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto, which indicates the oldest in-progress insertion
+  * that we saw in the array, or CurrPos if there are no insertions in-progress
+  * at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			MyProc->lwWaiting = true;
! 			MyProc->lwWaitMode = 0; /* doesn't matter */
! 			MyProc->lwWaitLink = NULL;
! 			if (slot->head == NULL)
! 				slot->head = MyProc;
! 			else
! 				slot->tail->lwWaitLink = MyProc;
! 			slot->tail = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			LWLockRelease(WALInsertTailLock);
! 			for (;;)
! 			{
! 				PGSemaphoreLock(&MyProc->sem, false);
! 				if (!MyProc->lwWaiting)
! 					break;
! 				extraWaits++;
! 			}
! 			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 			lastslot = Insert->lastslot;
! 			nextslot = Insert->nextslot;
  		}
  	}
  
+ 	Insert->lastslot = lastslot;
+ 	LWLockRelease(WALInsertTailLock);
+ 
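+ 	/*
+ 	 * Fix the process wait semaphore's count for any absorbed wakeups, the
+ 	 * same way lwlock.c does after a wait.
+ 	 */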
+ 	while (extraWaits-- > 0)
+ 		PGSemaphoreUnlock(&MyProc->sem);
+ 
+ 	return LastPos;
+ }
+ 
+ /*
+  * Wait for the next insertion slot to become vacant.
+  */
+ static void
+ WaitForXLogInsertionSlotToBecomeFree(void)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
  	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
  
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	while (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		if (XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			SpinLockRelease(&slot->lck);
! 			break;
! 		}
! 
! 		/* wait */
! 		MyProc->lwWaiting = true;
! 		MyProc->lwWaitMode = 0; /* doesn't matter */
! 		MyProc->lwWaitLink = NULL;
! 		if (slot->head == NULL)
! 			slot->head = MyProc;
! 		else
! 			slot->tail->lwWaitLink = MyProc;
! 		slot->tail = MyProc;
! 		SpinLockRelease(&slot->lck);
! 		LWLockRelease(WALInsertTailLock);
! 		for (;;)
! 		{
! 			PGSemaphoreLock(&MyProc->sem, false);
! 			if (!MyProc->lwWaiting)
! 				break;
! 			extraWaits++;
! 		}
! 		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 		lastslot = Insert->lastslot;
! 		nextslot = Insert->nextslot;
  	}
  
! 	/*
! 	 * Ok, there is at least one empty slot now. That's enough for our
! 	 * insertion, but while we're at it, advance lastslot as much as we
! 	 * can, so that we don't need to come back here on the next call.
! 	 */
! 	while (lastslot != nextslot)
! 	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		/*
! 		 * Don't need to grab the slot's spinlock here, because we're not
! 		 * interested in the exact value of CurrPos, only whether it's
! 		 * valid or not.
! 		 */
! 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
! 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertTailLock);
  }
  
  /*
***************
*** 1488,1522 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 2017,2050 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1524,1535 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2052,2068 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1537,1581 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
! 		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2070,2119 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1583,1596 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2121,2127 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1600,1612 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
! 
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2131,2140 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1650,1660 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2178,2205 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, upto %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
  
  /*
***************
*** 1699,1714 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2244,2255 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1726,1732 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2267,2273 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1757,1770 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2298,2311 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1861,1876 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2402,2414 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1960,1967 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2498,2503 ----
***************
*** 2124,2131 **** XLogFlush(XLogRecPtr record)
  	 */
  	for (;;)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
--- 2660,2670 ----
  	 */
  	for (;;)
  	{
! 		/* use volatile pointers to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2139,2144 **** XLogFlush(XLogRecPtr record)
--- 2678,2712 ----
  			break;
  
  		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)               /* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2155,2186 **** XLogFlush(XLogRecPtr record)
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2723,2735 ----
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2292,2314 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2841,2871 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5102,5107 **** XLOGShmemSize(void)
--- 5659,5667 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5117,5122 **** XLOGShmemInit(void)
--- 5677,5683 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5142,5147 **** XLOGShmemInit(void)
--- 5703,5721 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5156,5166 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5730,5741 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 6038,6043 **** StartupXLOG(void)
--- 6613,6619 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6844,6851 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7420,7431 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6853,6878 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7433,7455 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6884,6890 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7461,7467 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7390,7396 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7967,7973 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7666,7671 **** CreateCheckPoint(int flags)
--- 8243,8249 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7734,7743 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8312,8321 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7749,7755 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8327,8333 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7758,7772 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8336,8347 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7793,7806 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8368,8377 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7826,7832 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8397,8403 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7846,7852 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8417,8423 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8213,8227 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8784,8798 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8414,8419 **** RequestXLogSwitch(void)
--- 8985,8991 ----
  {
  	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
+ 	XLogwrtRqst FlushRqst;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
  	rdata.buffer = InvalidBuffer;
***************
*** 8423,8428 **** RequestXLogSwitch(void)
--- 8995,9021 ----
  
  	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
+ 	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ 	 * if the xlog switch had no work to do, ie. if we were already at the
+ 	 * beginning of a new XLOG segment. You can check whether RecPtr points
+ 	 * to the beginning of a segment if you want to keep the distinction.
+ 	 */
+ 	TRACE_POSTGRESQL_XLOG_SWITCH();
+ 
+ 	/*
+ 	 * Flush through the end of the page containing XLOG_SWITCH, and
+ 	 * perform end-of-segment actions (eg, notifying archiver).
+ 	 */
+ 	WaitXLogInsertionsToFinish(RecPtr, InvalidXLogRecPtr);
+ 
+ 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ 	FlushRqst.Write = RecPtr;
+ 	FlushRqst.Flush = RecPtr;
+ 	START_CRIT_SECTION();
+ 	XLogWrite(FlushRqst, false);
+ 	END_CRIT_SECTION();
+ 	LWLockRelease(WALWriteLock);
+ 
  	return RecPtr;
  }
  
***************
*** 8501,8522 **** XLogReportParameters(void)
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we can guarantee that there is no concurrently running
! 	 * process which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
--- 9094,9134 ----
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
+  *
+  * Note: this function assumes there is no other process running
+  * concurrently that could update it.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we assume that there is no concurrently running process
! 	 * which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
+ 	START_CRIT_SECTION();
+ 
+ 	/*
+ 	 * It's always safe to take full page images, even when not strictly
+ 	 * required, but not the other way round. So if we're setting full_page_writes
+ 	 * to true, first set it true and then write the WAL record. If we're
+ 	 * setting it to false, first write the WAL record and then set the
+ 	 * global flag.
+ 	 */
+ 	if (fullPageWrites)
+ 	{
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		Insert->fullPageWrites = true;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 	}
+ 
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
***************
*** 8532,8543 **** UpdateFullPageWrites(void)
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 	else
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		Insert->fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
  /*
--- 9144,9157 ----
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 
! 	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
! 		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
+ 	END_CRIT_SECTION();
  }
  
  /*
***************
*** 9063,9068 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9677,9683 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9125,9150 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9740,9765 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9257,9269 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9872,9884 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9347,9356 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9962,9972 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9367,9373 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9983,9989 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9380,9385 **** pg_start_backup_callback(int code, Datum arg)
--- 9996,10002 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9433,9441 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10050,10058 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9444,9459 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10061,10076 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9731,9746 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10348,10365 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9794,9805 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10413,10424 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#39Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#38)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.02.2012 01:04, Jeff Janes wrote:

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
 I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Can you write a self-contained test case for that? I've been trying to
reproduce that by running the regression tests and pgbench with a streaming
replication standby, which should be pretty much the same as crash recovery.
No luck this far.

I was probably able to reproduce the same problem Jeff saw. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t
values(generate_series(1,10000)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/179CC90
LOG: invalid magic number 0000 in log file 0, segment 1, offset 8060928
LOG: redo done at 0/17AD858
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

After recovery, I could not see the table "t" which I created before:

$ psql -c "select count(*) from t"
ERROR: relation "t" does not exist

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#39)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 13.02.2012 19:13, Fujii Masao wrote:

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.02.2012 01:04, Jeff Janes wrote:

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do. I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Can you write a self-contained test case for that? I've been trying to
reproduce that by running the regression tests and pgbench with a streaming
replication standby, which should be pretty much the same as crash recovery.
No luck this far.

I was probably able to reproduce the same problem Jeff saw. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t
values(generate_series(1,10000)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/179CC90
LOG: invalid magic number 0000 in log file 0, segment 1, offset 8060928
LOG: redo done at 0/17AD858
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

After recovery, I could not see the table "t" which I created before:

$ psql -c "select count(*) from t"
ERROR: relation "t" does not exist

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?
That includes Jeff's fix for the original crash you and Jeff saw. With
that version, I can't get a crash anymore. I also can't reproduce the
inconsistency that Jeff still saw with his fix
(http://archives.postgresql.org/message-id/CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+ORfiYj6ow@mail.gmail.com).
Jeff, can you clarify if you're still seeing an issue with the latest
version of the patch? If so, can you give a self-contained test case for
that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#41Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#40)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.02.2012 19:13, Fujii Masao wrote:

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

On 13.02.2012 01:04, Jeff Janes wrote:

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
 I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Can you write a self-contained test case for that? I've been trying to
reproduce that by running the regression tests and pgbench with a
streaming
replication standby, which should be pretty much the same as crash
recovery.
No luck this far.

I was probably able to reproduce the same problem Jeff saw. Here is the test
case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t
values(generate_series(1,10000)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG:  database system was interrupted; last known up at 2012-02-14
02:07:01 JST
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number 0000 in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After recovery, I could not see the table "t" which I created before:

$ psql -c "select count(*) from t"
ERROR:  relation "t" does not exist

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?

Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.

I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion failure.

LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT: create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT: create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void
*)0)) && ((&(target->tid))->ip_posid != 0))))", File: "heapam.c",
Line: 5578)
LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG: xlog bg flush request 0/17B4000; write 0/17B0000; flush 0/17B0000
LOG: server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#41)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 15.02.2012 18:52, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?

Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.

Ok.

I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion failure.

LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT: create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT: create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void
*)0))&& ((&(target->tid))->ip_posid != 0))))", File: "heapam.c",
Line: 5578)
LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG: xlog bg flush request 0/17B4000; write 0/17B0000; flush 0/17B0000
LOG: server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.

That's strange. I made a fresh checkout, too, and applied the patch, but
still can't reproduce. I used the attached script to test it.

It's surprising that the crash happens when the records are inserted,
not at recovery. I don't see anything obviously wrong there, so could
you please take a look around in gdb and see if you can get a clue
what's going on? What's the stack trace?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fujii-crash.shapplication/x-sh; name=fujii-crash.shDownload
#43Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#42)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 15.02.2012 18:52, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Are you still seeing this failure with the latest patch I posted

(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?

Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.

Ok.

I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion
failure.

LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void
*)0))&&  ((&(target->tid))->ip_posid != 0))))", File: "heapam.c",

Line: 5578)
LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG:  xlog bg flush request 0/17B4000; write 0/17B0000; flush 0/17B0000
LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.

That's strange. I made a fresh checkout, too, and applied the patch, but
still can't reproduce. I used the attached script to test it.

It's surprising that the crash happens when the records are inserted, not at
recovery. I don't see anything obviously wrong there, so could you please
take a look around in gdb and see if you can get a clue what's going on?
What's the stack trace?

According to the above log messages, one strange thing is that the location
of a WAL record (i.e., 0/17B3F90) does not match the prev-link of the
following WAL record (i.e., 0/17B3F50). Is this intentional?
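
If I read the recovery code right, replay checks each record's back-link
against where the previous record actually started, so a mismatch like the
one above should be caught at that point. A standalone sketch of that check
(the struct and helper here are illustrative only, not taken from the patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

/* The prev-link test recovery applies, in miniature: xl_prev of each
 * record must equal the start position of the record before it. */
static bool
prev_link_ok(XLogRecPtr xl_prev, XLogRecPtr prev_start)
{
	return xl_prev.xlogid == prev_start.xlogid &&
		   xl_prev.xrecoff == prev_start.xrecoff;
}

int
main(void)
{
	XLogRecPtr first = {0, 0x17B3F90};        /* where the first record went */
	XLogRecPtr second_prev = {0, 0x17B3F50};  /* what the next record claims */

	if (!prev_link_ok(second_prev, first))
		printf("incorrect prev-link 0/%X, expected 0/%X\n",
			   (unsigned) second_prev.xrecoff, (unsigned) first.xrecoff);
	return 0;
}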

BTW, when I ran the test on my Ubuntu machine, I could not reproduce the problem.
I could reproduce it only on MacOS.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#44Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#43)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 15.02.2012 18:52, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Are you still seeing this failure with the latest patch I posted

(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?

Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.

Ok.

I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion
failure.

LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void
*)0))&&  ((&(target->tid))->ip_posid != 0))))", File: "heapam.c",

Line: 5578)
LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG:  xlog bg flush request 0/17B4000; write 0/17B0000; flush 0/17B0000
LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.

That's strange. I made a fresh checkout, too, and applied the patch, but
still can't reproduce. I used the attached script to test it.

It's surprising that the crash happens when the records are inserted, not at
recovery. I don't see anything obviously wrong there, so could you please
take a look around in gdb and see if you can get a clue what's going on?
What's the stack trace?

According to the above log messages, one strange thing is that the location
of a WAL record (i.e., 0/17B3F90) does not match the prev-link of the
following WAL record (i.e., 0/17B3F50). Is this intentional?

BTW, when I ran the test on my Ubuntu machine, I could not reproduce the problem.
I could reproduce it only on MacOS.

+	nextslot = Insert->nextslot;
+	if (NextSlotNo(nextslot) == lastslot)
+	{
+		/*
+		 * Oops, we've "caught our tail" and the oldest slot is still in use.
+		 * Have to wait for it to become vacant.
+		 */
+		SpinLockRelease(&Insert->insertpos_lck);
+		WaitForXLogInsertionSlotToBecomeFree();
+		goto retry;
+	}
+	myslot = &XLogCtl->XLogInsertSlots[nextslot];
+	nextslot = NextSlotNo(nextslot);

nextslot can reach NumXLogInsertSlots, which would be a bug, I guess.
When I applied that quick fix and ran the test, I could not reproduce the problem
any more. I'm not sure if this is really the cause of the problem, though.
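For reference, here is the behavior I think the slot advance needs -- the
index wraps before it can ever equal NumXLogInsertSlots -- as a standalone
sketch (the constant is made up for illustration; NextSlotNo's real
definition is in the patch):

#include <assert.h>

#define NumXLogInsertSlots 128	/* hypothetical value, for illustration */

/* Advance a slot number with explicit wrap-around, so the result always
 * stays within [0, NumXLogInsertSlots). */
static int
NextSlotNo(int slotno)
{
	return (slotno + 1) % NumXLogInsertSlots;
}

int
main(void)
{
	int slot = 0;
	int i;

	for (i = 0; i < 3 * NumXLogInsertSlots; i++)
	{
		slot = NextSlotNo(slot);
		assert(slot >= 0 && slot < NumXLogInsertSlots);
	}
	return 0;
}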

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#45Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#38)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.02.2012 01:04, Jeff Janes wrote:

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
 I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.

Got another problem: when I ran pg_stop_backup to take an online backup,
it got stuck until I had generated a new WAL record. This happens because,
in the patch, when pg_stop_backup forces a switch to a new WAL file, the old
WAL file is not marked as archivable until the next WAL record has been
inserted, but pg_stop_backup keeps waiting for that WAL file to be archived.
OTOH, without the patch, the WAL file is marked as archivable as soon as the
WAL file switch occurs.

So, in short, the patch seems to handle the WAL file switch logic incorrectly.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#44)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 16.02.2012 13:31, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

BTW, when I ran the test on my Ubuntu, I could not reproduce the problem.
I could reproduce the problem only in MacOS.

+	nextslot = Insert->nextslot;
+	if (NextSlotNo(nextslot) == lastslot)
+	{
+		/*
+		 * Oops, we've "caught our tail" and the oldest slot is still in use.
+		 * Have to wait for it to become vacant.
+		 */
+		SpinLockRelease(&Insert->insertpos_lck);
+		WaitForXLogInsertionSlotToBecomeFree();
+		goto retry;
+	}
+	myslot = &XLogCtl->XLogInsertSlots[nextslot];
+	nextslot = NextSlotNo(nextslot);

nextslot can reach NumXLogInsertSlots, which would be a bug, I guess.
When I did the quick-fix and ran the test, I could not reproduce the problem
any more. I'm not sure if this is really the cause of the problem, though.

Ah, I see. That explains why you only see it on some platforms -
depending on ALIGNOF_XLOG_BUFFER, there is often enough padding after
the last valid slot to accommodate the extra bogus slot. Thanks for the
debugging!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#47Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#45)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 17.02.2012 07:27, Fujii Masao wrote:

Got another problem: when I ran pg_stop_backup to take an online backup,
it got stuck until I had generated a new WAL record. This happens because,
in the patch, when pg_stop_backup forces a switch to a new WAL file, the old
WAL file is not marked as archivable until the next WAL record has been
inserted, but pg_stop_backup keeps waiting for that WAL file to be archived.
OTOH, without the patch, the WAL file is marked as archivable as soon as the
WAL file switch occurs.

So, in short, the patch seems to handle the WAL file switch logic incorrectly.

Yep. For a WAL-switch record, XLogInsert returns the location of the end
of the record, not the end of the empty padding space. So when the
caller flushed up to that point, it didn't flush the empty space and
therefore didn't notify the archiver.
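The obvious fix is to return a position rounded up to the next segment
boundary for a switch record, so that flushing up to the returned position
covers the padding too. A standalone sketch of just the rounding (default
segment size; crossing into the next log file is deliberately not handled):

#include <stdint.h>
#include <stdio.h>

#define XLogSegSize (16 * 1024 * 1024)	/* default 16 MB segments */

/* Round an offset within a log file up to the next segment boundary, so
 * a flush of an XLOG_SWITCH record also covers the padding that fills
 * out the rest of the segment. */
static uint32_t
round_up_to_segment(uint32_t xrecoff)
{
	if (xrecoff % XLogSegSize != 0)
		xrecoff += XLogSegSize - xrecoff % XLogSegSize;
	return xrecoff;
}

int
main(void)
{
	printf("0x%X -> 0x%X\n", 0x17AD858u,
		   (unsigned) round_up_to_segment(0x17AD858u));
	return 0;
}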

Attached is a new version, fixing that, and the off-by-one bug you pointed
out in the slot wraparound handling. I also moved code around a bit; I
think this new division of labor between the XLogInsert subroutines is
more readable.

Thanks for the testing!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-9.patchtext/x-diff; name=xloginsert-scale-9.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 290,315 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 291,301 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 319,328 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 305,319 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 334,339 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 325,399 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible;
+  *    insertpos_lck can be heavily contended on a busy system.
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is done by the WaitForXLogInsertionSlotToBecomeFree()
+  * function, which is similar to WaitXLogInsertionsToFinish(), but instead
+  * of waiting for all insertions up to a given point to finish, it just
+  * waits for the inserter in the oldest slot to finish.
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 354,364 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 414,443 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 388,406 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 467,500 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 414,422 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 508,526 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 494,521 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
- /* Free space remaining in the current xlog page buffer */
- #define INSERT_FREESPACE(Insert)  \
- 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 598,628 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
+ 
+ #define NextBufIdx(idx)		\
+ 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogSegsPerFile * XLogSegSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 641,649 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 748,756 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 690,695 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 797,820 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  XLogInsertSlot *myslot,
+ 				  XLogRecPtr StartPos, XLogRecPtr EndPos);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto,
+ 						   XLogRecPtr CurrPos);
+ static void WaitForXLogInsertionSlotToBecomeFree(void);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 710,721 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
- 	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 835,840 ----
***************
*** 729,739 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
! 	bool		isLogSwitch = false;
! 	bool		fpwChange = false;
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 848,862 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
! 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
+ 	XLogRecPtr	PrevRecord;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	EndPos;
+ 	XLogInsertSlot *myslot;
+ 	bool		updrqst;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 746,778 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * Handle special cases/records.
  	 */
! 	if (rmid == RM_XLOG_ID)
! 	{
! 		switch (info)
! 		{
! 			case XLOG_SWITCH:
! 				isLogSwitch = true;
! 				break;
! 
! 			case XLOG_FPW_CHANGE:
! 				fpwChange = true;
! 				break;
! 
! 			default:
! 				break;
! 		}
! 	}
! 	else if (IsBootstrapProcessingMode())
  	{
! 		/*
! 		 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 		 * return a phony record pointer.
! 		 */
! 		RecPtr.xlogid = 0;
! 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return RecPtr;
  	}
  
  	/*
--- 869,882 ----
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 	 * return a phony record pointer.
  	 */
! 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		EndPos.xlogid = 0;
! 		EndPos.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return EndPos;
  	}
  
  	/*
***************
*** 939,1071 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	START_CRIT_SECTION();
  
! 	/* Now wait to get insert lock */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
  	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
  		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
  		}
- 	}
  
! 	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
! 	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
! 	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
  	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
! 
! 		END_CRIT_SECTION();
  
! 		return RecPtr;
  	}
  
! 	/* Insert record header */
  
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
  
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 1043,1148 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
! 	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set later */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to reserve space for the record from the WAL.
  	 */
! 	if (!ReserveXLogInsertLocation(write_len, doPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
  	{
! 		END_CRIT_SECTION();
  
! 		/*
! 		 * If the record is an XLOG_SWITCH, and we were exactly at the start
! 		 * of a segment, we need not insert it (and don't want to because we'd
! 		 * like consecutive switch requests to be no-ops).  Instead, make sure
! 		 * everything is written and flushed through the end of the prior
! 		 * segment, and return the prior segment's end address.
! 		 */
! 		if (isLogSwitch && !XLogRecPtrIsInvalid(EndPos))
  		{
! 			XLogFlush(EndPos);
! 			return EndPos;
  		}
  
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
! 	else
  	{
! 		/*
! 		 * Finish the record header by setting prev-link (now that we know it),
! 		 * and finish computing the record's CRC (in CopyXLogRecordToWAL). Then
! 		 * copy the record to the space we reserved.
! 		 */
! 		rechdr.xl_prev = PrevRecord;
! 		CopyXLogRecordToWAL(write_len, isLogSwitch, &rechdr,
! 							rdata, rdata_crc, myslot, StartPos, EndPos);
  	}
  
! 	END_CRIT_SECTION();
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	/*
! 	 * If this was an XLOG_SWITCH record, flush the record and the empty
! 	 * padding space that fills the rest of the segment, and perform
! 	 * end-of-segment actions (eg, notifying archiver).
! 	 */
! 	if (isLogSwitch)
! 	{
! 		XLogFlush(EndPos);
  
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which is
! 		 * reflected in EndPos, we return a pointer to just the end of the
! 		 * xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
! 	}
  
! 	/*
! 	 * Update our global variables
! 	 */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1074,1267 **** begin:;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
- 	/* Record begin of record in appropriate places */
- 	ProcLastRecPtr = RecPtr;
- 	Insert->PrevRecord = RecPtr;
- 
- 	Insert->currpos += SizeOfXLogRecord;
- 	freespace -= SizeOfXLogRecord;
- 
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
  		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
! 		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
! 		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
  		}
  		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
  	/*
! 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
! 	 * in shared memory before releasing WALInsertLock. This ensures that
! 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
! 	 * by this change of full_page_writes.
  	 */
! 	if (fpwChange)
! 		Insert->fullPageWrites = fullPageWrites;
  
! 	LWLockRelease(WALInsertLock);
! 
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1151,1815 ----
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 EndPos.xlogid, EndPos.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	return EndPos;
! }
  
! /*
!  * Subroutine of XLogInsert. Copies a WAL record to an already-reserved area
!  * in the WAL.
!  */
! static void
! CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 					XLogRecData *rdata, pg_crc32 rdata_crc,
! 					XLogInsertSlot *myslot_p,
! 					XLogRecPtr StartPos, XLogRecPtr EndPos)
! {
! 	volatile XLogInsertSlot *myslot = myslot_p;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	CurrPos;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 
! 	/* Copy the record header in place, and finish calculating CRC */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	CurrPos.xrecoff += SizeOfXLogRecord;
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				CurrPos.xrecoff += freespace;
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Advertise that as our CurrPos before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/*
! 				 * Get pointer to beginning of next page, and set the
! 				 * XLP_FIRST_IS_CONTRECORD flag in the page header.
! 				 *
! 				 * It's safe to set the contrecord flag without a lock on the
! 				 * page. All the other flags are set in AdvanceXLInsertBuffer,
! 				 * and we're the only backend that needs to set the contrecord
! 				 * flag.
! 				 */
! 				currpos = GetXLogBuffer(CurrPos);
! 				((XLogPageHeader) currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 
! 				/* skip over the page header, and write continuation record */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			CurrPos.xrecoff += rdata->len;
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
  
! 		/*
! 		 * An xlog-switch record consumes all the remaining space on the
! 		 * WAL segment. We have already reserved it for us, but we still need
! 		 * to make sure it's been allocated and zeroed in the WAL buffers so
! 		 * that when the caller (or someone else) does XLogWrite(), it can
! 		 * really write out all the zeros.
! 		 *
! 		 * We do this one page at a time, to make sure we don't deadlock
! 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
! 		 */
! 		while (XLByteLT(CurrPos, EndPos))
! 		{
! 			/* use up all the remaining space in this page */
! 			freespace = INSERT_FREESPACE(CurrPos);
! 			XLByteAdvance(CurrPos, freespace);
! 			/*
! 			 * like in the non-xlog-switch codepath, let others know that
! 			 * we're done writing up to the end of this page
! 			 */
! 			UpdateSlotCurrPos(myslot, CurrPos);
! 			/*
! 			 * let GetXLogBuffer initialize next page if necessary.
! 			 */
! 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 			(void) GetXLogBuffer(CurrPos);
! 		}
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
! }
! 
! /*
!  * Reserves the right amount of space for a record of given size from the WAL.
!  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
!  * its end, and *PrevRecord_p to the beginning of the previous record, to be
!  * used as the prev-link in the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one - the caller should update the shared LogwrtRqst value
!  * after it's done inserting the record in that case, so that the WAL page
!  * that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
!  * (or the end of the previous record, to be exact) to let others know that
!  * we're busy inserting to the reserved area. The caller must clear it when the
!  * insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be
!  * serialized across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool didPageWrites,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
! {
! 	volatile XLogInsertSlot *myslot;
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	LastEndPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
! 
! 	size = SizeOfXLogRecord + size;
! 
! retry:
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		(!didPageWrites && (Insert->forcePageWrites || Insert->fullPageWrites)))
! 	{
! 		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
  
  	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
  	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitForXLogInsertionSlotToBecomeFree();
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
  
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
! 	/*
! 	 * We are now at the starting position of our record. Now figure out how
! 	 * the data will be split across the WAL pages, to calculate where the
! 	 * record ends.
! 	 */
! 	StartPos = ptr;
  
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 		 * segment, we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops). Otherwise the XLOG_SWITCH
! 		 * record should consume all the remaining space on the current segment.
  		 */
! 		Assert(size == SizeOfXLogRecord);
! 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
  
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
! 		}
! 		else
  		{
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
! 		}
! 	}
! 	else
! 	{
! 		/* A normal record, ie. not xlog-switch */
! 		int sizeleft = size;
! 		while (freespace < sizeleft)
! 		{
! 			/* fill this page, and continue on next page */
! 			sizeleft -= freespace;
! 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
! 
! 			updrqst = true;
  		}
+ 		/* the rest fits on this page */
+ 		ptr.xrecoff += sizeleft;
  
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 	}
  
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
  
! 	return true;
! }
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
  	}
+ }
+ 
+ /*
+  * Get a pointer to the right location in the WAL buffer containing the
+  * given XLogRecPtr.
+  *
+  * If the page is not initialized yet, it is initialized. That might require
+  * evicting an old dirty buffer from the buffer cache, which means I/O.
+  *
+  * The caller must ensure that the page containing the requested location
+  * isn't evicted yet, and won't be evicted, by holding onto an
+  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
+  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
+  * if we have to evict a buffer, we might have to wait for someone else to
+  * finish a write. And that someone else might not be able to finish the write
+  * if our CurrPos points to a buffer that's still in the buffer cache.
+  */
+ static char *
+ GetXLogBuffer(XLogRecPtr ptr)
+ {
+ 	int			idx;
+ 	XLogRecPtr	endptr;
+ 
+ 	/*
+ 	 * The XLog buffer cache is organized so that we can easily calculate the
+ 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
+ 	 * A page must always be loaded to a particular buffer.
+ 	 */
+ 	idx = XLogRecPtrToBufIdx(ptr);
+ 
+ 	/*
+ 	 * See what page is loaded in the buffer at the moment. It could be the
+ 	 * page we're looking for, or something older. It can't be anything
+ 	 * newer - that would imply the page we're looking for has already
+ 	 * been written out to disk, which shouldn't happen as long as the caller
+ 	 * has set its slot's CurrPos correctly.
+ 	 *
+ 	 * However, we don't hold a lock while we read the value. If someone has
+ 	 * just initialized the page, it's possible that we get a "torn read",
+ 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
+ 	 * AdvanceXLInsertBuffer) and retry if we see anything other than the page
+ 	 * we're looking for. But it means that when we do this unlocked read, we
+ 	 * might see a value that appears to be ahead of the page we're looking
+ 	 * for. So don't PANIC on that, until we've verified the value while
+ 	 * holding the lock.
+ 	 */
+ 	endptr = XLogCtl->xlblocks[idx];
+ 	if (ptr.xlogid != endptr.xlogid ||
+ 		!(ptr.xrecoff < endptr.xrecoff &&
+ 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
+ 	{
+ 		AdvanceXLInsertBuffer(ptr, false);
+ 		endptr = XLogCtl->xlblocks[idx];
+ 
+ 		if (ptr.xlogid != endptr.xlogid ||
+ 			!(ptr.xrecoff < endptr.xrecoff &&
+ 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
+ 		{
+ 			elog(PANIC, "could not find WAL buffer for %X/%X",
+ 				 ptr.xlogid, ptr.xrecoff);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Found the buffer holding this page. Return a pointer to the right
+ 	 * offset within the page.
+ 	 */
+ 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
+ 		ptr.xrecoff % XLOG_BLCKSZ;
+ }
+ 
+ /*
+  * Advance an XLogRecPtr to the first valid insertion location on the next
+  * page, right after the page header. An XLogRecPtr pointing to a boundary,
+  * ie. the first byte of a page, is taken to belong to the previous page.
+  */
+ static XLogRecPtr
+ AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
+ {
+ 	int			freespace;
+ 
+ 	freespace = INSERT_FREESPACE(ptr);
+ 	XLByteAdvance(ptr, freespace);
+ 	if (ptr.xrecoff % XLogSegSize == 0)
+ 		ptr.xrecoff += SizeOfXLogLongPHD;
  	else
+ 		ptr.xrecoff += SizeOfXLogShortPHD;
+ 
+ 	return ptr;
+ }
+ 
+ /*
+  * Wait for any insertions < upto to finish.
+  *
+  * Returns a value >= upto, which indicates the oldest in-progress insertion
+  * that we saw in the array, or CurrPos if there are no insertions in-progress
+  * at exit.
+  */
+ static XLogRecPtr
+ WaitXLogInsertionsToFinish(XLogRecPtr upto, XLogRecPtr CurrPos)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	volatile XLogInsertSlot *slot;
+ 	XLogRecPtr	slotptr = InvalidXLogRecPtr;
+ 	XLogRecPtr	LastPos;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LastPos = CurrPos;
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
+ 	lastslot = Insert->lastslot;
+ 	nextslot = Insert->nextslot;
+ 
+ 	/* Skip over slots that have finished already */
+ 	while (lastslot != nextslot)
  	{
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
  
! 		if (XLogRecPtrIsInvalid(slotptr))
  		{
! 			lastslot = NextSlotNo(lastslot);
! 			SpinLockRelease(&slot->lck);
  		}
  		else
  		{
! 			/*
! 			 * This insertion is still in-progress. Wait for it to finish
! 			 * if it's <= upto, otherwise we're done.
! 			 */
! 			Insert->lastslot = lastslot;
! 
! 			if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, slotptr))
! 			{
! 				LastPos = slotptr;
! 				SpinLockRelease(&slot->lck);
! 				break;
! 			}
! 
! 			/* wait */
! 			MyProc->lwWaiting = true;
! 			MyProc->lwWaitMode = 0; /* doesn't matter */
! 			MyProc->lwWaitLink = NULL;
! 			if (slot->head == NULL)
! 				slot->head = MyProc;
! 			else
! 				slot->tail->lwWaitLink = MyProc;
! 			slot->tail = MyProc;
! 			SpinLockRelease(&slot->lck);
! 			LWLockRelease(WALInsertTailLock);
! 			for (;;)
! 			{
! 				PGSemaphoreLock(&MyProc->sem, false);
! 				if (!MyProc->lwWaiting)
! 					break;
! 				extraWaits++;
! 			}
! 			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 			lastslot = Insert->lastslot;
! 			nextslot = Insert->nextslot;
  		}
  	}
  
+ 	Insert->lastslot = lastslot;
+ 	LWLockRelease(WALInsertTailLock);
+ 
+ 	while (extraWaits-- > 0)
+ 		PGSemaphoreUnlock(&MyProc->sem);
+ 
+ 	return LastPos;
+ }
+ 
+ /*
+  * Wait for the next insertion slot to become vacant.
+  */
+ static void
+ WaitForXLogInsertionSlotToBecomeFree(void)
+ {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 	int			lastslot;
+ 	int			nextslot;
+ 	int			extraWaits = 0;
+ 
+ 	if (MyProc == NULL)
+ 		elog(PANIC, "cannot wait without a PGPROC structure");
+ 
+ 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+ 
  	/*
! 	 * Re-read lastslot and nextslot, now that we have the wait-lock.
! 	 * We're reading nextslot without holding insertpos_lck. It could advance
! 	 * at the same time, but it can't advance beyond lastslot - 1.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
  
! 	/*
! 	 * If there are still no slots available, wait for the oldest slot to
! 	 * become vacant.
! 	 */
! 	while (NextSlotNo(nextslot) == lastslot)
  	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
  
! 		SpinLockAcquire(&slot->lck);
! 		if (XLogRecPtrIsInvalid(slot->CurrPos))
! 		{
! 			SpinLockRelease(&slot->lck);
! 			break;
! 		}
! 
! 		/* wait */
! 		MyProc->lwWaiting = true;
! 		MyProc->lwWaitMode = 0; /* doesn't matter */
! 		MyProc->lwWaitLink = NULL;
! 		if (slot->head == NULL)
! 			slot->head = MyProc;
! 		else
! 			slot->tail->lwWaitLink = MyProc;
! 		slot->tail = MyProc;
! 		SpinLockRelease(&slot->lck);
! 		LWLockRelease(WALInsertTailLock);
! 		for (;;)
! 		{
! 			PGSemaphoreLock(&MyProc->sem, false);
! 			if (!MyProc->lwWaiting)
! 				break;
! 			extraWaits++;
! 		}
! 		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 		lastslot = Insert->lastslot;
! 		nextslot = Insert->nextslot;
  	}
  
! 	/*
! 	 * Ok, there is at least one empty slot now. That's enough for our
! 	 * insertion, but while we're at it, advance lastslot as much as we
! 	 * can. That way we don't need to come back here on the next call
! 	 * again.
! 	 */
! 	while (lastslot != nextslot)
! 	{
! 		volatile XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		/*
! 		 * Don't need to grab the slot's spinlock here, because we're not
! 		 * interested in the exact value of CurrPos, only whether it's
! 		 * valid or not.
! 		 */
! 		if (!XLogRecPtrIsInvalid(slot->CurrPos))
! 			break;
  
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 	Insert->lastslot = lastslot;
  
! 	LWLockRelease(WALInsertTailLock);
  }
  
  /*
***************
*** 1488,1522 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 2036,2069 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1524,1535 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2071,2087 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1537,1581 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
! 		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2089,2138 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr, InvalidXLogRecPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1583,1596 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2140,2146 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1600,1612 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
! 
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
  
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2150,2159 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1650,1660 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2197,2224 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, upto %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
  
  /*
***************
*** 1699,1714 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2263,2274 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1726,1732 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2286,2292 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1757,1770 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2317,2330 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1861,1876 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2421,2433 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1960,1967 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2517,2522 ----
***************
*** 2124,2131 **** XLogFlush(XLogRecPtr record)
  	 */
  	for (;;)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
--- 2679,2689 ----
  	 */
  	for (;;)
  	{
! 		/* use volatile pointers to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2139,2144 **** XLogFlush(XLogRecPtr record)
--- 2697,2731 ----
  			break;
  
  		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock),
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)               /* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, insertpos);
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2155,2186 **** XLogFlush(XLogRecPtr record)
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2742,2754 ----
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2292,2314 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2860,2890 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr, InvalidXLogRecPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5102,5107 **** XLOGShmemSize(void)
--- 5678,5686 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5117,5122 **** XLOGShmemInit(void)
--- 5696,5702 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5142,5147 **** XLOGShmemInit(void)
--- 5722,5740 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 1;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5156,5166 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5749,5760 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 6038,6043 **** StartupXLOG(void)
--- 6632,6638 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6844,6851 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7439,7450 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6853,6878 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7452,7474 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndRecPtr.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6884,6890 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7480,7486 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7390,7396 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7986,7992 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7666,7671 **** CreateCheckPoint(int flags)
--- 8262,8268 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7734,7743 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8331,8340 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7749,7755 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8346,8352 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7758,7772 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8355,8366 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7793,7806 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8387,8396 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7826,7832 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8416,8422 ----
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7846,7852 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8436,8442 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8213,8227 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8803,8817 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8412,8418 **** XLogPutNextOid(Oid nextOid)
  XLogRecPtr
  RequestXLogSwitch(void)
  {
! 	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
--- 9002,9008 ----
  XLogRecPtr
  RequestXLogSwitch(void)
  {
! 	XLogRecPtr	EndPos;
  	XLogRecData rdata;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
***************
*** 8421,8429 **** RequestXLogSwitch(void)
  	rdata.len = 0;
  	rdata.next = NULL;
  
! 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
! 	return RecPtr;
  }
  
  /*
--- 9011,9031 ----
  	rdata.len = 0;
  	rdata.next = NULL;
  
! 	EndPos = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
! 	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
! 	 * if the xlog switch had no work to do, ie. if we were already at the
! 	 * beginning of a new XLOG segment. You can check if RecPtr points to
! 	 * beginning of a segment if you want to keep the distinction.
! 	 */
! 	TRACE_POSTGRESQL_XLOG_SWITCH();
! 
! 	/*
! 	 * XLogInsert returns a pointer to the end of segment, but we want
! 	 * to return a pointer to just the end of the xlog-switch record. The
! 	 * rest of the segment is padded out.
! 	 */
! 	return EndPos;
  }
  
  /*
***************
*** 8501,8522 **** XLogReportParameters(void)
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we can guarantee that there is no concurrently running
! 	 * process which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
--- 9103,9143 ----
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
+  *
+  * Note: this function assumes there is no other process running
+  * concurrently that could update it.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we assume that there is no concurrently running process
! 	 * which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
+ 	START_CRIT_SECTION();
+ 
+ 	/*
+ 	 * It's always safe to take full page images, even when not strictly
+ 	 * required, but not the other way round. So if we're setting full_page_writes
+ 	 * to true, first set it true and then write the WAL record. If we're
+ 	 * setting it to false, first write the WAL record and then set the
+ 	 * global flag.
+ 	 */
+ 	if (fullPageWrites)
+ 	{
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		Insert->fullPageWrites = true;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 	}
+ 
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
***************
*** 8532,8543 **** UpdateFullPageWrites(void)
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 	else
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		Insert->fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
  /*
--- 9153,9166 ----
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 
! 	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
! 		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
+ 	END_CRIT_SECTION();
  }
  
  /*
***************
*** 9063,9068 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9686,9692 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9125,9150 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9749,9774 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9257,9269 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9881,9893 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9347,9356 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
--- 9971,9981 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
  		Assert(XLogCtl->Insert.exclusiveBackup);
***************
*** 9367,9373 **** pg_start_backup_callback(int code, Datum arg)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9992,9998 ----
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9380,9385 **** pg_start_backup_callback(int code, Datum arg)
--- 10005,10011 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9433,9441 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10059,10067 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9444,9459 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10070,10085 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9731,9746 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10357,10374 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
! 
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9794,9805 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10422,10433 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#48Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#47)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Fri, Feb 17, 2012 at 7:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 17.02.2012 07:27, Fujii Masao wrote:

Got another problem: when I ran pg_stop_backup to take an online backup,
it got stuck until I had generated new WAL record. This happens because,
in the patch, when pg_stop_backup forces a switch to new WAL file, old
WAL file is not marked as archivable until next new WAL record has been
inserted, but pg_stop_backup keeps waiting for that WAL file to be
archived.
OTOH, without the patch, WAL file is marked as archivable as soon as WAL
file switch occurs.

So, in short, the patch seems to handle the WAL file switch logic
incorrectly.

Yep. For a WAL-switch record, XLogInsert returns the location of the end of
the record, not the end of the empty padding space. So when the caller
flushed up to that point, it didn't flush the empty space and therefore
didn't notify the archiver.

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

Thanks for the testing!

Hi Heikki,

Sorry for the week-long radio silence; I haven't been able to find
much time during the week.  I'll try to extract my test case from its
quite messy testing harness and get a self-contained version, but it
will probably take a week or two to do it. I can probably refactor it
to rely just on Perl and the modules DBI, DBD::Pg, IO::Pipe and
Storable. Some of those are not core Perl modules, but they are all
common ones. Would that be a good option?

I've tested your v9 patch. I no longer see any inconsistencies or
lost transactions in the recovered database. But occasionally I get
databases that fail to recover at all.
It has always been with the exact same failed assertion, at xlog.c line 2154.

I've only seen this 4 times out of 2202 cycles of crash and recovery,
so it must be some rather obscure situation.

LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/180001B0
LOG: unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
LOG: redo done at 0/19083FD0
LOG: last completed transaction was at log time 2012-02-17 11:13:50.369488-08
LOG: checkpoint starting: end-of-recovery immediate
TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line:
2154)
LOG: startup process (PID 5390) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

Cheers,

Jeff

#49Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#47)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

I was trying to understand this patch and had a few doubts:

1. In PerformXLogInsert(), why is there a need to check freespace when
the space has already been reserved in ReserveXLogInsertLocation()?
Is it possible for the record size to be larger than what was
calculated in ReserveXLogInsertLocation()? If so, what I understand is
that it moves to the next page to write; however, isn't it possible
that some other backend has already reserved that space?

2. In WaitForXLogInsertionSlotToBecomeFree(), when nextslot equals
lastslot, all new backends trying to reserve a slot will start waiting
on the same last slot, which can serialize those backends and impact
latency.

3. GetXLogBuffer() - This will get called twice: once for the normal
buffer, and a second time when there is not enough space on the
current page. Both calls can lead to I/O, whereas in the earlier
algorithm there was a chance of I/O only once.
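
To make sure I read the design correctly, here is a minimal sketch of
the reserve-then-copy pattern as I understand it. This is my own
illustrative code with made-up names, not the patch's actual functions,
and it deliberately ignores page and buffer wraparound, which is exactly
where doubts 1 and 3 above arise:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

static pthread_mutex_t insertpos_lck = PTHREAD_MUTEX_INITIALIZER;
static uint64_t CurrPos = 0;            /* next byte to hand out */
static char wal_buffer[1024 * 1024];    /* stand-in for the shared WAL buffers */

/* Serialized part: hand out the byte range [start, start + len). */
static uint64_t
ReserveInsertLocation(size_t len)
{
	uint64_t	start;

	pthread_mutex_lock(&insertpos_lck);
	start = CurrPos;
	CurrPos += len;
	pthread_mutex_unlock(&insertpos_lck);
	return start;
}

/* Parallel part: reserved ranges never overlap, so no lock is needed.
 * NB: this naive copy does not split records that wrap past the end of
 * wal_buffer or cross page headers; a real implementation must. */
static void
PerformInsert(uint64_t start, const void *data, size_t len)
{
	memcpy(wal_buffer + (start % sizeof(wal_buffer)), data, len);
}

int
main(void)
{
	const char rec[] = "some WAL record payload";
	uint64_t	at = ReserveInsertLocation(sizeof(rec));

	PerformInsert(at, rec, sizeof(rec));
	return 0;
}

If that is the right picture, my guess is that the freespace re-check in
PerformXLogInsert() exists for crossing page boundaries and page headers,
not because another backend could have taken the reserved space; but I
would like to confirm that.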

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Heikki Linnakangas
Sent: Friday, February 17, 2012 9:07 PM
To: Fujii Masao
Cc: Jeff Janes; Robert Haas; PostgreSQL-development
Subject: Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work
outside WALInsertLock)

On 17.02.2012 07:27, Fujii Masao wrote:

Got another problem: when I ran pg_stop_backup to take an online
backup, it got stuck until I had generated new WAL record. This
happens because, in the patch, when pg_stop_backup forces a switch to
new WAL file, old WAL file is not marked as archivable until next new
WAL record has been inserted, but pg_stop_backup keeps waiting for that

WAL file to be archived.

OTOH, without the patch, WAL file is marked as archivable as soon as
WAL file switch occurs.

So, in short, the patch seems to handle the WAL file switch logic

incorrectly.

Yep. For a WAL-switch record, XLogInsert returns the location of the end of
the record, not the end of the empty padding space. So when the caller
flushed up to that point, it didn't flush the empty space and therefore
didn't notify the archiver.

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

Thanks for the testing!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#50Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Janes (#48)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Sun, Feb 19, 2012 at 3:01 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

I've tested your v9 patch.  I no longer see any inconsistencies or
lost transactions in the recovered database.  But occasionally I get
databases that fail to recover at all.
It has always been with the exact same failed assertion, at xlog.c line 2154.

I've only seen this 4 times out of 2202 cycles of crash and recovery,
so it must be some rather obscure situation.

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/180001B0
LOG:  unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
LOG:  redo done at 0/19083FD0
LOG:  last completed transaction was at log time 2012-02-17 11:13:50.369488-08
LOG:  checkpoint starting: end-of-recovery immediate
TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line:
2154)
LOG:  startup process (PID 5390) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

I could reproduce this when I made the server crash just after executing
"select pg_switch_xlog()".

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "select pg_switch_xlog()"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start
...
LOG: redo done at 0/16E3B0C
TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line:
2154)
LOG: startup process (PID 16361) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

Though I've not read the new patch yet, I suspect the xlog switch code
still has a bug.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#51Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#47)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

This patch includes not only the xlog scaling improvement but also other
changes. I think it's better to extract those as separate patches and
commit them first. If we do so, the main patch will become more readable.
For example, I think the following could be extracted as separate patches:

(1) Make walwriter try to initialize as many of the no-longer-needed
WAL buffers for future use as it can.

(2) Refactor the "update full_page_writes code".

(3) Get rid of XLogCtl->Write.LogwrtResult and XLogCtl->Insert.LogwrtResult.

(4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no
work to do.

Others?

I'm not sure if (3) makes sense. In current master, those two shared
variables are used to reduce contention on XLogCtl->info_lck and
WALWriteLock. Do you think they have no effect on reducing the lock
contention?

In some places, one spinlock is taken while another is being held; for
example, in CreateRestartPoint(), "info_lck" is acquired while
"insertpos_lck" is held. Is this OK? What if the inner spinlock
unfortunately takes a long time to acquire?
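
For reference, this is the shape I am worried about, as an illustrative
sketch with pthread mutexes standing in for the spinlocks (the names
mirror the patch, but this is not the patch's code):

#include <pthread.h>

static pthread_mutex_t insertpos_lck = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t info_lck = PTHREAD_MUTEX_INITIALIZER;

/*
 * As in CreateRestartPoint(): info_lck is acquired while insertpos_lck
 * is already held. This is deadlock-free only if every path that takes
 * both locks takes them in this same order, and the nested section must
 * stay tiny, because anyone spinning on insertpos_lck is stuck for the
 * whole time both locks are held.
 */
static void
update_redo_recptr(void)
{
	pthread_mutex_lock(&insertpos_lck);		/* outer */
	pthread_mutex_lock(&info_lck);			/* inner: keep this tiny */
	/* ... update Insert.RedoRecPtr here ... */
	pthread_mutex_unlock(&info_lck);
	pthread_mutex_unlock(&insertpos_lck);
}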

+	 * An xlog-switch record consumes all the remaining space on the
+	 * WAL segment. We have already reserved it for us, but we still need
+	 * to make sure it's been allocated and zeroed in the WAL buffers so
+	 * that when the caller (or someone else) does XLogWrite(), it can
+	 * really write out all the zeros.

Why do we need to write out all the remaining space with zeros? In
current master, we don't do that. The recovery code ignores the data
following an XLOG_SWITCH record, so I don't think that's required.
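
For what it's worth, what I mean is that on seeing an XLOG_SWITCH record
the reader simply jumps to the start of the next segment, so the padding
bytes are never examined. A sketch of that arithmetic, assuming the
default 16MB segment size (illustrative, not the actual recovery code):

#include <stdint.h>

#define XLogSegSize (16 * 1024 * 1024)	/* assumed default */

/* After an XLOG_SWITCH record ending at byte offset 'xrecoff', recovery
 * resumes reading at the next segment boundary. */
static uint32_t
next_segment_start(uint32_t xrecoff)
{
	return (xrecoff / XLogSegSize + 1) * XLogSegSize;
}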

+	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+	 * if the xlog switch had no work to do, ie. if we were already at the
+	 * beginning of a new XLOG segment. You can check if RecPtr points to
+	 * beginning of a segment if you want to keep the distinction.
+	 */
+	TRACE_POSTGRESQL_XLOG_SWITCH();

If so, shouldn't RecPtr (or a flag indicating whether the xlog switch
had any work to do) be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an
argument?

The following are trivial comments:

+ * insertion, but ẃhile we're at it, advance lastslot as much as we

Typo: 'ẃ' should be 'w'

In XLogRecPtrToBufIdx() and XLogRecEndPtrToBufIdx(), why don't you use
XLogFileSize instead of XLogSegsPerFile * XLogSegSize?

There are some source code comments which still refer to WALInsertLock.

It's cleaner if pg_start_backup_callback() uses Insert instead of
XLogCtl->Insert, like do_pg_start_backup(), do_pg_stop_backup() and
do_pg_abort_backup() do.

+ freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;

Though this is not incorrect, it's better to use EndOfLog instead of
EndRecPtr, like the nearby code does.

while (extraWaits-- > 0)
PGSemaphoreUnlock(&MyProc->sem);

Shouldn't this also be executed in WaitForXLogInsertionSlotToBecomeFree()?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#52Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#51)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

When I ran the long-running performance test, I encountered the following
panic error.

PANIC: could not find WAL buffer for 0/FF000000

0/FF000000 is the xlog file boundary, so the patch seems to handle that
boundary incorrectly. In the patch, the current insertion LSN is
advanced by directly incrementing XLogRecPtr.xrecoff. But to handle the
xlog file boundary correctly, shouldn't we use XLByteAdvance() for that
instead?
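
To illustrate, here is a sketch modeled on what I believe XLByteAdvance()
does (illustrative code, not the real macro; the file size constant is
255 segments of 16MB, i.e. 0xFF000000, matching the address in the PANIC
message):

#include <stdint.h>

typedef struct SketchRecPtr
{
	uint32_t	xlogid;		/* log file #, the "0" in 0/FF000000 */
	uint32_t	xrecoff;	/* byte offset within the log file */
} SketchRecPtr;

#define SKETCH_XLOG_FILE_SIZE 0xFF000000u	/* 255 segments x 16MB */

/* Boundary-safe advance, assuming nbytes is less than one segment: a
 * bare "ptr->xrecoff += nbytes" runs past 0/FF000000 instead of rolling
 * over into the next log file. */
static void
advance_recptr(SketchRecPtr *ptr, uint32_t nbytes)
{
	if (ptr->xrecoff + nbytes >= SKETCH_XLOG_FILE_SIZE)
	{
		ptr->xlogid += 1;
		ptr->xrecoff = ptr->xrecoff + nbytes - SKETCH_XLOG_FILE_SIZE;
	}
	else
		ptr->xrecoff += nbytes;
}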

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#53Jeff Janes
jeff.janes@gmail.com
In reply to: Fujii Masao (#52)
3 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Feb 21, 2012 at 5:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

When I ran the long-running performance test, I encountered the following
panic error.

   PANIC:  could not find WAL buffer for 0/FF000000

I too see this panic when the system survives long enough to get to
that log switch.

But I'm also still seeing (with version 9) the assert failure at
"xlog.c", Line: 2154 during the end-of-recovery checkpoint.

Here is a setup for repeating my tests.  I used this test simply
because I had it sitting around after having written it for other
purposes. Indeed I'm not all that sure I should publish it.
Hopefully other people will write other tests which exercise other
corner cases, rather than exercising the same ones I am.

The patch creates a guc which causes the md writer routine to panic
and bring down the database, triggering recovery, after a given number
of writes.  In this context probably any other method of forcing a
crash and recovery would be just as good as this specific method of
crashing.

The choice of 400 for the cutoff for crashing is based on:

1) If the number is too low, you re-crash within recovery so you never
get a chance to inspect the database. In my hands, recovery doesn't
need to do more than 400 writes. (I don't know how to make the
database use a different guc setting during recovery than it did before
the crash).

2) If the number is too high, it takes too long for a crash to happen
and I'm not all that patient.

Some of the changes to postgresql.conf.sample are purely my
preferences and have nothing in particular to do with this set up.
But archive_timeout = 30 is necessary in order to get checkpoints, and
thus mdwrites, to happen often enough to trigger crashes often enough
to satisfy my impatience.

The Perl script exercises the integrity of the database by launching
multiple processes (4 by default) to run updates and memorize what
updates they have run. After a crash, the Perl processes all
communicate their data up to the parent, which consolidates that
information and then queries the post-recovery database to make sure
it agrees. Transactions that are in-flight at the time of a crash are
indeterminate. Maybe the crash happened before the commit, and maybe
it happened after the commit but before we received notification of
the commit. So whichever way those turn out, it is not proof of
corruption.

With the xloginsert-scale-9.patch, the above features are not needed
because the problem is not that the database is incorrect after
recovery, but that the database doesn't recover in the first place. So
just running pgbench would be good enough to detect that. But in
earlier versions this feature did detect incorrect recovery.

This logs an awful lot of stuff, most of which merely indicates normal
operation. The problem is that corruption is rare, so if you wait
until you see corruption before turning on logging, then you have to
wait a long time to get another instance of corruption so you can
dissect the log information. So, I just log everything all of the
time.
A warning from 'line 63' which is not marked as in-flight indicates
database corruption. A warning from 'line 66' indicates even worse
corruption. A failure of the entire outer script to execute for the
expected number of iterations (i.e. failure of the warning issued on
'line 18' to show up 100 times) indicates the database failed to
restart.

Also attached is a bash script that exercises the whole thing. Note
that it has various directories hard-coded that really ought not to be,
and that it has no compunctions about calling rm -r /tmp/data.  I run
it as "./do.sh >& log" and then inspect the log file for unusual
lines.

To run this, you first have to apply your own xlog patch, and apply my
crash-inducing patch, and build and install the resulting pgsql. And
edit the shell script to point to it, etc. The whole thing is a bit
of an idiosyncratic mess.

Cheers,

Jeff

Attachments:

crash_REL9_2CF4.patchapplication/octet-stream; name=crash_REL9_2CF4.patchDownload
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
new file mode 100644
index 5d7888a..4aa7115
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** checkDataDir(void)
*** 1235,1241 ****
  	 * reasonable check to apply on Windows.
  	 */
  #if !defined(WIN32) && !defined(__CYGWIN__)
! 	if (stat_buf.st_mode & (S_IRWXG | S_IRWXO))
  		ereport(FATAL,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("data directory \"%s\" has group or world access",
--- 1235,1241 ----
  	 * reasonable check to apply on Windows.
  	 */
  #if !defined(WIN32) && !defined(__CYGWIN__)
! 	if (0 && stat_buf.st_mode & (S_IRWXG | S_IRWXO))
  		ereport(FATAL,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("data directory \"%s\" has group or world access",
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index bfc9f06..bf5cfe7
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 59,64 ****
--- 59,66 ----
  #define FILE_POSSIBLY_DELETED(err)	((err) == ENOENT || (err) == EACCES)
  #endif
  
+ int JJ_torn_page=0;
+ 
  /*
   *	The magnetic disk storage manager keeps track of open file
   *	descriptors in its own descriptor pool.  This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 689,694 ****
--- 691,697 ----
  	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
+         static int counter=0;
  
  	/* This assert is too expensive to have on normally ... */
  #ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 713,719 ****
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
! 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
--- 716,733 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page) {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! 		ereport(FATAL,
! 				(errcode(ERRCODE_DISK_FULL),
! 				 errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! 						blocknum,
! 						relpath(reln->smgr_rnode, forknum),
! 						nbytes, BLCKSZ),
! 				 errhint("JJ is screwing with the database.")));
!         } else {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! 	}
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index a5becbe..b6d8ebf
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 127,132 ****
--- 127,133 ----
  /* XXX these should appear in other modules' header files */
  extern bool Log_disconnections;
  extern int	CommitDelay;
+ int	JJ_torn_page;
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2031,2036 ****
--- 2032,2046 ----
  	},
  
  	{
+ 		{"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ 			NULL
+ 		},
+ 		&JJ_torn_page,
+ 		0, 0, 100000, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the minimum concurrent open transactions before performing "
  						 "commit_delay."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
new file mode 100644
index 96da086..1cdebbb
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 159,165 ****
  
  # - Settings -
  
! #wal_level = minimal			# minimal, archive, or hot_standby
  					# (change requires restart)
  #fsync = on				# turns forced synchronization on or off
  #synchronous_commit = on		# synchronization level; on, off, or local
--- 159,165 ----
  
  # - Settings -
  
! wal_level = archive			# minimal, archive, or hot_standby
  					# (change requires restart)
  #fsync = on				# turns forced synchronization on or off
  #synchronous_commit = on		# synchronization level; on, off, or local
***************
*** 180,199 ****
  
  # - Checkpoints -
  
! #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
! #checkpoint_timeout = 5min		# range 30s-1h
  #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
  #checkpoint_warning = 30s		# 0 disables
  
  # - Archiving -
  
! #archive_mode = off		# allows archiving to be done
  				# (change requires restart)
! #archive_command = ''		# command to use to archive a logfile segment
  				# placeholders: %p = path of file to archive
  				#               %f = file name only
  				# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
! #archive_timeout = 0		# force a logfile segment switch after this
  				# number of seconds; 0 disables
  
  
--- 180,199 ----
  
  # - Checkpoints -
  
! checkpoint_segments = 1		# in logfile segments, min 1, 16MB each
! checkpoint_timeout = 30s		# range 30s-1h
  #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
  #checkpoint_warning = 30s		# 0 disables
  
  # - Archiving -
  
! archive_mode = on		# allows archiving to be done
  				# (change requires restart)
! archive_command = 'echo archive_command %p %f `date`'		# command to use to archive a logfile segment
  				# placeholders: %p = path of file to archive
  				#               %f = file name only
  				# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
! archive_timeout = 30		# force a logfile segment switch after this
  				# number of seconds; 0 disables
  
  
***************
*** 382,394 ****
  #debug_print_rewritten = off
  #debug_print_plan = off
  #debug_pretty_print = on
! #log_checkpoints = off
  #log_connections = off
  #log_disconnections = off
  #log_duration = off
  #log_error_verbosity = default		# terse, default, or verbose messages
  #log_hostname = off
! #log_line_prefix = ''			# special values:
  					#   %a = application name
  					#   %u = user name
  					#   %d = database name
--- 382,394 ----
  #debug_print_rewritten = off
  #debug_print_plan = off
  #debug_pretty_print = on
! log_checkpoints = on
  #log_connections = off
  #log_disconnections = off
  #log_duration = off
  #log_error_verbosity = default		# terse, default, or verbose messages
  #log_hostname = off
! log_line_prefix = '%p %i %m:'			# special values:
  					#   %a = application name
  					#   %u = user name
  					#   %d = database name
count.plapplication/octet-stream; name=count.plDownload
do.shapplication/x-sh; name=do.shDownload
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#49)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 20.02.2012 08:00, Amit Kapila wrote:

I was trying to understand this patch and had a few doubts:

1. In PerformXLogInsert(), why is there a need to check freespace when
the space has already been reserved during ReserveXLogInsertLocation()?
Is it possible that the record size is more than what was actually
calculated in ReserveXLogInsertLocation()? If so, what I understand is
that it moves to the next page to write; however, isn't it possible
that some other backend had already reserved that space?

The calculations between PerformXLogInsert (called CopyXLogRecordToWAL()
in the latest patch version) and ReserveXLogInsertLocation() must always
match; otherwise we have reserved an incorrect amount of WAL and you get
corrupt WAL. They both need to do the same calculations of how the WAL
record is split across pages, which depends on how much free space there
is on the first page. There is an assertion in CopyXLogRecordToWAL() to
check that once it's finished writing the WAL record, the last byte
landed on the position that ReserveXLogInsertLocation() calculated it
would.

Another way to do that would be to remember the calculations done in
ReserveXLogInsertLocation(), in an extra array or something. But we want
to keep ReserveXLogInsertLocation() as simple as possible, as that runs
while holding the spinlock. Any extra CPU cycles there will hurt
scalability.
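
To illustrate the kind of arithmetic the two functions must agree on,
here is a rough C sketch (not the patch's code: continuation-record
headers are ignored for brevity and the constants are made up):

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ        8192   /* WAL page size */
#define SizeOfXLogShortPHD 24     /* header at the start of each page */

/*
 * Compute the byte position just past the last byte of a record that
 * starts at 'startpos' and is 'tot_len' bytes long, accounting for the
 * page header at the start of every WAL page the record spills onto.
 */
static uint64_t
end_of_record(uint64_t startpos, uint32_t tot_len)
{
    uint64_t pos = startpos;
    uint32_t remaining = tot_len;

    while (remaining > 0)
    {
        uint32_t freespace;

        if (pos % XLOG_BLCKSZ == 0)
            pos += SizeOfXLogShortPHD;   /* skip the page header */
        freespace = XLOG_BLCKSZ - (uint32_t) (pos % XLOG_BLCKSZ);
        if (remaining <= freespace)
        {
            pos += remaining;
            remaining = 0;
        }
        else
        {
            pos += freespace;            /* fill this page, continue */
            remaining -= freespace;
        }
    }
    return pos;
}

int
main(void)
{
    /* a 10000-byte record starting 100 bytes into a page spills onto
     * the next page; both sides must arrive at the same answer: 10124 */
    printf("%llu\n", (unsigned long long) end_of_record(100, 10000));
    return 0;
}

If the reserving side and the copying side ever disagreed on this
arithmetic, the assertion mentioned above would fire.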

2. In WaitForXLogInsertionSlotToBecomeFree(), when nextslot equals
lastslot, all new backends trying to reserve a slot will start waiting
on the same last slot, which can lead to serialization for those
backends and can impact latency.

True. That warrants some performance testing to see if that effect is
significant. (it's surely better than the current situation, anyway,
where all WAL insertions block on the single lock)

3. GetXLogBuffer() - This will get called twice: once for the normal
buffer, and a second time when there is not enough space in the current
page. Both times it can lead to I/O, whereas in the earlier algorithm
there is a chance of I/O only once.

I don't see any difference to the previous situation. In both cases, if
you need a new page to copy the WAL record to, you need to first flush
out some old pages from the WAL buffers if they're all dirty. The patch
doesn't change the number of WAL buffers consumed. Note that
GetXLogBuffer() is very cheap when it doesn't need to do I/O, extra
calls to it don't matter if the page is already initialized.
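
To illustrate, a simplified sketch of that fast path (hypothetical
names and constants, not the patch's actual code):

#include <stddef.h>
#include <stdint.h>

#define XLOG_BLCKSZ      8192
#define NUM_XLOG_BUFFERS 16

static char     wal_pages[NUM_XLOG_BUFFERS * XLOG_BLCKSZ];
static uint64_t xlblocks[NUM_XLOG_BUFFERS]; /* end+1 LSN of each mapped page */

/* Slow path: replace the buffer's old page with the one containing ptr.
 * The real code may have to write a dirty old page out first (the I/O
 * case). */
static void
advance_buffers(uint64_t ptr)
{
    int idx = (int) ((ptr / XLOG_BLCKSZ) % NUM_XLOG_BUFFERS);

    xlblocks[idx] = ptr - (ptr % XLOG_BLCKSZ) + XLOG_BLCKSZ;
}

/* Fast path: no lock, just an index computation and one comparison. */
static char *
get_xlog_buffer_sketch(uint64_t ptr)
{
    int      idx = (int) ((ptr / XLOG_BLCKSZ) % NUM_XLOG_BUFFERS);
    uint64_t expected_endptr = ptr - (ptr % XLOG_BLCKSZ) + XLOG_BLCKSZ;

    if (xlblocks[idx] != expected_endptr)
        advance_buffers(ptr);            /* page not mapped yet: slow path */
    return wal_pages + (size_t) idx * XLOG_BLCKSZ + (ptr % XLOG_BLCKSZ);
}

int
main(void)
{
    /* second WAL page, offset 100 into it */
    char *p = get_xlog_buffer_sketch((uint64_t) XLOG_BLCKSZ + 100);

    *p = 0;                              /* write into the mapped buffer */
    return 0;
}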

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#55Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#51)
2 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 21.02.2012 13:19, Fujii Masao wrote:

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

This patch includes not only the xlog scaling improvement but also other ones.
I think it's better to extract them as separate patches and commit them first.
If we do so, the main patch would become more readable.

Good point.

For example, I think that the followings can be extracted as a separate patch.

(1) Make walwriter try to initialize as many of the no-longer-needed
WAL buffers
for future use as we can.

This is pretty hard to extract from the larger patch. The current code
in master assumes that there's only one page that is currently inserted
to, and relies on WALInsertLock being held in AdvanceXLInsertBuffer().
The logic with the scaling patch is quite different.

(2) Refactor the "update full_page_writes code".
(3) Get rid of XLogCtl->Write.LogwrtResult and XLogCtl->Insert.LogwrtResult.

Attached are patches for these two items. Barring objections, I'll
commit these.

(4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no
work to do.

Actually, I think I'll just move it in the patch to keep the existing
behavior.

I'm not sure if (3) makes sense. In current master, those two shared variables
are used to reduce the contention of XLogCtl->info_lck and WALWriteLock.
Do you think they have no effect on reducing the lock contention?

XLogCtl->Write.LogwrtResult certainly seems redundant with
XLogCtl->LogwrtResult. There might be some value in
XLogCtl->Insert.LogwrtResult; it's consulted in AdvanceXLInsertBuffer()
before acquiring info_lck. But I doubt that makes any difference in
practice either. At best it's saving one spinlock acquisition per WAL
buffer, which isn't much compared to all the other work involved.
(once the scaling patch is committed, this point is moot anyway)
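
The locking rule for the consolidated variable, sketched with pthread
mutexes standing in for the spinlock and LWLock (hypothetical stand-in
code, not the actual implementation): readers may hold either lock, the
writer must hold both.

#include <pthread.h>
#include <stdint.h>

typedef struct
{
    uint64_t Write;    /* bytes known written out */
    uint64_t Flush;    /* bytes known fsync'd */
} XLogwrtResult;

static pthread_mutex_t info_lck     = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t WALWriteLock = PTHREAD_MUTEX_INITIALIZER;
static XLogwrtResult   SharedLogwrtResult;

/* Reader that holds neither lock: the spinlock alone is enough.
 * (Code already holding WALWriteLock may instead read it directly.) */
static XLogwrtResult
read_logwrt_result(void)
{
    XLogwrtResult r;

    pthread_mutex_lock(&info_lck);
    r = SharedLogwrtResult;
    pthread_mutex_unlock(&info_lck);
    return r;
}

/* Writer: already behind WALWriteLock after a write/flush, and takes
 * info_lck briefly too, so a reader holding either lock sees a
 * consistent value. */
static void
update_logwrt_result(XLogwrtResult newresult)
{
    pthread_mutex_lock(&WALWriteLock);
    pthread_mutex_lock(&info_lck);
    SharedLogwrtResult = newresult;
    pthread_mutex_unlock(&info_lck);
    pthread_mutex_unlock(&WALWriteLock);
}

int
main(void)
{
    XLogwrtResult r = {8192, 8192};

    update_logwrt_result(r);
    return read_logwrt_result().Flush == 8192 ? 0 : 1;
}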

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

simplify-update-full_page_writes-1.patchtext/x-diff; name=simplify-update-full_page_writes-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..eb7932e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -731,8 +731,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	unsigned	i;
 	bool		updrqst;
 	bool		doPageWrites;
-	bool		isLogSwitch = false;
-	bool		fpwChange = false;
+	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
 	uint8		info_orig = info;
 
 	/* cross-check on whether we should be here or not */
@@ -746,30 +745,11 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
 
 	/*
-	 * Handle special cases/records.
+	 * In bootstrap mode, we don't actually log anything but XLOG resources;
+	 * return a phony record pointer.
 	 */
-	if (rmid == RM_XLOG_ID)
+	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
 	{
-		switch (info)
-		{
-			case XLOG_SWITCH:
-				isLogSwitch = true;
-				break;
-
-			case XLOG_FPW_CHANGE:
-				fpwChange = true;
-				break;
-
-			default:
-				break;
-		}
-	}
-	else if (IsBootstrapProcessingMode())
-	{
-		/*
-		 * In bootstrap mode, we don't actually log anything but XLOG resources;
-		 * return a phony record pointer.
-		 */
 		RecPtr.xlogid = 0;
 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
 		return RecPtr;
@@ -1232,15 +1212,6 @@ begin:;
 		WriteRqst = XLogCtl->xlblocks[curridx];
 	}
 
-	/*
-	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
-	 * in shared memory before releasing WALInsertLock. This ensures that
-	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
-	 * by this change of full_page_writes.
-	 */
-	if (fpwChange)
-		Insert->fullPageWrites = fullPageWrites;
-
 	LWLockRelease(WALInsertLock);
 
 	if (updrqst)
@@ -8517,6 +8488,22 @@ UpdateFullPageWrites(void)
 	if (fullPageWrites == Insert->fullPageWrites)
 		return;
 
+	START_CRIT_SECTION();
+
+	/*
+	 * It's always safe to take full page images, even when not strictly
+	 * required, but not the other round. So if we're setting full_page_writes
+	 * to true, first set it true and then write the WAL record. If we're
+	 * setting it to false, first write the WAL record and then set the
+	 * global flag.
+	 */
+	if (fullPageWrites)
+	{
+		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+		Insert->fullPageWrites = true;
+		LWLockRelease(WALInsertLock);
+	}
+
 	/*
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
 	 * track of full_page_writes during archive recovery, if required.
@@ -8532,12 +8519,15 @@ UpdateFullPageWrites(void)
 
 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
 	}
-	else
+
+
+	if (!fullPageWrites)
 	{
 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-		Insert->fullPageWrites = fullPageWrites;
+		Insert->fullPageWrites = false;
 		LWLockRelease(WALInsertLock);
 	}
+	END_CRIT_SECTION();
 }
 
 /*
remove-extra-logwrtresult-copies-1.patchtext/x-diff; name=remove-extra-logwrtresult-copies-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eb7932e..61eb9bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -289,35 +289,16 @@ static XLogRecPtr RedoStartLSN = {0, 0};
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * We do a lot of pushups to minimize the amount of access to lockable
- * shared memory values.  There are actually three shared-memory copies of
- * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
- *		XLogCtl->LogwrtResult is protected by info_lck
- *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
- *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
- * One must hold the associated lock to read or write any of these, but
- * of course no lock is needed to read/write the unshared LogwrtResult.
- *
- * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
- * right", since both are updated by a write or flush operation before
- * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
- * is that it can be examined/modified by code that already holds WALWriteLock
- * without needing to grab info_lck as well.
- *
- * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
- * but is updated when convenient.	Again, it exists for the convenience of
- * code that is already holding WALInsertLock but not the other locks.
- *
- * The unshared LogwrtResult may lag behind any or all of these, and again
- * is updated when convenient.
+ * To read XLogCtl->LogwrtResult, you must hold either info_lck or
+ * WALWriteLock.  To update it, you need to hold both locks.  The point of
+ * this arrangement is that the value can be examined by code that already
+ * holds WALWriteLock without needing to grab info_lck as well.  In addition
+ * to the shared variable, each backend has a private copy of LogwrtResult,
+ * which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
  *
- * Note that this all works because the request and result positions can only
- * advance forward, never back up, and so we can easily determine which of two
- * values is "more up to date".
- *
  * info_lck is only held long enough to read/update the protected variables,
  * so it's a plain spinlock.  The other locks are held longer (potentially
  * over I/O operations), so we use LWLocks for them.  These locks are:
@@ -354,7 +335,6 @@ typedef struct XLogwrtResult
  */
 typedef struct XLogCtlInsert
 {
-	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
 	int			curridx;		/* current block index in cache */
 	XLogPageHeader currpage;	/* points to header of block in cache */
@@ -388,7 +368,6 @@ typedef struct XLogCtlInsert
  */
 typedef struct XLogCtlWrite
 {
-	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
 	int			curridx;		/* cache index of next block to write */
 	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
 } XLogCtlWrite;
@@ -401,9 +380,14 @@ typedef struct XLogCtlData
 	/* Protected by WALInsertLock: */
 	XLogCtlInsert Insert;
 
+	/*
+	 * Protected by info_lck and WALWriteLock (you must hold either lock to
+	 * read it, but both to update)
+	 */
+	XLogwrtResult LogwrtResult;
+
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
-	XLogwrtResult LogwrtResult;
 	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
 	TransactionId ckptXid;
 	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
@@ -1015,7 +999,7 @@ begin:;
 		}
 
 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-		LogwrtResult = XLogCtl->Write.LogwrtResult;
+		LogwrtResult = XLogCtl->LogwrtResult;
 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
 		{
 			XLogwrtRqst FlushRqst;
@@ -1188,8 +1172,6 @@ begin:;
 			SpinLockRelease(&xlogctl->info_lck);
 		}
 
-		Write->LogwrtResult = LogwrtResult;
-
 		LWLockRelease(WALWriteLock);
 
 		updrqst = false;		/* done already */
@@ -1477,7 +1459,6 @@ static bool
 AdvanceXLInsertBuffer(bool new_segment)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	XLogCtlWrite *Write = &XLogCtl->Write;
 	int			nextidx = NextBufIdx(Insert->curridx);
 	bool		update_needed = true;
 	XLogRecPtr	OldPageRqstPtr;
@@ -1485,10 +1466,6 @@ AdvanceXLInsertBuffer(bool new_segment)
 	XLogRecPtr	NewPageEndPtr;
 	XLogPageHeader NewPage;
 
-	/* Use Insert->LogwrtResult copy if it's more fresh */
-	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
-		LogwrtResult = Insert->LogwrtResult;
-
 	/*
 	 * Get ending-offset of the buffer page we need to replace (this may be
 	 * zero if the buffer hasn't been used yet).  Fall through if it's already
@@ -1516,21 +1493,19 @@ AdvanceXLInsertBuffer(bool new_segment)
 
 		update_needed = false;	/* Did the shared-request update */
 
-		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
-		{
-			/* OK, someone wrote it already */
-			Insert->LogwrtResult = LogwrtResult;
-		}
-		else
+		/*
+		 * Now that we have an up-to-date LogwrtResult value, see if we still
+		 * need to write it or if someone else already did.
+		 */
+		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
 		{
 			/* Must acquire write lock */
 			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-			LogwrtResult = Write->LogwrtResult;
+			LogwrtResult = XLogCtl->LogwrtResult;
 			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
 			{
 				/* OK, someone wrote it already */
 				LWLockRelease(WALWriteLock);
-				Insert->LogwrtResult = LogwrtResult;
 			}
 			else
 			{
@@ -1544,7 +1519,6 @@ AdvanceXLInsertBuffer(bool new_segment)
 				WriteRqst.Flush.xrecoff = 0;
 				XLogWrite(WriteRqst, false, false);
 				LWLockRelease(WALWriteLock);
-				Insert->LogwrtResult = LogwrtResult;
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 			}
 		}
@@ -1697,7 +1671,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 	/*
 	 * Update local LogwrtResult (caller probably did this already, but...)
 	 */
-	LogwrtResult = Write->LogwrtResult;
+	LogwrtResult = XLogCtl->LogwrtResult;
 
 	/*
 	 * Since successive pages in the xlog cache are consecutively allocated,
@@ -1931,8 +1905,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
 		SpinLockRelease(&xlogctl->info_lck);
 	}
-
-	Write->LogwrtResult = LogwrtResult;
 }
 
 /*
@@ -2126,7 +2098,7 @@ XLogFlush(XLogRecPtr record)
 			continue;
 		}
 		/* Got the lock */
-		LogwrtResult = XLogCtl->Write.LogwrtResult;
+		LogwrtResult = XLogCtl->LogwrtResult;
 		if (!XLByteLE(record, LogwrtResult.Flush))
 		{
 			/* try to write/flush later additions to XLOG as well */
@@ -2268,7 +2240,7 @@ XLogBackgroundFlush(void)
 
 	/* now wait for the write lock */
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-	LogwrtResult = XLogCtl->Write.LogwrtResult;
+	LogwrtResult = XLogCtl->LogwrtResult;
 	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
 	{
 		XLogwrtRqst WriteRqst;
@@ -6831,8 +6803,6 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-	XLogCtl->Write.LogwrtResult = LogwrtResult;
-	Insert->LogwrtResult = LogwrtResult;
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#51)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 21.02.2012 13:19, Fujii Masao wrote:

In some places, the spinlock "insertpos_lck" is taken while another
spinlock "info_lck" is being held. Is this OK? What if unfortunately
inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding a lwlock longer, but
nevertheless I don't think that's a problem.

If it does become a problem, we could provide a second copy of
RedoRecPtr that could be read while holding info_lck, and would be
allowed to lag behind slightly, while requiring insertpos_lck to
read/update the main shared copy of RedoRecPtr. The only place that
reads RedoRecPtr while holding info_lck is GetRedoRecPtr(), which would
be happy with a value that lags behind a few milliseconds. We could
still update that copy right after releasing insertpos_lck, so the delay
between the two would be tiny.
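
Sketched out (again with pthread mutexes as stand-ins and hypothetical
variable names; this is not committed code), the arrangement would look
something like:

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t insertpos_lck = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t info_lck      = PTHREAD_MUTEX_INITIALIZER;
static uint64_t RedoRecPtr_main;   /* authoritative, under insertpos_lck */
static uint64_t RedoRecPtr_copy;   /* lagging copy, under info_lck */

static void
update_redo_rec_ptr(uint64_t newptr)
{
    pthread_mutex_lock(&insertpos_lck);
    RedoRecPtr_main = newptr;      /* insertions see this one */
    pthread_mutex_unlock(&insertpos_lck);

    /* refresh the second copy immediately; it can lag only for the
     * instant between these two critical sections */
    pthread_mutex_lock(&info_lck);
    RedoRecPtr_copy = newptr;
    pthread_mutex_unlock(&info_lck);
}

/* GetRedoRecPtr() analogue: a slightly stale value is acceptable here. */
static uint64_t
get_redo_rec_ptr(void)
{
    uint64_t p;

    pthread_mutex_lock(&info_lck);
    p = RedoRecPtr_copy;
    pthread_mutex_unlock(&info_lck);
    return p;
}

int
main(void)
{
    update_redo_rec_ptr(16384);
    return get_redo_rec_ptr() == 16384 ? 0 : 1;
}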

+	 * An xlog-switch record consumes all the remaining space on the
+	 * WAL segment. We have already reserved it for us, but we still need
+	 * to make sure it's been allocated and zeroed in the WAL buffers so
+	 * that when the caller (or someone else) does XLogWrite(), it can
+	 * really write out all the zeros.

Why do we need to write out all the remaining space with zeros? In
current master, we don't do that. The recovery code ignores the data
following an XLOG_SWITCH record, so I don't think that's required.

It's to keep the logic simpler. Before the patch, an xlog-switch just
initialized the next page in the WAL buffers to insert to, to be the
first page in the next segment. With this patch, we rely on a simple
linear mapping from an XLogRecPtr to the WAL buffer that should contain
that page (see XLogRecPtrToBufIdx()). Such a mapping is not possible if
you sometimes skip over pages in the WAL buffers, so we allocate the
buffers for those empty pages, too. Note that this means that an
xlog-switch can blow through all your WAL buffers.
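
To illustrate the linear mapping, here is a simplified 64-bit rendering
of the XLogRecPtrToBufIdx() idea (the constants are made up):

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ 8192
#define NUM_BUFFERS 2048          /* XLogCtl->XLogCacheBlck + 1 */

/* Every page of WAL has exactly one buffer slot it can occupy, so no
 * search is needed; but skipping pages (as the old xlog-switch code
 * effectively did) would break the mapping, hence the skipped pages
 * are allocated and zero-filled instead. */
static int
lsn_to_buf_idx(uint64_t lsn)
{
    return (int) ((lsn / XLOG_BLCKSZ) % NUM_BUFFERS);
}

int
main(void)
{
    printf("%d\n", lsn_to_buf_idx(0));                    /* 0 */
    printf("%d\n", lsn_to_buf_idx(XLOG_BLCKSZ));          /* 1 */
    printf("%d\n",
           lsn_to_buf_idx((uint64_t) XLOG_BLCKSZ * NUM_BUFFERS));
                                                          /* wraps to 0 */
    return 0;
}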

We could probably optimize that so that you don't need to actually
write() and fsync() all the zeros, perhaps by setting a flag on the WAL
buffer to indicate that it only contains padding for an xlog-switch.
However, I don't see any easy way to avoid blowing the cache.

I'm thinking that xlog-switching happens so seldom, and typically on a
fairly idle system, that we don't need to optimize it much. I guess we
should measure the impact, though.

+	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+	 * if the xlog switch had no work to do, ie. if we were already at the
+	 * beginning of a new XLOG segment. You can check if RecPtr points to
+	 * beginning of a segment if you want to keep the distinction.
+	 */
+	TRACE_POSTGRESQL_XLOG_SWITCH();

If so, shouldn't RecPtr (or a flag indicating whether the xlog switch
has no work to do) be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an
argument?

I think I'll just move that call, so that the current behavior is retained.

The followings are trivial comments:

Thanks, fixed these!

On 22.02.2012 03:34, Fujii Masao wrote:

When I ran the long-running performance test, I encountered the following
panic error.

PANIC: could not find WAL buffer for 0/FF000000

0/FF000000 is the xlog file boundary, so the patch seems to handle
the xlog file boundary incorrectly. In the patch, current insertion lsn
is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
But to handle the xlog file boundary correctly, we should use
XLByteAdvance() for that, instead?

Thanks, fixed this, too.

I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so
that it grabs the insertpos_lck to read nextslot. I previously thought
that was not necessary, assuming that reading/writing an int32 is
atomic, but I'm afraid there might be memory-ordering issues where the
CurrPos of the most recent slot has not become visible to other backends
yet, while the advancing of nextslot has.
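
A sketch of the protocol that closes that window (hypothetical
stand-in code, with a pthread mutex playing the role of insertpos_lck):

#include <pthread.h>
#include <stdint.h>

#define NUM_SLOTS 1000

typedef struct
{
    uint64_t CurrPos;      /* position this insertion is writing to */
} Slot;

static pthread_mutex_t insertpos_lck = PTHREAD_MUTEX_INITIALIZER;
static Slot    slots[NUM_SLOTS];
static int32_t nextslot;

/* Reserver: publish the slot's position, then advance nextslot, both
 * inside the lock, so the two stores can never be observed out of
 * order by anyone who also takes the lock. */
static int32_t
reserve_slot(uint64_t pos)
{
    int32_t mine;

    pthread_mutex_lock(&insertpos_lck);
    mine = nextslot;
    slots[mine].CurrPos = pos;
    nextslot = (mine == NUM_SLOTS - 1) ? 0 : mine + 1;
    pthread_mutex_unlock(&insertpos_lck);
    return mine;
}

/* Waiter: reading nextslot under the same lock guarantees it also sees
 * the CurrPos of every slot reserved before that value of nextslot. */
static int32_t
read_nextslot(void)
{
    int32_t n;

    pthread_mutex_lock(&insertpos_lck);
    n = nextslot;
    pthread_mutex_unlock(&insertpos_lck);
    return n;
}

int
main(void)
{
    reserve_slot(8192);
    return read_nextslot() == 1 ? 0 : 1;
}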

That particular issue would be very hard to hit in practice, so I don't
know if this could explain the recovery failures that Jeff saw. I got
the test script running (thanks for that Jeff!), but unfortunately have
not seen any failures yet (aside from the issue with crossing the xlogid
boundary), with either this or the older version of the patch.

Attached is a new version of the patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-10.patchtext/x-diff; name=xloginsert-scale-10.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 290,315 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There are actually three shared-memory copies of
!  * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:
!  *		XLogCtl->LogwrtResult is protected by info_lck
!  *		XLogCtl->Write.LogwrtResult is protected by WALWriteLock
!  *		XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
!  * One must hold the associated lock to read or write any of these, but
!  * of course no lock is needed to read/write the unshared LogwrtResult.
!  *
!  * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
!  * right", since both are updated by a write or flush operation before
!  * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult
!  * is that it can be examined/modified by code that already holds WALWriteLock
!  * without needing to grab info_lck as well.
!  *
!  * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
!  * but is updated when convenient.	Again, it exists for the convenience of
!  * code that is already holding WALInsertLock but not the other locks.
!  *
!  * The unshared LogwrtResult may lag behind any or all of these, and again
!  * is updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
--- 291,301 ----
   * slightly different functions.
   *
   * We do a lot of pushups to minimize the amount of access to lockable
!  * shared memory values.  There is one shared-memory copy of LogwrtResult,
!  * plus one unshared copy in each backend. To read the shared copy, you need
!  * to hold info_lck *or* WALWriteLock. To update it, you need to hold both
!  * locks. The unshared LogwrtResult may lag behind the shared copy, and is
!  * updated when convenient.
   *
   * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
   * (protected by info_lck), but we don't need to cache any copies of it.
***************
*** 319,328 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 305,319 ----
   * values is "more up to date".
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 334,339 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 325,397 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible,
+  *    insertpos_lck can be heavily contended on a busy system
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock. nextslot == lastslot means that
+  * all the slots are empty.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is also handled by WaitXLogInsertionsToFinish().
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 354,364 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogwrtResult LogwrtResult; /* a recent value of LogwrtResult */
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 412,441 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment. The
! 	 * next record will be inserted there (or somewhere after it if there's
! 	 * not enough space on the current page). PrevRecord points to the
! 	 * beginning of the last record already reserved. It might not be fully
! 	 * copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
! 
! 	/*
! 	 * padding to push RedoRecPtr and forcePageWrites, which rarely change,
! 	 * to a different cache line than the rapidly-changing CurrPos and
! 	 * PrevRecord values. XXX: verify if this makes any difference
! 	 */
! 	char		padding[128];
! 
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 388,406 **** typedef struct XLogCtlInsert
   */
  typedef struct XLogCtlWrite
  {
- 	XLogwrtResult LogwrtResult; /* current value of LogwrtResult */
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
--- 465,498 ----
   */
  typedef struct XLogCtlWrite
  {
  	int			curridx;		/* cache index of next block to write */
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	1000
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot *XLogInsertSlots;
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
***************
*** 414,422 **** typedef struct XLogCtlData
  	XLogCtlWrite Write;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 506,524 ----
  	XLogCtlWrite Write;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock. To change the identity of a buffer that's
+ 	 * still dirty, the old page needs to be written out first, and for that
+ 	 * you need WALWriteLock, and you need to ensure that there's no
+ 	 * in-progress insertions to the page by calling
+ 	 * WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* current (latest) block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 494,521 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
- /* Construct XLogRecPtr value for current insertion point */
- #define INSERT_RECPTR(recptr,Insert,curridx)  \
- 	( \
- 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
- 	  (recptr).xrecoff = \
- 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
- 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 596,626 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  
! #define NextSlotNo(idx)		\
! 		(((idx) == NumXLogInsertSlots - 1) ? 0 : ((idx) + 1))
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
! 
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 641,649 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 746,754 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 690,695 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 795,816 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  XLogInsertSlot *myslot,
+ 				  XLogRecPtr StartPos, XLogRecPtr EndPos);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ static XLogRecPtr AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 710,721 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
- 	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 831,836 ----
***************
*** 729,739 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	uint32		len,
  				write_len;
  	unsigned	i;
- 	bool		updrqst;
  	bool		doPageWrites;
! 	bool		isLogSwitch = false;
! 	bool		fpwChange = false;
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 844,858 ----
  	uint32		len,
  				write_len;
  	unsigned	i;
  	bool		doPageWrites;
! 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
+ 	XLogRecPtr	PrevRecord;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	EndPos;
+ 	XLogInsertSlot *myslot;
+ 	bool		updrqst;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 746,778 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * Handle special cases/records.
  	 */
! 	if (rmid == RM_XLOG_ID)
  	{
! 		switch (info)
! 		{
! 			case XLOG_SWITCH:
! 				isLogSwitch = true;
! 				break;
! 
! 			case XLOG_FPW_CHANGE:
! 				fpwChange = true;
! 				break;
! 
! 			default:
! 				break;
! 		}
! 	}
! 	else if (IsBootstrapProcessingMode())
! 	{
! 		/*
! 		 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 		 * return a phony record pointer.
! 		 */
! 		RecPtr.xlogid = 0;
! 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return RecPtr;
  	}
  
  	/*
--- 865,878 ----
  	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
  
  	/*
! 	 * In bootstrap mode, we don't actually log anything but XLOG resources;
! 	 * return a phony record pointer.
  	 */
! 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		EndPos.xlogid = 0;
! 		EndPos.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return EndPos;
  	}
  
  	/*
***************
*** 939,1071 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	START_CRIT_SECTION();
  
! 	/* Now wait to get insert lock */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
  	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
  		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
  		}
- 	}
  
! 	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
! 	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
! 	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
  	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
  
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
  
! 		END_CRIT_SECTION();
  
! 		return RecPtr;
  	}
  
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 1039,1146 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	/*
! 	 * Construct record header. We can't CRC it yet, because the prev-link
! 	 * needs to be covered by the CRC and we don't know that yet. We will
! 	 * finish computing the CRC when we do.
! 	 */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set later */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to reserve space for the record from the WAL.
  	 */
! 	if (!ReserveXLogInsertLocation(write_len, doPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
  	{
! 		END_CRIT_SECTION();
  
! 		/*
! 		 * If the record is an XLOG_SWITCH, and we were exactly at the start
! 		 * of a segment, we need not insert it (and don't want to because we'd
! 		 * like consecutive switch requests to be no-ops).  Instead, make sure
! 		 * everything is written and flushed through the end of the prior
! 		 * segment, and return the prior segment's end address.
! 		 */
! 		if (isLogSwitch && !XLogRecPtrIsInvalid(EndPos))
  		{
! 			XLogFlush(EndPos);
! 			return EndPos;
  		}
  
! 		/*
! 		 * Oops, must redo it with full-page data. Unlink the backup blocks
! 		 * from the chain and reset info bitmask to undo the changes we've
! 		 * done.
! 		 */
  		rdt_lastnormal->next = NULL;
  		info = info_orig;
  		goto begin;
  	}
! 	else
  	{
! 		/*
! 		 * Finish the record header by setting prev-link (now that we know it),
! 		 * and finish computing the record's CRC (in CopyXLogRecordToWAL). Then
! 		 * copy the record to the space we reserved.
! 		 */
! 		rechdr.xl_prev = PrevRecord;
! 		CopyXLogRecordToWAL(write_len, isLogSwitch, &rechdr,
! 							rdata, rdata_crc, myslot, StartPos, EndPos);
  	}
  
! 	END_CRIT_SECTION();
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
! 	/*
! 	 * If this was an XLOG_SWITCH record, flush the record and the empty
! 	 * padding space that fills the rest of the segment, and perform
! 	 * end-of-segment actions (eg, notifying archiver).
! 	 */
! 	if (isLogSwitch)
! 	{
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		XLogFlush(EndPos);
  
! 		/*
! 		 * Even though we reserved the rest of the segment for us, which is
! 		 * reflected in EndPos, we return a pointer to just the end of the
! 		 * xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
  	}
  
! 	/*
! 	 * Update our global variables
! 	 */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1074,1267 **** begin:;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
- 	/* Record begin of record in appropriate places */
- 	ProcLastRecPtr = RecPtr;
- 	Insert->PrevRecord = RecPtr;
- 
- 	Insert->currpos += SizeOfXLogRecord;
- 	freespace -= SizeOfXLogRecord;
- 
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
  	{
! 		XLogCtlWrite *Write = &XLogCtl->Write;
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == Write->curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
  		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
! 		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
! 		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		Write->LogwrtResult = LogwrtResult;
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
! 	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
! 		{
! 			/* curridx is filled and available for writing out */
! 			updrqst = true;
! 		}
! 		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
  	/*
! 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
! 	 * in shared memory before releasing WALInsertLock. This ensures that
! 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
! 	 * by this change of full_page_writes.
  	 */
! 	if (fpwChange)
! 		Insert->fullPageWrites = fullPageWrites;
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1149,1767 ----
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 EndPos.xlogid, EndPos.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	return EndPos;
! }
  
! /*
!  * Subroutine of XLogInsert. Copies a WAL record to an already-reserved area
!  * in the WAL.
!  */
! static void
! CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 					XLogRecData *rdata, pg_crc32 rdata_crc,
! 					XLogInsertSlot *myslot_p,
! 					XLogRecPtr StartPos, XLogRecPtr EndPos)
! {
! 	volatile XLogInsertSlot *myslot = myslot_p;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	CurrPos;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 
! 	/* Copy the record header in place, and finish calculating CRC */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	XLByteAdvance(CurrPos, SizeOfXLogRecord);
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		freespace = INSERT_FREESPACE(CurrPos);
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue on the next page.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				XLByteAdvance(CurrPos, freespace);
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Advertise that as our CurrPos before
! 				 * calling GetXLogBuffer(), because GetXLogBuffer() might need
! 				 * to wait for some insertions to finish so that it can write
! 				 * out a buffer to make room for the new page. Updating CurrPos
! 				 * before waiting for a new buffer ensures that we don't
! 				 * deadlock with ourselves if we run out of clean buffers.
! 				 *
! 				 * However, we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/*
! 				 * Get pointer to beginning of next page, and set the
! 				 * XLP_FIRST_IS_CONTRECORD flag in the page header.
! 				 *
! 				 * It's safe to set the contrecord flag without a lock on the
! 				 * page. All the other flags are set in AdvanceXLInsertBuffer,
! 				 * and we're the only backend that needs to set the contrecord
! 				 * flag.
! 				 */
! 				currpos = GetXLogBuffer(CurrPos);
! 				((XLogPageHeader) currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 
! 				/* skip over the page header, and write continuation record */
! 				CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 				currpos = GetXLogBuffer(CurrPos);
! 
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			XLByteAdvance(CurrPos, rdata->len);
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		if (CurrPos.xrecoff >= XLogFileSize)
! 		{
! 			/* crossed a logid boundary */
! 			CurrPos.xlogid += 1;
! 			CurrPos.xrecoff = 0;
! 		}
! 		Assert(XLByteEQ(CurrPos, EndPos));
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
  
! 		/*
! 		 * An xlog-switch record consumes all the remaining space on the
! 		 * WAL segment. We have already reserved it for us, but we still need
! 		 * to make sure it's been allocated and zeroed in the WAL buffers so
! 		 * that when the caller (or someone else) does XLogWrite(), it can
! 		 * really write out all the zeros.
! 		 *
! 		 * We do this one page at a time, to make sure we don't deadlock
! 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
! 		 */
! 		while (XLByteLT(CurrPos, EndPos))
! 		{
! 			/* use up all the remaining space in this page */
! 			freespace = INSERT_FREESPACE(CurrPos);
! 			XLByteAdvance(CurrPos, freespace);
! 			/*
! 			 * like in the non-xlog-switch codepath, let others know that
! 			 * we're done writing up to the end of this page
! 			 */
! 			UpdateSlotCurrPos(myslot, CurrPos);
! 			/*
! 			 * let GetXLogBuffer initialize next page if necessary.
! 			 */
! 			CurrPos = AdvanceXLogRecPtrToNextPage(CurrPos);
! 			(void) GetXLogBuffer(CurrPos);
! 		}
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
! }
! 
! /*
!  * Reserves the right amount of space in the WAL for a record of the given
!  * size. *StartPos_p is set to the beginning of the reserved section,
!  * *EndPos_p to its end, and *PrevRecord_p is set to the beginning of the
!  * previous record, to be used as the prev-link in the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one; in that case the caller should update the shared
!  * LogwrtRqst value after it's done inserting the record, so that the WAL
!  * page that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
!  * (or the end of the previous record, to be exact) to let others know that
!  * we're busy inserting into the reserved area. The caller must clear it when
!  * the insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed, or if an xlog-switch record turned out to be unnecessary. On
!  * failure, the shared state is not modified.
!  *
!  * This is the performance-critical part of XLogInsert that must be serialized
!  * across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool didPageWrites,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
! {
! 	volatile XLogInsertSlot *myslot;
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	LastEndPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
! 
! 	/* log-switch records should contain no data */
! 	Assert(!isLogSwitch || size == 0);
! 
! 	size = SizeOfXLogRecord + size;
! 
! retry:
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
! 		(!didPageWrites && (Insert->forcePageWrites || Insert->fullPageWrites)))
! 	{
! 		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
! 		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
  
  	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
  	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
  	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitXLogInsertionsToFinish(InvalidXLogRecPtr);
! 		goto retry;
! 	}
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	nextslot = NextSlotNo(nextslot);
  
! 	/*
! 	 * Got the slot, now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	LastEndPos = ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		ptr = AdvanceXLogRecPtrToNextPage(ptr);
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
! 	/*
! 	 * We are now at the starting position of our record. Figure out how
! 	 * the data will be split across the WAL pages, to calculate where the
! 	 * record ends.
! 	 */
! 	StartPos = ptr;
  
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start
! 		 * of a segment, we need not insert it (and don't want to because
! 		 * we'd like consecutive switch requests to be no-ops). Otherwise the
! 		 * XLOG_SWITCH record should consume all the remaining space on the
! 		 * current segment.
  		 */
! 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
  
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
! 
! 			return false;
! 		}
! 		else
  		{
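! 			/* Consume all the remaining space up to the next segment boundary */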
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
! 		}
! 	}
! 	else
! 	{
! 		/* A normal record, ie. not xlog-switch */
! 		int sizeleft = size;
! 		while (freespace < sizeleft)
! 		{
! 			/* fill this page, and continue on next page */
! 			sizeleft -= freespace;
! 			ptr = AdvanceXLogRecPtrToNextPage(ptr);
  
! 			/* account for continuation record header */
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 			freespace = INSERT_FREESPACE(ptr);
! 
! 			updrqst = true;
  		}
+ 		/* the rest fits on this page */
+ 		ptr.xrecoff += sizeleft;
  
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 		if (ptr.xrecoff >= XLogFileSize)
! 		{
! 			/* crossed a logid boundary */
! 			ptr.xlogid += 1;
! 			ptr.xrecoff = 0;
! 		}
! 	}
  
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot->CurrPos = LastEndPos;
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = nextslot;
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
  
! 	return true;
! }
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Does a function call act
! 	 * as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
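! 	/* Wake up everyone that was waiting for this insertion to finish */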
! 	while (head != NULL)
! 	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
  	}
! }
! 
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLogInsertSlot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * if we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the write
!  * if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that we can easily calculate the
! 	 * buffer a given page must be loaded into from the XLogRecPtr alone.
! 	 * A page must always be loaded to a particular buffer.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read",
! 	 * and read a bogus value. That's ok, we'll grab the mapping lock (in
! 	 * AdvanceXLInsertBuffer) and retry if we see anything other than the page
! 	 * we're looking for. But it means that when we do this unlocked read, we
! 	 * might see a value that appears to be ahead of the page we're looking
! 	 * for. So don't PANIC on that, until we've verified the value while
! 	 * holding the lock.
! 	 */
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (ptr.xlogid != endptr.xlogid ||
! 		!(ptr.xrecoff < endptr.xrecoff &&
! 		  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
  	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
  
! 		if (ptr.xlogid != endptr.xlogid ||
! 			!(ptr.xrecoff < endptr.xrecoff &&
! 			  ptr.xrecoff >= endptr.xrecoff - XLOG_BLCKSZ))
  		{
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
  		}
  	}
  
  	/*
! 	 * Found the buffer holding this page. Return a pointer to the right
! 	 * offset within the page.
  	 */
! 	return (char *) XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
! 		ptr.xrecoff % XLOG_BLCKSZ;
! }
  
! /*
!  * Advance an XLogRecPtr to the first valid insertion location on the next
!  * page, right after the page header. An XLogRecPtr pointing to a boundary,
!  * ie. the first byte of a page, is taken to belong to the previous page.
!  */
! static XLogRecPtr
! AdvanceXLogRecPtrToNextPage(XLogRecPtr ptr)
! {
! 	int			freespace;
  
! 	freespace = INSERT_FREESPACE(ptr);
! 	XLByteAdvance(ptr, freespace);
! 	if (ptr.xrecoff % XLogSegSize == 0)
! 		ptr.xrecoff += SizeOfXLogLongPHD;
! 	else
! 		ptr.xrecoff += SizeOfXLogShortPHD;
! 
! 	return ptr;
! }
! 
! /*
!  * Wait for any insertions < upto to finish. If upto is invalid, we wait until
!  * at least one slot is available for insertion.
!  *
!  * Returns a value >= upto, which indicates the oldest in-progress insertion
!  * that we saw in the array, or InvalidXLogRecPtr if there are no insertions
!  * in-progress at exit.
!  */
! static XLogRecPtr
! WaitXLogInsertionsToFinish(XLogRecPtr upto)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 	XLogRecPtr	LastPos = InvalidXLogRecPtr;
! 	int			extraWaits = 0;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! retry:
! 	/*
! 	 * Read lastslot and nextslot. lastslot cannot change while we hold the
! 	 * tail-lock. nextslot can advance while we run, but not beyond
! 	 * lastslot - 1. We still have to acquire insertpos_lck to make sure that
! 	 * we see the CurrPos of the latest slot correctly.
! 	 */
! 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 	lastslot = Insert->lastslot;
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	nextslot = Insert->nextslot;
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	while (lastslot != nextslot)
  	{
! 		/*
! 		 * Examine the oldest slot still in use.
! 		 */
! 		volatile XLogInsertSlot *slot;
! 		XLogRecPtr	slotptr;
  
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
! 
! 		if (XLogRecPtrIsInvalid(slotptr))
! 		{
! 			/*
! 			 * The insertion has already finished, we just need to advance
! 			 * lastslot to make the slot available for reuse.
! 			 */
! 			SpinLockRelease(&slot->lck);
! 			lastslot = NextSlotNo(lastslot);
! 			continue;
! 		}
! 		else
! 		{
! 			/*
! 			 * The insertion is still in-progress. If we just needed any
! 			 * slot to become available and there is at least one slot
! 			 * free already, or if this slot's CurrPos >= upto, we can
! 			 * stop here. Otherwise we have to wait for it to finish.
! 			 */
! 			if ((XLogRecPtrIsInvalid(upto) && NextSlotNo(nextslot) != lastslot)
! 				|| (!XLogRecPtrIsInvalid(upto) && XLByteLE(upto, slotptr)))
! 			{
! 				SpinLockRelease(&slot->lck);
! 				LastPos = slotptr;
! 				break;
! 			}
! 			else
! 			{
! 				/* Wait for this insertion to finish. */
! 				MyProc->lwWaiting = true;
! 				MyProc->lwWaitMode = 0; /* doesn't matter */
! 				MyProc->lwWaitLink = NULL;
! 				if (slot->head == NULL)
! 					slot->head = MyProc;
! 				else
! 					slot->tail->lwWaitLink = MyProc;
! 				slot->tail = MyProc;
! 				SpinLockRelease(&slot->lck);
! 
! 				Insert->lastslot = lastslot;
! 				LWLockRelease(WALInsertTailLock);
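! 				/*
! 				 * Sleep on our semaphore until UpdateSlotCurrPos wakes us
! 				 * up. As in lwlock.c, absorb any extra wakeups and count
! 				 * them in extraWaits, to be re-issued before returning.
! 				 */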
! 				for (;;)
! 				{
! 					PGSemaphoreLock(&MyProc->sem, false);
! 					if (!MyProc->lwWaiting)
! 						break;
! 					extraWaits++;
! 				}
! 
! 				/*
! 				 * The insertion has now finished. Start all over. While we
! 				 * were not holding the tail-lock, someone might've filled up
! 				 * all slots again.
! 				 */
! 				goto retry;
! 			}
! 		}
  	}
  
! 	/* Update lastslot before we release the lock */
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
  
! 	while (extraWaits-- > 0)
! 		PGSemaphoreUnlock(&MyProc->sem);
  
! 	return LastPos;
  }
  
  /*
***************
*** 1488,1522 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	XLogCtlWrite *Write = &XLogCtl->Write;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
! 	/* Use Insert->LogwrtResult copy if it's more fresh */
! 	if (XLByteLT(LogwrtResult.Write, Insert->LogwrtResult.Write))
! 		LogwrtResult = Insert->LogwrtResult;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 1988,2021 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	bool		needflush;
+ 	int			npages = 0;
  
! 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Now that we have the lock, check if someone initialized the page
! 	 * already.
! 	 */
! /* XXX: fix indentation before commit */
! while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
! {
! 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1524,1535 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
- 	{
- 		/* nope, got work to do... */
- 		XLogRecPtr	FinishedPageRqstPtr;
  
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2023,2039 ----
  	 * written out.
  	 */
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  
! 	needflush = !XLByteLE(OldPageRqstPtr, LogwrtResult.Write);
! 
! 	if (needflush)
! 	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1537,1581 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		update_needed = false;	/* Did the shared-request update */
! 
! 		if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* OK, someone wrote it already */
! 			Insert->LogwrtResult = LogwrtResult;
! 		}
! 		else
! 		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = Write->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding insert lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
- 				Insert->LogwrtResult = LogwrtResult;
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2041,2090 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 			{
! 				Assert(XLByteLE(OldPageRqstPtr, xlogctl->Insert.CurrPos));
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
! 			}
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
! 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  			{
  				/* OK, someone wrote it already */
  				LWLockRelease(WALWriteLock);
  			}
  			else
  			{
  				/*
! 				 * Have to write buffers while holding mapping lock. This is
  				 * not good, so only write as much as we absolutely must.
  				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1583,1596 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2092,2098 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1600,1612 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
! 
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
  
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2102,2111 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1650,1660 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2149,2176 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, up to %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
  
  /*
***************
*** 1699,1714 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2215,2226 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst.Write) before grabbing the lock to
!  * make sure the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1726,1732 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = Write->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
--- 2238,2244 ----
  	/*
  	 * Update local LogwrtResult (caller probably did this already, but...)
  	 */
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	/*
  	 * Since successive pages in the xlog cache are consecutively allocated,
***************
*** 1757,1770 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2269,2282 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1861,1876 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2373,2385 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 1960,1967 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
- 
- 	Write->LogwrtResult = LogwrtResult;
  }
  
  /*
--- 2469,2474 ----
***************
*** 2124,2131 **** XLogFlush(XLogRecPtr record)
  	 */
  	for (;;)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
--- 2631,2642 ----
  	 */
  	for (;;)
  	{
! 		/* use volatile pointers to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos,
+ 					inprogresspos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2139,2144 **** XLogFlush(XLogRecPtr record)
--- 2650,2686 ----
  			break;
  
  		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock), and
+ 		 * fall back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. Also note that if we stored the XLogRecPtr as one 64-bit
+ 		 * integer, we could just read it with no lock on platforms where
+ 		 * 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
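+ 		/*
+ 		 * If the current insert page has no room for another record, round
+ 		 * the request up to the page boundary, so that the whole page gets
+ 		 * written out.
+ 		 */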
+ 		if (freespace < SizeOfXLogRecord)               /* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		inprogresspos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+ 		if (!XLogRecPtrIsInvalid(inprogresspos))
+ 			insertpos = inprogresspos;
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2155,2186 **** XLogFlush(XLogRecPtr record)
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2697,2709 ----
  			continue;
  		}
  		/* Got the lock */
! 		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2292,2314 **** XLogBackgroundFlush(void)
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
- 
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->Write.LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
! 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2815,2845 ----
  			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  #endif
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
  	{
  		XLogwrtRqst WriteRqst;
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
! 	LogwrtResult = XLogCtl->LogwrtResult;
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5102,5107 **** XLOGShmemSize(void)
--- 5633,5641 ----
  	/* and the buffers themselves */
  	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
  
+ 	/* XLog insertion slots */
+ 	size = add_size(size, mul_size(sizeof(XLogInsertSlot), NumXLogInsertSlots));
+ 
  	/*
  	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
  	 * added by CreateSharedMemoryAndSemaphores.  This lets us use this
***************
*** 5117,5122 **** XLOGShmemInit(void)
--- 5651,5657 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5142,5147 **** XLOGShmemInit(void)
--- 5677,5695 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	XLogCtl->XLogInsertSlots = (XLogInsertSlot *) allocptr;
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 0;
+ 	XLogCtl->Insert.lastslot = 0;
+ 	allocptr += sizeof(XLogInsertSlot) * NumXLogInsertSlots;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5156,5166 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
- 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
--- 5704,5715 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
  
+ 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
+ 
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
  	 * and validate it immediately (see comments in ReadControlFile() for the
***************
*** 6038,6043 **** StartupXLOG(void)
--- 6587,6593 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6844,6851 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7394,7405 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
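! 	/*
! 	 * The WAL buffer that a given page must be loaded into is determined by
! 	 * the page's position in the WAL, so start with the buffer corresponding
! 	 * to the last partial page read from disk.
! 	 */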
! 	firstIdx = XLogRecPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6853,6878 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
- 	XLogCtl->Write.LogwrtResult = LogwrtResult;
- 	Insert->LogwrtResult = LogwrtResult;
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7407,7429 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
  	XLogCtl->LogwrtResult = LogwrtResult;
  
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = XLOG_BLCKSZ - EndOfLog.xrecoff % XLOG_BLCKSZ;
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndOfLog.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6884,6890 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7435,7441 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7390,7396 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 7941,7947 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7666,7671 **** CreateCheckPoint(int flags)
--- 8217,8223 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7734,7743 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8286,8295 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * Determine the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
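! 	/*
! 	 * NB: insertpos_lck is held until the new checkpoint REDO pointer has
! 	 * been determined and published in shared memory, below.
! 	 */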
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7749,7755 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8301,8307 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7758,7772 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8310,8321 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7793,7806 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
--- 8342,8351 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
! 		curInsert = AdvanceXLogRecPtrToNextPage(curInsert);
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
***************
*** 7826,7832 **** CreateCheckPoint(int flags)
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8371,8377 ----
! 	 * Now we can release the insert position lock, allowing other xacts to
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7846,7852 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8391,8397 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8213,8227 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8758,8772 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), you need both insertpos_lck and info_lck
! 	 * to update it, although during recovery acquiring insertpos_lck is just
! 	 * pro forma, because no WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
! 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8412,8418 **** XLogPutNextOid(Oid nextOid)
  XLogRecPtr
  RequestXLogSwitch(void)
  {
! 	XLogRecPtr	RecPtr;
  	XLogRecData rdata;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
--- 8957,8963 ----
  XLogRecPtr
  RequestXLogSwitch(void)
  {
! 	XLogRecPtr	EndPos;
  	XLogRecData rdata;
  
  	/* XLOG SWITCH, alone among xlog record types, has no data */
***************
*** 8421,8429 **** RequestXLogSwitch(void)
  	rdata.len = 0;
  	rdata.next = NULL;
  
! 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
! 	return RecPtr;
  }
  
  /*
--- 8966,8979 ----
  	rdata.len = 0;
  	rdata.next = NULL;
  
! 	EndPos = XLogInsert(RM_XLOG_ID, XLOG_SWITCH, &rdata);
  
! 	/*
! 	 * For an xlog-switch record, XLogInsert already adjusts the pointer it
! 	 * returns to point to just the end of the xlog-switch record, rather
! 	 * than the end of the segment; the rest of the segment is padded out.
! 	 */
! 	return EndPos;
  }
  
  /*
***************
*** 8501,8522 **** XLogReportParameters(void)
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we can guarantee that there is no concurrently running
! 	 * process which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
--- 9051,9091 ----
  /*
   * Update full_page_writes in shared memory, and write an
   * XLOG_FPW_CHANGE record if necessary.
+  *
+  * Note: this function assumes there is no other process running
+  * concurrently that could update it.
   */
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
  	 *
  	 * It's safe to check the shared full_page_writes without the lock,
! 	 * because we assume that there is no concurrently running process
! 	 * which can update it.
  	 */
  	if (fullPageWrites == Insert->fullPageWrites)
  		return;
  
+ 	START_CRIT_SECTION();
+ 
+ 	/*
+ 	 * It's always safe to take full page images, even when not strictly
+ 	 * required, but not the other way round. So if we're setting full_page_writes
+ 	 * to true, first set it true and then write the WAL record. If we're
+ 	 * setting it to false, first write the WAL record and then set the
+ 	 * global flag.
+ 	 */
+ 	if (fullPageWrites)
+ 	{
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		Insert->fullPageWrites = true;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 	}
+ 
  	/*
  	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
  	 * track of full_page_writes during archive recovery, if required.
***************
*** 8532,8543 **** UpdateFullPageWrites(void)
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 	else
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		Insert->fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
  /*
--- 9101,9114 ----
  
  		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
  	}
! 
! 	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
! 		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
+ 	END_CRIT_SECTION();
  }
  
  /*
***************
*** 9063,9068 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9634,9640 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9125,9150 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9697,9722 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9257,9269 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9829,9841 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9347,9373 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		Assert(XLogCtl->Insert.exclusiveBackup);
! 		XLogCtl->Insert.exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 9919,9946 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		Assert(Insert->exclusiveBackup);
! 		Insert->exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9380,9385 **** pg_start_backup_callback(int code, Datum arg)
--- 9953,9959 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9433,9441 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10007,10015 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9444,9459 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10018,10033 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9731,9746 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10305,10322 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9794,9805 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10370,10381 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1753,1761 **** GetOldestActiveTransactionId(void)
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
!  * it's already inserted its commit record.  Whether it takes a little while
!  * for clearing of inCommit to propagate is unimportant for correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
--- 1753,1762 ----
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes a lock to write the WAL record, so we cannot fail to see an xact as
!  * inCommit if it's already inserted its commit record.  Whether it takes a
!  * little while for clearing of inCommit to propagate is unimportant for
!  * correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#56)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 21.02.2012 13:19, Fujii Masao wrote:

In some places, the spinlock "insertpos_lck" is taken while another
spinlock "info_lck" is being held. Is this OK? What if unfortunately
inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding a lwlock longer, but
nevertheless I don't think that's a problem.

No, that's NOT okay. A spinlock is only supposed to be held across a
short straight-line sequence of instructions. Something that could
involve a spin loop, or worse a sleep() kernel call, is right out.
Please change this.

regards, tom lane

#58Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#57)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Mar 6, 2012 at 2:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 21.02.2012 13:19, Fujii Masao wrote:

In some places, the spinlock "insertpos_lck" is taken while another
spinlock "info_lck" is being held. Is this OK? What if unfortunately
inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding a lwlock longer, but
nevertheless I don't think that's a problem.

No, that's NOT okay.  A spinlock is only supposed to be held across a
short straight-line sequence of instructions.

It also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock....

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#59Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#56)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Mar 6, 2012 at 1:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

+        * An xlog-switch record consumes all the remaining space on the
+        * WAL segment. We have already reserved it for us, but we still
need
+        * to make sure it's been allocated and zeroed in the WAL buffers
so
+        * that when the caller (or someone else) does XLogWrite(), it can
+        * really write out all the zeros.

Why do we need to write out all the remaining space with zeros? In
current master, we don't do that. A recovery code ignores the data
following XLOG_SWITCH record, so I don't think that's required.

It's to keep the logic simpler. Before the patch, an xlog-switch simply
initialized the next insertion page in the WAL buffers to be the first
page of the next segment. With this patch, we rely on a simple linear
mapping from an XLogRecPtr to the WAL buffer that should contain that page
(see XLogRecPtrToBufIdx()). Such a mapping is not possible if you sometimes
skip over pages in the WAL buffers, so we allocate the buffers for those
empty pages, too. Note that this means that an xlog-switch can blow through
all your WAL buffers.
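
To illustrate, here's a minimal sketch of that mapping as a standalone
function, mirroring the XLogRecPtrToBufIdx() macro in the patch. The
two-field XLogRecPtr layout and the XLogFileSize value are the pre-9.3
ones, and num_buffers stands in for XLogCtl->XLogCacheBlck + 1:

#include <stdint.h>

/* Pre-9.3 style WAL position: (log file id, byte offset within it). */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} XLogRecPtr;

#define XLOG_BLCKSZ		8192			/* WAL page size (default build) */
#define XLOG_FILE_SZ	0xFF000000u		/* bytes per xlogid (XLogFileSize) */

/*
 * Map a WAL position to the buffer slot that must hold its page.
 * Because the mapping is a pure function of the byte position, no page
 * can ever be skipped in the buffer ring - hence the zero-padding.
 */
static int
recptr_to_buf_idx(XLogRecPtr ptr, int num_buffers)
{
	uint64_t	bytepos = (uint64_t) ptr.xlogid * XLOG_FILE_SZ + ptr.xrecoff;

	return (int) ((bytepos / XLOG_BLCKSZ) % num_buffers);
}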

We could probably optimize that so that you don't need to actually write()
and fsync() all the zeros, perhaps by setting a flag on the WAL buffer to
indicate that it only contains padding for an xlog-switch. However, I don't
see any easy way to avoid blowing the cache.

I'm thinking that xlog-switching happens so seldom, and typically on a
fairly idle system, that we don't need to optimize it much. I guess we
should measure the impact, though.

An online backup, which forces an xlog-switch twice, might be performed
under a certain amount of load. So, to avoid a performance spike during
online backup, I think the optimization of skipping the write() and
fsync() of the padding bytes would be helpful.

On 22.02.2012 03:34, Fujii Masao wrote:

When I ran the long-running performance test, I encountered the following
panic error.

     PANIC:  could not find WAL buffer for 0/FF000000

0/FF000000 is the xlog file boundary, so the patch seems to handle
the xlog file boundary incorrectly. In the patch, current insertion lsn
is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
But to handle the xlog file boundary correctly, we should use
XLByteAdvance() for that, instead?

Thanks, fixed this, too.
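
For reference, the boundary-safe advance looks roughly like this - a
sketch of what XLByteAdvance() does, rewritten as a function (the exact
macro body in xlog_internal.h may differ):

#include <stdint.h>

typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} XLogRecPtr;

#define XLOG_FILE_SZ	0xFF000000u		/* XLogFileSize: bytes per xlogid */

/*
 * Advance *ptr by nbytes, carrying into xlogid at the file boundary.
 * Incrementing xrecoff alone, as the patch did, produces an invalid
 * pointer right at 0/FF000000.
 */
static void
advance_recptr(XLogRecPtr *ptr, uint32_t nbytes)
{
	if (ptr->xrecoff + nbytes >= XLOG_FILE_SZ)
	{
		ptr->xlogid += 1;
		ptr->xrecoff = ptr->xrecoff + nbytes - XLOG_FILE_SZ;
	}
	else
		ptr->xrecoff += nbytes;
}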

I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so
that it grabs the insertpos_lck to read nextslot. I previously thought that
was not necessary, assuming that reading/writing an int32 is atomic, but I'm
afraid there might be memory-ordering issues where the CurrPos of the most
recent slot has not become visible to other backends yet, while the
advancing of nextslot has.
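
To spell out the hazard with a toy model (C11 atomics, made-up names -
not the patch's code, which simply takes insertpos_lck to get the same
guarantee): the inserter stores the slot's position and then advances
nextslot, and the reader must not observe the new nextslot before that
store becomes visible. A release/acquire pairing expresses exactly that:

#include <stdatomic.h>
#include <stdint.h>

/* Toy model of the ordering requirement; names are hypothetical. */
typedef struct
{
	uint64_t	currpos;
} Slot;

static Slot slots[512];
static atomic_int nextslot;

static void
reserve_slot(int n, uint64_t pos)
{
	slots[n].currpos = pos;
	/* release: the currpos store becomes visible before the counter */
	atomic_store_explicit(&nextslot, n + 1, memory_order_release);
}

static int
read_slots(uint64_t *pos_out)
{
	/* acquire: pairs with the release above, making currpos visible */
	int			n = atomic_load_explicit(&nextslot, memory_order_acquire);

	if (n > 0)
		*pos_out = slots[n - 1].currpos;
	return n;
}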

That particular issue would be very hard to hit in practice, so I don't know
if this could explain the recovery failures that Jeff saw. I got the test
script running (thanks for that Jeff!), but unfortunately have not seen any
failures yet (aside from the issue with crossing xlogid boundary), with
either this or the older version of the patch.

Attached is a new version of the patch.

Thanks for the new patch!

In this new patch, I was again able to reproduce the assertion failure
which I described upthread.
http://archives.postgresql.org/message-id/CAHGQGwGRuNJ%3D_ctXwteNkFkdvMDNFYxFdn0D1cd-CqL0OgNCLg%40mail.gmail.com

$ uname -a
Linux hermes 3.0.0-16-generic #28-Ubuntu SMP Fri Jan 27 17:50:54 UTC
2012 i686 i686 i386 GNU/Linux

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#58)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 06.03.2012 14:52, Fujii Masao wrote:

On Tue, Mar 6, 2012 at 2:17 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 21.02.2012 13:19, Fujii Masao wrote:

In some places, the spinlock "insertpos_lck" is taken while another
spinlock "info_lck" is being held. Is this OK? What if unfortunately
inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding a lwlock longer, but
nevertheless I don't think that's a problem.

No, that's NOT okay. A spinlock is only supposed to be held across a
short straight-line sequence of instructions.

Ok, that's easy enough to fix.

It also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock....

I admit that block is longer than any of our existing spinlock blocks.
However, it's important for performance. I tried using a lwlock earlier,
and that negated the gains. So if that's a serious objection, then let's
resolve that now before I spend any more time on other aspects of the
patch. Any ideas how to make that block shorter?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#61Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#60)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Mar 6, 2012 at 10:07 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I admit that block is longer than any of our existing spinlock blocks.
However, it's important for performance. I tried using a lwlock earlier, and
that negated the gains. So if that's a serious objection, then let's resolve
that now before I spend any more time on other aspects of the patch. Any
ideas how to make that block shorter?

We shouldn't put the cart in front of the horse. The point of keeping
spinlock acquisitions short is to improve performance by preventing
excess spinning. If the performance is better with a spinlock than
with an lwlock, then clearly the spinning isn't excessive, or at least
not in the case you tested.

That having been said, shorter critical sections are always good, of course...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#60)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 06.03.2012 14:52, Fujii Masao wrote:

It also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock....

I admit that block is longer than any of our existing spinlock blocks.
However, it's important for performance. I tried using a lwlock earlier,
and that negated the gains. So if that's a serious objection, then let's
resolve that now before I spend any more time on other aspects of the
patch. Any ideas how to make that block shorter?

How long is the current locked code exactly --- does it contain a loop?

I'm not sure where the threshold of pain is for length of time holding a
spinlock. I wouldn't go out of the way to avoid using a spinlock for
say a hundred instructions, at least not unless it was a very
high-contention lock. But sleeping while holding a spinlock is right out.

regards, tom lane

#63Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#62)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 06.03.2012 17:12, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 06.03.2012 14:52, Fujii Masao wrote:

It also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock....

I admit that block is longer than any of our existing spinlock blocks.
However, it's important for performance. I tried using a lwlock earlier,
and that negated the gains. So if that's a serious objection, then let's
resolve that now before I spend any more time on other aspects of the
patch. Any ideas how to make that block shorter?

How long is the current locked code exactly --- does it contain a loop?

Perhaps best if you take a look for yourself, the function is called
ReserveXLogInsertLocation() in patch. It calls a helper function called
AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
It does contain one loop, which iterates once for every WAL page the
record crosses.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#63)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 06.03.2012 17:12, Tom Lane wrote:

How long is the current locked code exactly --- does it contain a loop?

Perhaps best if you take a look for yourself, the function is called
ReserveXLogInsertLocation() in patch. It calls a helper function called
AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
It does contain one loop, which iterates once for every WAL page the
record crosses.

Hm. The loop makes me a tad uncomfortable, because it is possible for
WAL records to be very long (many pages). I see the point that
replacing the spinlock with an LWLock would likely negate any
performance win from this patch, but having other processes arrive and
spin while somebody is busy calculating the size of a multi-megabyte
commit record would be bad too.

What I suggest is that it should not be necessary to crawl forward one
page at a time to figure out how many pages will be needed to store N
bytes worth of WAL data. You're basically implementing a division
problem as repeated subtraction. Getting the extra WAL-segment-start
overhead right would be slightly tricky; but even if you didn't want to
try to make it pure straight-line code, at the very least it seems like
you could set it up so that the loop iterates only once per segment not
page.
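
Something along these lines, perhaps (a sketch only: SIZE_OF_SHORT_PHD
is an assumed continuation-page header size, and the segment-start
long-header adjustment - the slightly tricky part - is omitted):

#include <stdint.h>

#define XLOG_BLCKSZ			8192
#define SIZE_OF_SHORT_PHD	16		/* assumed continuation-page header size */

/*
 * Given 'freespace' bytes left on the current page, compute how many
 * bytes of WAL a record body of 'size' bytes consumes, by division
 * instead of a per-page loop.
 */
static uint64_t
wal_bytes_consumed(uint64_t size, uint64_t freespace)
{
	uint64_t	usable = XLOG_BLCKSZ - SIZE_OF_SHORT_PHD;
	uint64_t	fullpages,
				rest;

	if (size <= freespace)
		return size;			/* common case: fits on the current page */

	size -= freespace;			/* bytes spilling onto later pages */
	fullpages = size / usable;
	rest = size % usable;

	return freespace + fullpages * XLOG_BLCKSZ +
		(rest > 0 ? SIZE_OF_SHORT_PHD + rest : 0);
}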

regards, tom lane

#65Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#64)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Wed, Mar 7, 2012 at 5:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I suggest is that it should not be necessary to crawl forward one
page at a time to figure out how many pages will be needed to store N
bytes worth of WAL data.  You're basically implementing a division
problem as repeated subtraction.  Getting the extra WAL-segment-start
overhead right would be slightly tricky; but even if you didn't want to
try to make it pure straight-line code, at the very least it seems like
you could set it up so that the loop iterates only once per segment not
page.

+1

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#64)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 06.03.2012 17:12, Tom Lane wrote:

How long is the current locked code exactly --- does it contain a loop?

Perhaps best if you take a look for yourself, the function is called
ReserveXLogInsertLocation() in patch. It calls a helper function called
  AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
It does contain one loop, which iterates once for every WAL page the
record crosses.

Hm.  The loop makes me a tad uncomfortable, because it is possible for
WAL records to be very long (many pages).  I see the point that
replacing the spinlock with an LWLock would likely negate any
performance win from this patch, but having other processes arrive and
spin while somebody is busy calculating the size of a multi-megabyte
commit record would be bad too.

I would have thought the existence of a multi-megabyte commit record
would already imply a huge performance effect elsewhere and we
wouldn't care too much about a few spinlock cycles. But I think
they're as rare as Higgs bosons.

If/when such records do exist, it isn't likely to be on a high
transaction rate server. Even allocating ~1 million xids takes long
enough that we wouldn't be expecting a very high commit rate even with
200 concurrent sessions doing this. If such records are rare, then the
minor blip they cause is just a drop in the ocean.

So I think Tom's concern is valid, but the frequency of problems
resulting from it will be so low as to not even be measurable. And
before we fix a perceived performance issue, we really should prove
its existence first, then confirm that this area is the bottleneck
that is slowing such workloads.

So +1 to Heikki keeping the spinlock, as is, and not redesign anything.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#67Alvaro Herrera
alvherre@commandprompt.com
In reply to: Simon Riggs (#66)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Excerpts from Simon Riggs's message of Wed Mar 07 05:35:44 -0300 2012:

On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 06.03.2012 17:12, Tom Lane wrote:

How long is the current locked code exactly --- does it contain a loop?

Perhaps best if you take a look for yourself, the function is called
ReserveXLogInsertLocation() in patch. It calls a helper function called
  AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
It does contain one loop, which iterates once for every WAL page the
record crosses.

Hm.  The loop makes me a tad uncomfortable, because it is possible for
WAL records to be very long (many pages).  I see the point that
replacing the spinlock with an LWLock would likely negate any
performance win from this patch, but having other processes arrive and
spin while somebody is busy calculating the size of a multi-megabyte
commit record would be bad too.

I would have thought the existence of a multi-megabyte commit record
would already imply a huge performance effect elsewhere and we
wouldn't care too much about a few spinlock cycles. But I think
they're as rare as Higgs bosons.

Just to keep things in perspective -- For a commit record to reach one
megabyte, it would have to be a transaction that drops over 43k tables.
Or have 64k smgr inval messages (for example, a TRUNCATE might send half
a dozen of these messages). Or have 262k subtransactions. Or
combinations thereof.

Now admittedly, a page is only 8 kB, so for a commit record to be "many
pages long" (that is, >=3) it would require about 1500 smgr inval
messages, or, say, about 250 TRUNCATEs (of permanent tables with at
least one toastable field and at least one index).

So they are undoubtedly rare. Not sure if as rare as Higgs bosons.
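
The arithmetic behind the message and subxact figures checks out,
assuming the struct sizes of that era (a 16-byte
SharedInvalidationMessage and a 4-byte TransactionId; the per-table
drop cost is not checked here):

#include <assert.h>

int
main(void)
{
	assert(65536 * 16 == 1024 * 1024);	/* 64k inval messages ~ 1 MB */
	assert(262144 * 4 == 1024 * 1024);	/* 262k subxacts ~ 1 MB */
	assert(3 * 8192 / 16 == 1536);		/* >= 3 pages ~ 1500 messages */
	return 0;
}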

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#67)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Alvaro Herrera <alvherre@commandprompt.com> writes:

Just to keep things in perspective -- For a commit record to reach one
megabyte, it would have to be a transaction that drops over 43k tables.
Or have 64k smgr inval messages (for example, a TRUNCATE might send half
a dozen of these messages). Or have 262k subtransactions. Or
combinations thereof.

Now admittedly, a page is only 8 kB, so for a commit record to be "many
pages long" (that is, >=3) it would require about 1500 smgr inval
messages, or, say, about 250 TRUNCATEs (of permanent tables with at
least one toastable field and at least one index).

What about the locks (if running hot-standby)?

So they are undoubtedly rare. Not sure if as rare as Higgs bosons.

Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.

regards, tom lane

#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#68)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Just to keep things in perspective -- For a commit record to reach one
megabyte, it would have to be a transaction that drops over 43k tables.
Or have 64k smgr inval messages (for example, a TRUNCATE might send half
a dozen of these messages). Or have 262k subtransactions.  Or
combinations thereof.

Now admittedly, a page is only 8 kB, so for a commit record to be "many
pages long" (that is, >=3) it would require about 1500 smgr inval
messages, or, say, about 250 TRUNCATEs (of permanent tables with at
least one toastable field and at least one index).

What about the locks (if running hot-standby)?

It's a list of active AccessExclusiveLocks. If that list is long, you
can be sure not much else is happening on the server.

So they are undoubtedly rare. Not sure if as rare as Higgs bosons.

Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.

I agree the effect you point out can exist, I just don't want to slow
down the main case as a result.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#69)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Simon Riggs <simon@2ndQuadrant.com> writes:

On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

So they are undoubtedly rare. Not sure if as rare as Higgs bosons.

Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.

I agree the effect you point out can exist, I just don't want to slow
down the main case as a result.

I don't see any reason to think that what I suggested would slow things
down, especially not if the code were set up to fall through quickly in
the typical case where no page boundary is crossed. Integer division is
not slow on any machine made in the last 15 years or so.

regards, tom lane

#71Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#56)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Mon, Mar 5, 2012 at 8:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

That particular issue would be very hard to hit in practice, so I don't know
if this could explain the recovery failures that Jeff saw. I got the test
script running (thanks for that Jeff!), but unfortunately have not seen any
failures yet (aside from the issue with crossing xlogid boundary), with
either this or the older version of the patch.

Attached is a new version of the patch.

I've run patch v10 for 14109 cycles of crash and recovery, and there
were 8 assertion failures at "xlog.c", Line: 2106 during the
end-of-recovery checkpoint.

How many cycles have you run? Assuming the crashes follow a simple
binomial distribution with the frequency I see, you would have to run
for ~1230 cycles for a 50% chance of experiencing at least one, or for
~8120 cycles for a 99% chance.
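
For the record, that's just the geometric estimate
n = ln(1 - q) / ln(1 - p) with p = 8/14109; a quick check:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		p = 8.0 / 14109.0;

	printf("50%%: ~%.0f cycles\n", log(1 - 0.50) / log(1 - p));	/* ~1222 */
	printf("99%%: ~%.0f cycles\n", log(1 - 0.99) / log(1 - p));	/* ~8121 */
	return 0;
}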

I think Fujii's method of provoking this problem is more efficient
than mine, although I haven't tried it myself.

Dual Core AMD Opteron(tm) Processor 275
2.6.32.36-0.5-default #1 SMP 2011-04-14 10:12:31 +0200 x86_64 x86_64
x86_64 GNU/Linux

Cheers,

Jeff

#72Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#70)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 07.03.2012 17:28, Tom Lane wrote:

Simon Riggs<simon@2ndQuadrant.com> writes:

On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera<alvherre@commandprompt.com> writes:

So they are undoubtedly rare. Not sure if as rare as Higgs bosons.

Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.

I agree the effect you point out can exist, I just don't want to slow
down the main case as a result.

I don't see any reason to think that what I suggested would slow things
down, especially not if the code were set up to fall through quickly in
the typical case where no page boundary is crossed. Integer division is
not slow on any machine made in the last 15 years or so.

Agreed. I wasn't worried about the looping with extra-large records, but
might as well not do it.

Here's an updated patch. It now only loops once per segment that a
record crosses. Plus a lot of other small cleanup.

I've been doing some performance testing with this, using a simple C
function that just inserts a dummy WAL record of given size. I'm not
totally satisfied. Although the patch helps with scalability at 3-4
concurrent backends doing WAL insertions, it seems to slow down the
single-client case with small WAL records by about 5-10%. This is what
Robert also saw with an earlier version of the patch
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
tested this with the data directory on a RAM drive, unfortunately I
don't have a server with a hard drive that can sustain the high
insertion rate. I'll post more detailed results, once I've refined the
tests a bit.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#73Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#72)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Fri, Mar 9, 2012 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here's an updated patch. It now only loops once per segment that a record
crosses. Plus a lot of other small cleanup.

Thanks! But you forgot to attach the patch.

I've been doing some performance testing with this, using a simple C
function that just inserts a dummy WAL record of given size. I'm not totally
satisfied. Although the patch helps with scalability at 3-4 concurrent
backends doing WAL insertions, it seems to slow down the single-client case
with small WAL records by about 5-10%. This is what Robert also saw with an
earlier version of the patch
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
tested this with the data directory on a RAM drive, unfortunately I don't
have a server with a hard drive that can sustain the high insertion rate.
I'll post more detailed results, once I've refined the tests a bit.

I'm also doing performance test. If I get interesting result, I'll post it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#74Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#73)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 09.03.2012 12:34, Fujii Masao wrote:

On Fri, Mar 9, 2012 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here's an updated patch. It now only loops once per segment that a record
crosses. Plus a lot of other small cleanup.

Thanks! But you forgot to attach the patch.

Sorry, here you go.

I've been doing some performance testing with this, using a simple C
function that just inserts a dummy WAL record of given size. I'm not totally
satisfied. Although the patch helps with scalability at 3-4 concurrent
backends doing WAL insertions, it seems to slow down the single-client case
with small WAL records by about 5-10%. This is what Robert also saw with an
earlier version of the patch
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
tested this with the data directory on a RAM drive, unfortunately I don't
have a server with a hard drive that can sustain the high insertion rate.
I'll post more detailed results, once I've refined the tests a bit.

I'm also doing performance test. If I get interesting result, I'll post it.

Thanks!

BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm
planning to do a longer run with his test script - I only ran it for
about 1000 iterations - to see if I can reproduce the PANIC with both
the earlier patch version he tested, and this new one.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-12.patchtext/x-diff; name=xloginsert-scale-12.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 262,268 **** XLogRecPtr	XactLastRecEnd = {0, 0};
   * CHECKPOINT record).	We update this from the shared-memory copy,
   * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
   * hold the Insert lock).  See XLogInsert for details.	We are also allowed
!  * to update from XLogCtl->Insert.RedoRecPtr if we hold the info_lck;
   * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
   * InitXLOGAccess.
   */
--- 263,269 ----
   * CHECKPOINT record).	We update this from the shared-memory copy,
   * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
   * hold the Insert lock).  See XLogInsert for details.	We are also allowed
!  * to update from XLogCtl->RedoRecPtr if we hold the info_lck;
   * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
   * InitXLOGAccess.
   */
***************
*** 300,309 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * (protected by info_lck), but we don't need to cache any copies of it.
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 301,315 ----
   * (protected by info_lck), but we don't need to cache any copies of it.
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 315,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 321,393 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible,
+  *    insertpos_lck can be heavily contended on a busy system
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock. nextslot == lastslot means that
+  * all the slots are empty.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is also handled by WaitXLogInsertionsToFinish().
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 335,344 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 408,429 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment.
! 	 * The next record will be inserted there (or somewhere after it if
! 	 * there's not enough space on the current page).  PrevRecord points to
! 	 * the beginning of the last record already reserved.  It might not be
! 	 * fully copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 372,387 **** typedef struct XLogCtlWrite
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
--- 457,489 ----
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	512
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by insertpos_lck: */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot XLogInsertSlots[NumXLogInsertSlots];
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
+ 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
***************
*** 398,406 **** typedef struct XLogCtlData
  	XLogwrtResult LogwrtResult;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 500,517 ----
  	XLogwrtResult LogwrtResult;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock.  To change the identity of a buffer that's still
+ 	 * dirty, the old page needs to be written out first, and for that you
+ 	 * need WALWriteLock, and you need to ensure that there's no in-progress
+ 	 * insertions to the page by calling WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* latest initialized block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 478,505 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 589,620 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
! /*
!  * Macros to advance to next buffer index and insertion slot.
!  */
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
! #define NextSlotNo(idx)		(((idx) + 1) % NumXLogInsertSlots)
  
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 625,633 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 740,748 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 674,679 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 789,810 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  XLogInsertSlot *myslot,
+ 				  XLogRecPtr StartPos, XLogRecPtr EndPos);
+ static bool ReserveXLogInsertLocation(int size, bool forcePageWrites,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  XLogInsertSlot **myslot_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(volatile XLogInsertSlot *myslot,
+ 				  XLogRecPtr CurrPos);
+ static void	ReuseOldSlots(void);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+ static char *GetXLogBuffer(XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 694,705 **** XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
- 	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 825,830 ----
***************
*** 717,722 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 842,852 ----
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
+ 	XLogRecPtr	PrevRecord;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	EndPos;
+ 	XLogInsertSlot *myslot;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 734,742 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	 */
  	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		RecPtr.xlogid = 0;
! 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return RecPtr;
  	}
  
  	/*
--- 864,872 ----
  	 */
  	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		EndPos.xlogid = 0;
! 		EndPos.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return EndPos;
  	}
  
  	/*
***************
*** 903,1035 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	START_CRIT_SECTION();
  
! 	/* Now wait to get insert lock */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
  	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
  		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
  		}
  	}
! 
! 	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
! 	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
! 		rdt_lastnormal->next = NULL;
! 		info = info_orig;
! 		goto begin;
  	}
  
  	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
  	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
  	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
  
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
  
! 		END_CRIT_SECTION();
  
! 		return RecPtr;
  	}
  
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 1033,1138 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	/* Construct record header. */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set later */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to reserve space for the record from the WAL.
  	 */
! 	if (!ReserveXLogInsertLocation(write_len, doPageWrites, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   (XLogInsertSlot **) &myslot, &updrqst))
  	{
! 		/*
! 		 * Reservation failed. This could be because the record was an
! 		 * XLOG_SWITCH, and we're exactly at the start of a segment. In that
! 		 * case we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops).  Instead, make sure
! 		 * everything is written and flushed through the end of the prior
! 		 * segment, and return the prior segment's end address.
! 		 *
! 		 * The other reason for failure is that someone changed RedoRecPtr
! 		 * or forcePageWrites after we had constructed our WAL record. In
! 		 * that case we need to redo it with full-page data.
! 		 */
! 		END_CRIT_SECTION();
  
! 		if (isLogSwitch && !XLogRecPtrIsInvalid(EndPos))
  		{
! 			XLogFlush(EndPos);
! 			return EndPos;
! 		}
! 		else
! 		{
! 			rdt_lastnormal->next = NULL;
! 			info = info_orig;
! 			goto begin;
  		}
  	}
! 	else
  	{
! 		/*
! 		 * Reservation succeeded.  Finish the record header by setting
! 		 * prev-link (now that we know it), and finish computing the record's
! 		 * CRC (in CopyXLogRecordToWAL).  Then copy the record to the space
! 		 * we reserved.
! 		 */
! 		rechdr.xl_prev = PrevRecord;
! 		CopyXLogRecordToWAL(write_len, isLogSwitch, &rechdr,
! 							rdata, rdata_crc, myslot, StartPos, EndPos);
  	}
+ 	END_CRIT_SECTION();
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed a page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
  	/*
! 	 * If this was an XLOG_SWITCH record, flush the record and the empty
! 	 * padding space that fills the rest of the segment, and perform
! 	 * end-of-segment actions (eg, notifying archiver).
  	 */
! 	if (isLogSwitch)
  	{
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		XLogFlush(EndPos);
  
! 		/*
! 		 * Even though we reserved the rest of the segment for ourselves, which is
! 		 * reflected in EndPos, we return a pointer to just the end of the
! 		 * xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
  	}
  
! 	/*
! 	 * Update our global variables
! 	 */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1038,1219 **** begin:;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
- 	/* Record begin of record in appropriate places */
- 	ProcLastRecPtr = RecPtr;
- 	Insert->PrevRecord = RecPtr;
- 
- 	Insert->currpos += SizeOfXLogRecord;
- 	freespace -= SizeOfXLogRecord;
- 
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
! 	{
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == XLogCtl->Write.curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
! 		}
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
  			updrqst = true;
  		}
! 		else
  		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
  		}
- 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1141,1898 ----
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 EndPos.xlogid, EndPos.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will
! 	 * be stored as LSN for changed data pages...
  	 */
! 	return EndPos;
! }
! 
! /*
!  * Subroutine of XLogInsert.  Copies a WAL record to an already-reserved
!  * area in the WAL.
!  */
! static void
! CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 					XLogRecData *rdata, pg_crc32 rdata_crc,
! 					XLogInsertSlot *myslot_p,
! 					XLogRecPtr StartPos, XLogRecPtr EndPos)
! {
! 	volatile XLogInsertSlot *myslot = myslot_p;
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	CurrPos;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(CurrPos);
! 
! 	/* Copy the record header in place, and finish calculating CRC */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	XLByteAdvance(CurrPos, SizeOfXLogRecord);
  
! 	freespace = INSERT_FREESPACE(CurrPos);
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue on the next page.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				XLByteAdvance(CurrPos, freespace);
! 
! 				/*
! 				 * CurrPos now points to the page boundary, ie. the first byte
! 				 * of the next page. Advertise that position in our insertion
! 				 * slot before calling GetXLogBuffer(), because GetXLogBuffer()
! 				 * might need to wait for some insertions to finish so that it
! 				 * can write out a buffer to make room for the new page.
! 				 * Updating the slot before waiting for a new buffer ensures
! 				 * that we don't deadlock with ourselves if we run out of
! 				 * clean buffers.
! 				 *
! 				 * Note that we must not advance CurrPos past the page header
! 				 * yet, otherwise someone might try to flush up to that point,
! 				 * which would fail if the next page was not initialized yet.
! 				 */
! 				UpdateSlotCurrPos(myslot, CurrPos);
! 
! 				/*
! 				 * Get pointer to beginning of next page, and set the
! 				 * XLP_FIRST_IS_CONTRECORD flag in the page header.
! 				 *
! 				 * It's safe to set the contrecord flag without a lock on the
! 				 * page. All the other flags are set in AdvanceXLInsertBuffer,
! 				 * and we're the only backend that needs to set the contrecord
! 				 * flag.
! 				 */
! 				currpos = GetXLogBuffer(CurrPos);
! 				((XLogPageHeader) currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 
! 				/* skip over the page header, and write continuation record */
! 				if (CurrPos.xrecoff % XLogSegSize == 0)
! 				{
! 					CurrPos.xrecoff += SizeOfXLogLongPHD;
! 					currpos += SizeOfXLogLongPHD;
! 				}
! 				else
! 				{
! 					CurrPos.xrecoff += SizeOfXLogShortPHD;
! 					currpos += SizeOfXLogShortPHD;
! 				}
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			XLByteAdvance(CurrPos, rdata->len);
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		/* Align the end position, so that the next record starts aligned */
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		if (CurrPos.xrecoff >= XLogFileSize)
! 		{
! 			/* crossed a logid boundary */
! 			CurrPos.xlogid += 1;
! 			CurrPos.xrecoff = 0;
! 		}
! 
! 		if (!XLByteEQ(CurrPos, EndPos))
! 			elog(PANIC, "space reserved for WAL record does not match what was written");
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
+ 
+ 		/*
+ 		 * An xlog-switch record consumes all the remaining space on the
+ 		 * WAL segment. We have already reserved it for ourselves, but we
+ 		 * still need to make sure it's been allocated and zeroed in the WAL
+ 		 * buffers so that when the caller (or someone else) does XLogWrite(),
+ 		 * it can really write out all the zeros.
+ 		 *
+ 		 * We do this one page at a time, to make sure we don't deadlock
+ 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
+ 		 */
+ 		Assert(EndPos.xrecoff % XLogSegSize == 0);
  
! 		/* Use up all the remaining space on the first page */
! 		XLByteAdvance(CurrPos, freespace);
! 
! 		while (XLByteLT(CurrPos, EndPos))
! 		{
! 			/*
! 			 * Like in the non-xlog-switch codepath, let others know that
! 			 * we're done writing up to the end of this page.
! 			 */
! 			UpdateSlotCurrPos(myslot, CurrPos);
! 			/* initialize the next page (if not initialized already) */
! 			AdvanceXLInsertBuffer(CurrPos, false);
! 			XLByteAdvance(CurrPos, XLOG_BLCKSZ);
! 		}
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslot, InvalidXLogRecPtr);
  
  	/*
! 	 * When we run out of insertion slots, the next inserter has to grab the
! 	 * WALInsertTailLock to clean up some old slots.  That stalls all new
! 	 * insertions. The WAL writer process cleans up old slots periodically,
! 	 * but on a busy system that might not be enough. So we try to clean up
! 	 * old ones every time we've gone through 1/4 of all the slots.
  	 */
! 	if ((myslot_p - XLogCtl->XLogInsertSlots) % (NumXLogInsertSlots / 4) == 0)
! 		ReuseOldSlots();
! }
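/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * essence of the page-crossing copy loop in CopyXLogRecordToWAL, reduced to
 * offset arithmetic.  PAGE_SZ, PAGE_HDR_SZ and CONT_HDR_SZ are made-up
 * stand-ins for XLOG_BLCKSZ, SizeOfXLogShortPHD and SizeOfXLogContRecord;
 * only the splitting logic is meant to correspond.
 */
#include <assert.h>

#define PAGE_SZ		64		/* stand-in for XLOG_BLCKSZ */
#define PAGE_HDR_SZ	4		/* stand-in for SizeOfXLogShortPHD */
#define CONT_HDR_SZ	4		/* stand-in for SizeOfXLogContRecord */

/*
 * "Copy" a record of 'len' bytes starting at offset 'off'.  Whenever the
 * current page fills up, skip the next page's header and leave room for a
 * continuation record, whose xl_rem_len would be the bytes still to come.
 * Returns the offset just past the record.
 */
static int
copy_record(int off, int len)
{
	int		freespace = PAGE_SZ - (off % PAGE_SZ);

	while (len > freespace)
	{
		len -= freespace;			/* fill the rest of this page */
		off += freespace;
		assert(off % PAGE_SZ == 0);	/* now at a page boundary */
		off += PAGE_HDR_SZ + CONT_HDR_SZ;
		freespace = PAGE_SZ - PAGE_HDR_SZ - CONT_HDR_SZ;
	}
	return off + len;
}

int
main(void)
{
	/* a 200-byte record starting just after the first page's header */
	assert(copy_record(PAGE_HDR_SZ, 200) == 228);
	return 0;
}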
  
! /*
!  * Reserves the right amount of space for a record of given size from the WAL.
!  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
!  * its end+1, and *PrevRecord_p to the beginning of the previous record to set
!  * to the prev-link of the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one. In that case the caller should update the shared
!  * LogwrtRqst value after it's done inserting the record, so that the WAL
!  * page that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting position
!  * (or the end of the previous record, to be exact) to let others know that we're
!  * busy inserting to the reserved area. The caller must clear it when the
!  * insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance critical part of XLogInsert that must be serialized
!  * across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size, bool didPageWrites,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  XLogInsertSlot **myslot_p, bool *updrqst_p)
! {
! 	volatile XLogInsertSlot *myslot;
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	XLogRecPtr	BeginCurrPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
! 
! 	/* log-switch records should contain no data */
! 	Assert(!isLogSwitch || size == 0);
  
! 	size = SizeOfXLogRecord + size;
  
+ retry:
+ 	SpinLockAcquire(&Insert->insertpos_lck);
+ 
+ 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
+ 		(!didPageWrites && (Insert->forcePageWrites || Insert->fullPageWrites)))
+ 	{
  		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
  		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
! 
! 	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
! 	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant, and retry.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitXLogInsertionsToFinish(InvalidXLogRecPtr);
! 		goto retry;
! 	}
! 
! 	/*
! 	 * Got the slot. Now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		XLByteAdvance(ptr, freespace);
! 		BeginCurrPos = ptr;
! 
! 		if (ptr.xrecoff % XLogSegSize == 0)
! 			ptr.xrecoff += SizeOfXLogLongPHD;
! 		else
! 			ptr.xrecoff += SizeOfXLogShortPHD;
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
! 	else
! 		BeginCurrPos = ptr;
  
! 	/*
! 	 * We are now at the starting position of our record. Now figure out how
! 	 * the data will be split across the WAL pages, to calculate where the
! 	 * record ends.
! 	 */
! 	StartPos = ptr;
  
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start
! 		 * of a segment, we need not insert it (and don't want to because
! 		 * we'd like consecutive switch requests to be no-ops). Otherwise the
! 		 * XLOG_SWITCH record should consume all the remaining space on the
! 		 * current segment.
  		 */
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
  
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
! 			*myslot_p = NULL;
  
! 			return false;
! 		}
! 		else
! 		{
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				XLByteAdvance(ptr, segleft);
! 			}
! 			updrqst = true;
! 		}
  	}
  	else
  	{
! 		/*
! 		 * A normal record, ie. not xlog-switch. Calculate how the record will
! 		 * be laid out across WAL pages. The straightforward way to do this
! 		 * would be a loop that fills in the WAL pages one at a time, tracking
! 		 * how much of the size is still left.  That's how
! 		 * CopyXLogRecordToWAL() works when actually copying the data.
! 		 * However, if the record spans many pages, we want to avoid looping,
! 		 * to keep this spinlock-protected section as short as possible.
! 		 */
! 		int		sizeleft = size;
  
! 		if (sizeleft > freespace)
  		{
! 			int		pagesneeded;
! 			int		pagesleftonseg;
! 			int		fullpages;
! 
! 			/* First fill the first page with as much data as fits. */
! 			sizeleft -= freespace;
! 			XLByteAdvance(ptr, freespace);
! 
! 			/* We're now positioned at the beginning of the next page */
! 			Assert(ptr.xrecoff % XLOG_BLCKSZ == 0);
! 			do
! 			{
! 				/*
! 				 * If we're positioned at the beginning of a segment, take
! 				 * into account that the first page needs a long header.
! 				 */
! 				if (ptr.xrecoff % XLOG_SEG_SIZE == 0)
! 					sizeleft += (SizeOfXLogLongPHD - SizeOfXLogShortPHD);
! 
! 				/*
! 				 * Calculate the number of extra pages we need.  Each page
! 				 * will have a continuation record at the beginning.
! 				 *
! 				 * We do the calculation assuming that all the pages have a
! 				 * short header.  We don't know whether we have to cross to
! 				 * the next segment until we've calculated how many pages we
! 				 * need. If it turns out that we do, we'll fill up the current
! 				 * segment, and loop back to add the long page header to
! 				 * sizeleft, and continue calculation from there.
! 				 */
! #define SpaceOnXLogPage	(XLOG_BLCKSZ - SizeOfXLogShortPHD - SizeOfXLogContRecord)
! 				pagesneeded = (sizeleft + SpaceOnXLogPage - 1) / SpaceOnXLogPage;
! 
! 				pagesleftonseg = (XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE)) / XLOG_BLCKSZ;
! 
! 				if (pagesneeded <= pagesleftonseg)
! 				{
! 					/*
! 					 * Fits in this segment. Skip over all the full pages, to
! 					 * the last page that will (possibly) be only partially
! 					 * filled.
! 					 */
! 					fullpages = pagesneeded - 1;
! 				}
! 				else
! 				{
! 					/*
! 					 * Doesn't fit in this segment. Fit as much as does, and
! 					 * continue from next segment.
! 					 */
! 					fullpages = pagesleftonseg;
! 				}
! 
! 				sizeleft -= fullpages * SpaceOnXLogPage;
! 				XLByteAdvance(ptr, fullpages * XLOG_BLCKSZ);
! 			} while (pagesneeded > pagesleftonseg);
! 
! 			/*
! 			 * We're now positioned at the beginning of the last page this
! 			 * record spans.  The rest should fit on this page.
! 			 *
! 			 * Note: We already took into account the long header above.
! 			 */
! 			ptr.xrecoff += SizeOfXLogShortPHD;
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 
! 			Assert(sizeleft <= INSERT_FREESPACE(ptr));
! 
  			updrqst = true;
  		}
! 
! 		/* the rest fits on this page */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
! 		if (ptr.xrecoff >= XLogFileSize)
  		{
! 			/* crossed a logid boundary */
! 			ptr.xlogid += 1;
! 			ptr.xrecoff = 0;
  		}
  	}
  
! 	/* Update the shared state, and our slot, before releasing the lock */
! 	myslot = &XLogCtl->XLogInsertSlots[nextslot];
! 	myslot->CurrPos = BeginCurrPos;
  
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = NextSlotNo(nextslot);
! 
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslot_p = (XLogInsertSlot *) myslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
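/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * loop-free space calculation in ReserveXLogInsertLocation, under the
 * simplifying assumption that no segment boundary (and hence no long page
 * header) is crossed.  The constants are the same made-up stand-ins as in
 * the copy-loop sketch, and, per the NB above, the result must agree with
 * what the copy loop in CopyXLogRecordToWAL actually writes.
 */
#include <assert.h>

#define PAGE_SZ		64
#define PAGE_HDR_SZ	4
#define CONT_HDR_SZ	4
#define USABLE_PER_PAGE	(PAGE_SZ - PAGE_HDR_SZ - CONT_HDR_SZ)

static int
reserve_end(int start, int size)
{
	int		freespace = PAGE_SZ - (start % PAGE_SZ);
	int		pages;

	if (size <= freespace)
		return start + size;		/* fits on the current page */

	size -= freespace;
	start += freespace;				/* now at a page boundary */

	/* skip the whole pages; the last page is only partially filled */
	pages = (size + USABLE_PER_PAGE - 1) / USABLE_PER_PAGE;
	start += (pages - 1) * PAGE_SZ;
	size -= (pages - 1) * USABLE_PER_PAGE;

	return start + PAGE_HDR_SZ + CONT_HDR_SZ + size;
}

int
main(void)
{
	/* same example as the copy-loop sketch: both must land on 228 */
	assert(reserve_end(PAGE_HDR_SZ, 200) == 228);
	return 0;
}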
! 
! /*
!  * Update slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(volatile XLogInsertSlot *myslot, XLogRecPtr CurrPos)
! {
! 	PGPROC	   *head;
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Doesn't the spinlock
! 	 * acquire/release act as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
! 	while (head != NULL)
  	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
! 	}
! }
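/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * publish-then-wake protocol between UpdateSlotCurrPos and
 * WaitXLogInsertionsToFinish, re-expressed with a pthread mutex and
 * condition variable instead of the per-process PGSemaphores and
 * lwWaitLink list the patch uses.  'pos' plays the role of CurrPos, with
 * 0 standing in for InvalidXLogRecPtr ("insertion finished").
 */
#include <pthread.h>

typedef struct
{
	pthread_mutex_t	lck;	/* analogue of slot->lck */
	pthread_cond_t	moved;	/* analogue of the slot's wait queue */
	unsigned long	pos;	/* analogue of slot->CurrPos; 0 = finished */
} insert_slot;

/* Inserter: advertise progress and wake everyone waiting on this slot. */
static void
slot_update(insert_slot *slot, unsigned long pos)
{
	pthread_mutex_lock(&slot->lck);
	slot->pos = pos;
	pthread_cond_broadcast(&slot->moved);
	pthread_mutex_unlock(&slot->lck);
}

/* Waiter: block until the insertion has passed 'upto', or has finished. */
static void
slot_wait(insert_slot *slot, unsigned long upto)
{
	pthread_mutex_lock(&slot->lck);
	while (slot->pos != 0 && slot->pos < upto)
		pthread_cond_wait(&slot->moved, &slot->lck);
	pthread_mutex_unlock(&slot->lck);
}

int
main(void)
{
	insert_slot s = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

	slot_update(&s, 100);
	slot_wait(&s, 50);		/* returns at once: 100 >= 50 */
	return 0;
}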
  
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an
!  * XLOG insertion slot with CurrPos set to 'ptr'. Setting it to some value
!  * less than 'ptr' would suffice for GetXLogBuffer(), but risks deadlock:
!  * If we have to evict a buffer, we might have to wait for someone else to
!  * finish a write. And that someone else might not be able to finish the
!  * write, if our CurrPos points to a buffer that's still in the buffer cache.
!  */
! static char *
! GetXLogBuffer(XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 	static uint32 cachedXlogid = 0;
! 	static uint32 cachedPage = 0;
! 	static char *cachedPos = NULL;
! 	XLogRecPtr	expectedEndPtr;
! 
! 	/*
! 	 * Fast path for the common case that we need to access the same page
! 	 * as last time.
! 	 */
! 	if (ptr.xlogid == cachedXlogid && ptr.xrecoff / XLOG_BLCKSZ == cachedPage)
! 		return cachedPos + ptr.xrecoff % XLOG_BLCKSZ;
! 
! 	cachedXlogid = ptr.xlogid;
! 	cachedPage = ptr.xrecoff / XLOG_BLCKSZ;
! 
! 	/*
! 	 * The XLog buffer cache is organized so that a page must always be
! 	 * loaded into a particular buffer.  That way we can easily calculate
! 	 * the buffer a given page must be loaded into, from the XLogRecPtr alone.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read" of
! 	 * the XLogRecPtr, and see a bogus value. That's ok, we'll grab the
! 	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
! 	 * other than the page we're looking for. But it means that when we do this
! 	 * unlocked read, we might see a value that appears to be ahead of the
! 	 * page we're looking for. Don't PANIC on that, until we've verified the
! 	 * value while holding the lock.
! 	 */
! 	expectedEndPtr.xlogid = ptr.xlogid;
! 	expectedEndPtr.xrecoff = ptr.xrecoff - ptr.xrecoff % XLOG_BLCKSZ + XLOG_BLCKSZ;
! 
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (!XLByteEQ(expectedEndPtr, endptr))
! 	{
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (!XLByteEQ(expectedEndPtr, endptr))
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
  	}
  
! 	/*
! 	 * Found the buffer holding this page. Return a pointer to the right
! 	 * offset within the page.
! 	 */
! 	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
! 	return XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
! 		ptr.xrecoff % XLOG_BLCKSZ;
! }
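/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * direct-mapped buffer lookup that GetXLogBuffer depends on.  Because a
 * given WAL page can only ever live in one particular buffer, the buffer
 * index follows from the position alone; a reader then verifies the match
 * against xlblocks[], which records each buffer's page *end* pointer.  For
 * simplicity this assumes a flat 64-bit byte position instead of the
 * xlogid/xrecoff pair.
 */
#include <stdint.h>

#define BLCKSZ		8192
#define NBUFFERS	16			/* stand-in for XLOGbuffers */

static int
buf_idx(uint64_t pos)			/* analogue of XLogRecPtrToBufIdx */
{
	return (int) ((pos / BLCKSZ) % NBUFFERS);
}

static int
page_is_loaded(uint64_t pos, const uint64_t *xlblocks)
{
	uint64_t	expected_end = pos - pos % BLCKSZ + BLCKSZ;

	return xlblocks[buf_idx(pos)] == expected_end;
}

int
main(void)
{
	uint64_t	xlblocks[NBUFFERS] = {0};
	uint64_t	pos = 5 * BLCKSZ + 123;

	xlblocks[buf_idx(pos)] = 6 * BLCKSZ;	/* page 5 loaded in its buffer */
	return page_is_loaded(pos, xlblocks) ? 0 : 1;
}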
  
! /*
!  * Try to mark old insertion slots as free for reuse.
!  */
! static void
! ReuseOldSlots(void)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
  
! 	/* Give up if someone else is already doing this */
! 	if (!LWLockConditionalAcquire(WALInsertTailLock, LW_EXCLUSIVE))
! 		return;
! 
! 	lastslot = Insert->lastslot;
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	nextslot = Insert->nextslot;
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	while (lastslot != nextslot)
! 	{
! 		/*
! 		 * Check if the oldest slot is still in use. We don't do any locking
! 		 * here, we just give up as soon as we find a slot that's still in
! 		 * use.
! 		 */
! 		volatile XLogInsertSlot *slot;
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 
! 		if (!XLByteEQ(slot->CurrPos, InvalidXLogRecPtr))
! 			break;
! 
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
! }
! 
! /*
!  * Wait for any insertions < upto to finish. If upto is invalid, we wait until
!  * at least one slot is available for insertion.
!  *
!  * Returns a value >= upto, which indicates the oldest in-progress insertion
!  * that we saw in the array, or InvalidXLogRecPtr if there are no insertions
!  * in-progress at exit.
!  */
! static XLogRecPtr
! WaitXLogInsertionsToFinish(XLogRecPtr upto)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 	XLogRecPtr	LastPos = InvalidXLogRecPtr;
! 	int			extraWaits = 0;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! retry:
! 	/*
! 	 * Read lastslot and nextslot. lastslot cannot change while we hold the
! 	 * tail-lock. nextslot can advance while we run, but not beyond
! 	 * lastslot - 1. We still have to acquire insertpos_lck to make sure that
! 	 * we see the CurrPos of the latest slot correctly.
! 	 */
! 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 	lastslot = Insert->lastslot;
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	nextslot = Insert->nextslot;
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	while (lastslot != nextslot)
! 	{
! 		/*
! 		 * Examine the oldest slot still in use.
! 		 */
! 		volatile XLogInsertSlot *slot;
! 		XLogRecPtr	slotptr;
! 
! 		slot = &XLogCtl->XLogInsertSlots[lastslot];
! 
! 		/* First, a quick check without the lock. */
! 		if (XLByteEQ(slot->CurrPos, InvalidXLogRecPtr))
! 		{
! 			lastslot = NextSlotNo(lastslot);
! 			continue;
! 		}
! 
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
! 
! 		if (XLByteEQ(slotptr, InvalidXLogRecPtr))
! 		{
! 			/*
! 			 * The insertion has already finished, we just need to advance
! 			 * lastslot to make the slot available for reuse.
! 			 */
! 			SpinLockRelease(&slot->lck);
! 			lastslot = NextSlotNo(lastslot);
! 			continue;
! 		}
! 		else
! 		{
! 			/*
! 			 * The insertion is still in-progress. If we just needed
! 			 * any slot to become available and there is at least one slot
! 			 * free now, or if this slot's CurrPos >= upto, we can
! 			 * stop here. Otherwise we have to wait for it to finish.
! 			 */
! 			if ((XLogRecPtrIsInvalid(upto) && NextSlotNo(nextslot) != lastslot)
! 				|| (!XLogRecPtrIsInvalid(upto) && XLByteLE(upto, slotptr)))
! 			{
! 				SpinLockRelease(&slot->lck);
! 				LastPos = slotptr;
! 				break;
! 			}
! 			else
! 			{
! 				/* Wait for this insertion to finish. */
! 				MyProc->lwWaiting = true;
! 				MyProc->lwWaitMode = 0; /* doesn't matter */
! 				MyProc->lwWaitLink = NULL;
! 				if (slot->head == NULL)
! 					slot->head = MyProc;
! 				else
! 					slot->tail->lwWaitLink = MyProc;
! 				slot->tail = MyProc;
! 				SpinLockRelease(&slot->lck);
! 
! 				Insert->lastslot = lastslot;
! 				LWLockRelease(WALInsertTailLock);
! 				for (;;)
! 				{
! 					PGSemaphoreLock(&MyProc->sem, false);
! 					if (!MyProc->lwWaiting)
! 						break;
! 					extraWaits++;
! 				}
! 
! 				/*
! 				 * The insertion has now finished. Start all over. While we
! 				 * were not holding the tail-lock, someone might've filled up
! 				 * all slots again.
! 				 */
! 				goto retry;
! 			}
! 		}
! 	}
! 
! 	/* Update lastslot before we release the lock */
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
! 
! 	while (extraWaits-- > 0)
! 		PGSemaphoreUnlock(&MyProc->sem);
! 
! 	return LastPos;
  }
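/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * nextslot/lastslot pair forms a ring, with one slot always left unused so
 * that "full" can be told apart from "empty".  NextSlotNo itself is not
 * shown in this excerpt; a plain modulo wraparound is assumed here.
 */
#define NUM_SLOTS	64			/* stand-in for NumXLogInsertSlots */

static int
next_slot_no(int slotno)
{
	return (slotno + 1) % NUM_SLOTS;
}

/* "Caught our tail": reserving now would step on the oldest in-use slot. */
static int
ring_full(int nextslot, int lastslot)
{
	return next_slot_no(nextslot) == lastslot;
}

/* Nothing in progress: nothing to wait for, nothing to reclaim. */
static int
ring_empty(int nextslot, int lastslot)
{
	return nextslot == lastslot;
}

int
main(void)
{
	return (ring_full(0, 1) && ring_empty(3, 3)) ? 0 : 1;
}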
  
  /*
***************
*** 1440,1469 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 2119,2151 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or if 'opportunistic' is
!  * true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	int			npages = 0;
+ 
+ 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Now that we have the lock, check if someone initialized the page
+ 	 * already.
+ 	 */
+ /* XXX: fix indentation before commit */
+ while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
+ {
+ 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1473,1482 **** AdvanceXLInsertBuffer(bool new_segment)
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  	{
! 		/* nope, got work to do... */
! 		XLogRecPtr	FinishedPageRqstPtr;
! 
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2155,2166 ----
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1484,1504 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
- 		update_needed = false;	/* Did the shared-request update */
- 
  		/*
  		 * Now that we have an up-to-date LogwrtResult value, see if we still
  		 * need to write it or if someone else already did.
  		 */
  		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
--- 2168,2194 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
  		/*
  		 * Now that we have an up-to-date LogwrtResult value, see if we still
  		 * need to write it or if someone else already did.
  		 */
  		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
***************
*** 1508,1525 **** AdvanceXLInsertBuffer(bool new_segment)
  			}
  			else
  			{
! 				/*
! 				 * Have to write buffers while holding insert lock. This is
! 				 * not good, so only write as much as we absolutely must.
! 				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2198,2215 ----
  			}
  			else
  			{
! 				/* Have to write it ourselves */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1527,1540 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2217,2223 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1544,1556 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
! 
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
  
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2227,2236 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1594,1604 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2274,2301 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, up to %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
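/*
 * Illustrative sketch (separate from the patch above, not part of it): the
 * barrier-then-publish idiom used when a freshly initialized page is
 * advertised in xlblocks[], re-expressed with C11 release/acquire atomics.
 * The release store stands for pg_write_barrier() plus the volatile store
 * to xlblocks[nextidx]; the acquire load is what the lock-free check in
 * GetXLogBuffer amounts to.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static char				page[8192];		/* stand-in for one WAL buffer */
static _Atomic uint64_t	page_end;		/* stand-in for xlblocks[nextidx] */

static void
publish_page(uint64_t endptr)
{
	memset(page, 0, sizeof(page));		/* "initialize" the page contents */
	/* everything written to 'page' above is visible before page_end moves */
	atomic_store_explicit(&page_end, endptr, memory_order_release);
}

static int
page_ready(uint64_t expected_end)
{
	/* if this matches, the page initialization is guaranteed visible too */
	return atomic_load_explicit(&page_end, memory_order_acquire)
		== expected_end;
}

int
main(void)
{
	publish_page(8192);
	return page_ready(8192) ? 0 : 1;
}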
  
  /*
***************
*** 1643,1658 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2340,2351 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. And you must've called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1701,1714 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2394,2407 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1805,1820 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2498,2510 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 2066,2073 **** XLogFlush(XLogRecPtr record)
  	 */
  	for (;;)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
--- 2756,2767 ----
  	 */
  	for (;;)
  	{
! 		/* use volatile pointers to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 		uint32		freespace;
+ 		XLogRecPtr	insertpos,
+ 					inprogresspos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2081,2086 **** XLogFlush(XLogRecPtr record)
--- 2775,2812 ----
  			break;
  
  		/*
+ 		 * Get the current insert position.
+ 		 *
+ 		 * XXX: This used to do LWLockConditionalAcquire(WALInsertLock),
+ 		 * falling back to writing just up to 'record' if we couldn't get the
+ 		 * lock. I wonder if it would be a good idea to have a
+ 		 * SpinLockConditionalAcquire function and use that? On one hand,
+ 		 * it would be good to not cause more contention on the lock if it's
+ 		 * busy, but on the other hand, this spinlock is much more lightweight
+ 		 * than the WALInsertLock was, so maybe it's better to just grab the
+ 		 * spinlock. In fact, LWLockConditionalAcquire did a spinlock acquire
+ 		 * + release, anyway. Also note that if we stored the XLogRecPtr as
+ 		 * one 64-bit integer, we could just read it with no lock on platforms
+ 		 * where 64-bit integer accesses are atomic, which covers many common
+ 		 * platforms nowadays.
+ 		 */
+ 		SpinLockAcquire(&Insert->insertpos_lck);
+ 		insertpos = Insert->CurrPos;
+ 		SpinLockRelease(&Insert->insertpos_lck);
+ 
+ 		freespace = INSERT_FREESPACE(insertpos);
+ 		if (freespace < SizeOfXLogRecord)	/* buffer is full */
+ 			insertpos.xrecoff += freespace;
+ 
+ 		/*
+ 		 * Before actually performing the write, wait for all in-flight
+ 		 * insertions to the pages we're about to write to finish.
+ 		 */
+ 		inprogresspos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+ 		if (!XLogRecPtrIsInvalid(inprogresspos))
+ 			insertpos = inprogresspos;
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2100,2128 **** XLogFlush(XLogRecPtr record)
  		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2826,2835 ----
  		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2237,2243 **** XLogBackgroundFlush(void)
  
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
--- 2944,2951 ----
  
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
***************
*** 2246,2256 **** XLogBackgroundFlush(void)
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
- 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2954,2971 ----
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5059,5064 **** XLOGShmemInit(void)
--- 5774,5780 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 5084,5089 **** XLOGShmemInit(void)
--- 5800,5816 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 0;
+ 	XLogCtl->Insert.lastslot = 0;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5098,5104 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
! 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
--- 5825,5831 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
! 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
***************
*** 5980,5985 **** StartupXLOG(void)
--- 6707,6713 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6232,6238 **** StartupXLOG(void)
  
  	lastFullPageWrites = checkPoint.fullPageWrites;
  
! 	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
  		ereport(PANIC,
--- 6960,6966 ----
  
  	lastFullPageWrites = checkPoint.fullPageWrites;
  
! 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
  		ereport(PANIC,
***************
*** 6786,6793 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7514,7525 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
! 	firstIdx = XLogRecEndPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6795,6804 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
--- 7527,7535 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
***************
*** 6807,6818 **** StartupXLOG(void)
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7538,7549 ----
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(EndOfLog);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndOfLog.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6824,6830 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7555,7561 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 7307,7327 **** InitXLOGAccess(void)
  }
  
  /*
!  * Once spawned, a backend may update its local RedoRecPtr from
!  * XLogCtl->Insert.RedoRecPtr; it must hold the insert lock or info_lck
!  * to do so.  This is done in XLogInsert() or GetRedoRecPtr().
   */
  XLogRecPtr
  GetRedoRecPtr(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
  
  	SpinLockAcquire(&xlogctl->info_lck);
! 	Assert(XLByteLE(RedoRecPtr, xlogctl->Insert.RedoRecPtr));
! 	RedoRecPtr = xlogctl->Insert.RedoRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	return RedoRecPtr;
  }
  
--- 8038,8066 ----
  }
  
  /*
!  * Return the current Redo pointer from shared memory.
!  *
!  * As a side-effect, the local RedoRecPtr copy is updated.
   */
  XLogRecPtr
  GetRedoRecPtr(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogRecPtr ptr;
  
+ 	/*
+ 	 * The possibly not up-to-date copy in XLogCtl is enough. Even if we
+ 	 * grabbed insertpos_lck to read the master copy, someone might update
+ 	 * it just after we've released the lock.
+ 	 */
  	SpinLockAcquire(&xlogctl->info_lck);
! 	ptr = xlogctl->RedoRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
+ 	if (XLByteLT(RedoRecPtr, ptr))
+ 		RedoRecPtr = ptr;
+ 
  	return RedoRecPtr;
  }
  
***************
*** 7330,7336 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 8069,8075 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7592,7597 **** LogCheckpointEnd(bool restartpoint)
--- 8331,8338 ----
  void
  CreateCheckPoint(int flags)
  {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
  	bool		shutdown;
  	CheckPoint	checkPoint;
  	XLogRecPtr	recptr;
***************
*** 7606,7611 **** CreateCheckPoint(int flags)
--- 8347,8353 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7674,7683 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
  	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8416,8426 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold insertpos_lck while examining insert state to determine
  	 * the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7689,7695 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8432,8438 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7698,7712 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8441,8452 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7733,7750 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
  	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
  	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock AND the info_lck.
  	 *
  	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
  	 * pointing past where it really needs to point.  This is okay; the only
--- 8473,8492 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
  	{
! 		XLByteAdvance(curInsert, freespace);
! 		if (curInsert.xrecoff % XLogSegSize == 0)
! 			curInsert.xrecoff += SizeOfXLogLongPHD;
! 		else
! 			curInsert.xrecoff += SizeOfXLogShortPHD;
  	}
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock.
  	 *
  	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
  	 * pointing past where it really needs to point.  This is okay; the only
***************
*** 7753,7772 **** CreateCheckPoint(int flags)
  	 * XLogInserts that happen while we are dumping buffers must assume that
  	 * their buffer changes are not included in the checkpoint.
  	 */
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
  	/*
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8495,8512 ----
  	 * XLogInserts that happen while we are dumping buffers must assume that
  	 * their buffer changes are not included in the checkpoint.
  	 */
! 	RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
  
  	/*
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	/* Update the info_lck-protected copy of RedoRecPtr as well */
! 	SpinLockAcquire(&xlogctl->info_lck);
! 	xlogctl->RedoRecPtr = checkPoint.redo;
! 	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7786,7792 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8526,8532 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8153,8167 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
- 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8893,8910 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), hold insertpos_lck to update it, although
! 	 * during recovery acquiring insertpos_lck is just pro forma, because no
! 	 * WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
+ 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
+ 
+ 	/* Also update the info_lck-protected copy */
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	xlogctl->RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8448,8454 **** XLogReportParameters(void)
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
--- 9191,9197 ----
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
***************
*** 8471,8479 **** UpdateFullPageWrites(void)
  	 */
  	if (fullPageWrites)
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  		Insert->fullPageWrites = true;
! 		LWLockRelease(WALInsertLock);
  	}
  
  	/*
--- 9214,9222 ----
  	 */
  	if (fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
  		Insert->fullPageWrites = true;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
  
  	/*
***************
*** 8494,8502 **** UpdateFullPageWrites(void)
  
  	if (!fullPageWrites)
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  		Insert->fullPageWrites = false;
! 		LWLockRelease(WALInsertLock);
  	}
  	END_CRIT_SECTION();
  }
--- 9237,9245 ----
  
  	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
  		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
  	END_CRIT_SECTION();
  }
***************
*** 9024,9029 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9767,9773 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9086,9111 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9830,9855 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9218,9230 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 9962,9974 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9308,9334 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		Assert(XLogCtl->Insert.exclusiveBackup);
! 		XLogCtl->Insert.exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10052,10079 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		Assert(Insert->exclusiveBackup);
! 		Insert->exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9341,9346 **** pg_start_backup_callback(int code, Datum arg)
--- 10086,10092 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9394,9402 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10140,10148 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9405,9420 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10151,10166 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9692,9707 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10438,10455 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9755,9766 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10503,10514 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1753,1761 **** GetOldestActiveTransactionId(void)
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
!  * it's already inserted its commit record.  Whether it takes a little while
!  * for clearing of inCommit to propagate is unimportant for correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
--- 1753,1762 ----
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes a lock to write the WAL record, so we cannot fail to see an xact as
!  * inCommit if it's already inserted its commit record.  Whether it takes a
!  * little while for clearing of inCommit to propagate is unimportant for
!  * correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#75Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#74)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Thanks!

BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm
planning to do a longer run with his test script - I only run it for about
1000 iterations - to see if I can reproduce the PANIC with both the earlier
patch version he tested, and this new one.

Hi Heikki,

I've run the v12 patch for 17,489 rounds of crash and recovery, and
detected no assertion failures or other problems.

Cheers,

Jeff

#76Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#72)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 09.03.2012 12:04, Heikki Linnakangas wrote:

I've been doing some performance testing with this, using a simple C
function that just inserts a dummy WAL record of given size. I'm not
totally satisfied. Although the patch helps with scalability at 3-4
concurrent backends doing WAL insertions, it seems to slow down the
single-client case with small WAL records by about 5-10%. This is what
Robert also saw with an earlier version of the patch
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
tested this with the data directory on a RAM drive, unfortunately I
don't have a server with a hard drive that can sustain the high
insertion rate. I'll post more detailed results, once I've refined the
tests a bit.

So, here are more detailed test results, using Greg Smith's excellent
pgbench-tools test suite:

http://community.enterprisedb.com/xloginsert-scale-tests/

The workload in all of these tests was a simple C function that writes a
lot of very small WAL records, with 16 bytes of payload each. I ran the
tests with the data directory on a regular hard drive, on an SSD, and on
a ram drive (/dev/shm). With HDD, I also tried fsync=off and
synchronous_commit=off. For each of those, I ran the tests with 1-16
concurrent backends.
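
The test function was essentially of this shape. This is a minimal
sketch of the idea, not the exact function used for these tests; the
function name and the choice of RM_XLOG_ID/XLOG_NOOP are illustrative
assumptions (payload_size must be > 0, since XLogInsert() rejects
zero-length records):

#include "postgres.h"
#include "access/xlog.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(insert_dummy_xlog);

/*
 * Insert one dummy WAL record carrying payload_size bytes of zeros.
 * XLOG_NOOP records are ignored at redo, so this exercises only the
 * insertion path.
 */
Datum
insert_dummy_xlog(PG_FUNCTION_ARGS)
{
	int32		payload_size = PG_GETARG_INT32(0);
	char	   *payload = palloc0(payload_size);
	XLogRecData rdata;

	rdata.data = payload;
	rdata.len = payload_size;
	rdata.buffer = InvalidBuffer;	/* no buffer, hence no full-page image */
	rdata.buffer_std = false;
	rdata.next = NULL;

	(void) XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);

	pfree(payload);
	PG_RETURN_VOID();
}

Exposed with CREATE FUNCTION ... LANGUAGE C and called in a tight loop,
this gives a workload that is almost pure XLogInsert().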

Summary: The patch hurts single-backend performance by about 10%, except
for the synchronous_commit=off test. Between 2-6 clients, it either
helps, doesn't make any difference, or hurts. With > 6 clients, it hurts.

So, that's quite disappointing. The patch has two problems: the 10%
slowdown in the single-client case, and the slowdown with > 6 clients. I
don't know exactly where the single-client slowdown comes from, although
I'm not surprised that the bookkeeping with slots etc. has some
overhead. Hopefully that overhead can be made smaller, if not eliminated
completely.

The slowdown with > 6 clients seems to be spinlock contention. I ran
"perf record" for a short duration during one of the ramdrive tests, and
saw the spinlock acquisition in ReserveXLogInsertLocation() consuming
about 80% of all CPU time.

I then hacked the patch a little bit, removing the check in XLogInsert
for fullPageWrites and forcePageWrites, as well as the check for "did a
checkpoint just happen" (see
http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch).
My hunch was that accessing those fields causes cache line stealing,
making the cache line containing the spinlock even more busy. That hunch
seems to be correct; when I reran the tests with that patch, the
performance with high # of clients became much better. See the results
with "xloginsert-scale-13.patch". With that change, the single-client
case is still about 10% slower than current code, but the performance
with > 8 clients is almost as good as with current code. Between 2-6
clients, the patch is a win.
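
As an aside, the cache line stealing effect itself is easy to reproduce
in isolation. Here's a generic stand-alone demo (plain pthreads, not
PostgreSQL code, and the 64-byte line size is an assumption): two
threads increment counters they never share, yet slow each other down
badly when the counters sit on the same cache line.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NLOOPS 50000000L

/* Two counters packed into one cache line (false sharing)... */
static volatile uint64_t together[2] __attribute__((aligned(64)));

/* ...and two counters on separate cache lines. */
static volatile uint64_t apart[2][8] __attribute__((aligned(64)));

static void *
bump(void *p)
{
	volatile uint64_t *ctr = p;
	long		i;

	for (i = 0; i < NLOOPS; i++)
		(*ctr)++;
	return NULL;
}

static double
run_pair(volatile uint64_t *c1, volatile uint64_t *c2)
{
	pthread_t	t1, t2;
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	pthread_create(&t1, NULL, bump, (void *) c1);
	pthread_create(&t2, NULL, bump, (void *) c2);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);
	return (end.tv_sec - start.tv_sec) +
		(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Build with: gcc -O2 -pthread falseshare.c */
int
main(void)
{
	printf("same cache line:      %.2f s\n",
		   run_pair(&together[0], &together[1]));
	printf("separate cache lines: %.2f s\n",
		   run_pair(&apart[0][0], &apart[1][0]));
	return 0;
}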

The hack that restored the > 6 clients performance to current level is
not safe, of course, so I'll have to figure out a safe way to get that
effect. Also, even when the performance is as good as current code, it's
not good to spend all the CPU time spinning on the spinlock. I didn't
measure the CPU usage with current code, but I would expect it to be
sleeping, not spinning, when not doing useful work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#77Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Janes (#75)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Tue, Mar 13, 2012 at 1:59 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Thanks!

BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm
planning to do a longer run with his test script - I only run it for about
1000 iterations - to see if I can reproduce the PANIC with both the earlier
patch version he tested, and this new one.

Hi Heikki,

I've run the v12 patch for 17,489 rounds of crash and recovery, and
detected no assertion failures or other problems.

With the v12 patch, I could no longer reproduce the assertion failure either.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#78Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#76)
1 attachment(s)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 12.03.2012 21:33, I wrote:

The slowdown with > 6 clients seems to be spinlock contention. I ran
"perf record" for a short duration during one of the ramdrive tests, and
saw the spinlock acquisition in ReserveXLogInsertLocation() consuming
about 80% of all CPU time.

I then hacked the patch a little bit, removing the check in XLogInsert
for fullPageWrites and forcePageWrites, as well as the check for "did a
checkpoint just happen" (see
http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch).
My hunch was that accessing those fields causes cache line stealing,
making the cache line containing the spinlock even more busy. That hunch
seems to be correct; when I reran the tests with that patch, the
performance with high # of clients became much better. See the results
with "xloginsert-scale-13.patch". With that change, the single-client
case is still about 10% slower than current code, but the performance
with > 8 clients is almost as good as with current code. Between 2-6
clients, the patch is a win.

The hack that restored the > 6 clients performance to current level is
not safe, of course, so I'll have to figure out a safe way to get that
effect.

I managed to do that in a safe way, and also found a couple of other
small changes that made a big difference to performance. I found out
that the number of cache misses while holding the spinlock matter a lot,
which in hindsight isn't surprising. I aligned the XLogCtlInsert struct
on a 64-byte boundary, so that the new spinlock and the fields it
protects all fit on the same cache line (on boxes with cache line size
>= 64 bytes, anyway). I also changed the logic of the insertion slots
slightly, so that when a slot is reserved, while holding the spinlock,
it doesn't need to be immediately updated. That avoids one cache miss,
as the cache line holding the slot doesn't need to be accessed while
holding the spinlock. And to reduce contention on cache lines when an
insertion is finished and the insertion slot is updated, I shuffled the
slots so that slots that are logically adjacent are spaced apart in memory.
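
Concretely, the two tricks look something like this. This is a
stand-alone sketch of the techniques used in the attached patch, not
the patch code itself: ALIGN_UP is a local stand-in for PostgreSQL's
TYPEALIGN macro, and malloc() stands in for the shared memory
allocation. The SlotNoToIdx() mapping is the one from the patch.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64

/* Round a pointer value up to the next power-of-2 boundary. */
#define ALIGN_UP(boundary, val) \
	(((uintptr_t) (val) + ((boundary) - 1)) & ~((uintptr_t) (boundary) - 1))

/*
 * Swap the two low-order nibbles of the slot number, so that logically
 * adjacent slots land 16 array elements apart.
 */
static inline int
SlotNoToIdx(int slotno)
{
	return (slotno & 0xffffff00) |
		((slotno & 0xf0) >> 4) |
		((slotno & 0x0f) << 4);
}

int
main(void)
{
	/*
	 * Over-allocate by the alignment, then shift the start pointer up,
	 * so the struct begins on a cache line boundary even though
	 * malloc() only guarantees weaker alignment.
	 */
	void	   *raw = malloc(1024 + CACHE_LINE_SIZE);
	void	   *aligned = (void *) ALIGN_UP(CACHE_LINE_SIZE, raw);
	int			i;

	printf("raw %p -> aligned %p\n", raw, aligned);

	/* Slots 0..4 map to array indexes 0, 16, 32, 48, 64. */
	for (i = 0; i < 5; i++)
		printf("slot %d -> idx %d\n", i, SlotNoToIdx(i));

	free(raw);			/* free the original pointer, not the aligned one */
	return 0;
}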

When all those changes are put together, the patched version now beats
or matches the current code in the RAM drive tests, except that the
single-client case is still about 10% slower. I added the new test
results at http://community.enterprisedb.com/xloginsert-scale-tests/,
and a new version of the patch is attached.

If all of this sounds pessimistic, let me remind you that I've been testing
the cases where I'm seeing regressions, so that I can fix them, and not
trying to demonstrate how good this is in the best case. These tests
have been with very small WAL records, with only 16 bytes of payload.
Larger WAL records benefit more. I also ran one test with larger, 100
byte WAL records, and put the results up on that site.

Also, even when the performance is as good as current code, it's
not good to spend all the CPU time spinning on the spinlock. I didn't
measure the CPU usage with current code, but I would expect it to be
sleeping, not spinning, when not doing useful work.

This is still an issue.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xloginsert-scale-18.patchtext/x-diff; name=xloginsert-scale-18.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 261,274 **** XLogRecPtr	XactLastRecEnd = {0, 0};
   * (which is almost but not quite the same as a pointer to the most recent
   * CHECKPOINT record).	We update this from the shared-memory copy,
   * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
!  * hold the Insert lock).  See XLogInsert for details.	We are also allowed
!  * to update from XLogCtl->Insert.RedoRecPtr if we hold the info_lck;
   * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
   * InitXLOGAccess.
   */
  static XLogRecPtr RedoRecPtr;
  
  /*
   * RedoStartLSN points to the checkpoint's REDO location which is specified
   * in a backup label file, backup history file or control file. In standby
   * mode, XLOG streaming usually starts from the position where an invalid
--- 262,281 ----
   * (which is almost but not quite the same as a pointer to the most recent
   * CHECKPOINT record).	We update this from the shared-memory copy,
   * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
!  * hold the insertpos lock).  See XLogInsert for details.	We are also allowed
!  * to update from XLogCtl->RedoRecPtr if we hold the info_lck;
   * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
   * InitXLOGAccess.
   */
  static XLogRecPtr RedoRecPtr;
  
  /*
+  * doPageWrites is this backend's local copy of Insert->fullPageWrites ||
+  * Insert->forcePageWrites.  It is refreshed at every insertion.
+  */
+ static bool doPageWrites;
+ 
+ /*
   * RedoStartLSN points to the checkpoint's REDO location which is specified
   * in a backup label file, backup history file or control file. In standby
   * mode, XLOG streaming usually starts from the position where an invalid
***************
*** 300,309 **** static XLogRecPtr RedoStartLSN = {0, 0};
   * (protected by info_lck), but we don't need to cache any copies of it.
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  The other locks are held longer (potentially
!  * over I/O operations), so we use LWLocks for them.  These locks are:
   *
!  * WALInsertLock: must be held to insert a record into the WAL buffers.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
--- 307,321 ----
   * (protected by info_lck), but we don't need to cache any copies of it.
   *
   * info_lck is only held long enough to read/update the protected variables,
!  * so it's a plain spinlock.  insertpos_lck protects the current logical
!  * insert location, ie. the head of reserved WAL space.  The other locks are
!  * held longer (potentially over I/O operations), so we use LWLocks for them.
!  * These locks are:
   *
!  * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
!  * This is only held while initializing and changing the mapping. If the
!  * contents of the buffer being replaced haven't been written yet, the mapping
!  * lock is released while the write is done, and reacquired afterwards.
   *
   * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
   * XLogFlush).
***************
*** 315,320 **** static XLogRecPtr RedoStartLSN = {0, 0};
--- 327,399 ----
   * only one checkpointer at a time; currently, with all checkpoints done by
   * the checkpointer, this is just pro forma).
   *
+  *
+  * Inserting a new WAL record is a two-step process:
+  *
+  * 1. Reserve the right amount of space from the WAL, and the next insertion
+  *    slot to advertise that the insertion is in progress. The current head
+  *    of reserved space is kept in Insert->CurrPos, and is protected by
+  *    insertpos_lck. Try to keep this section as short as possible,
+  *    insertpos_lck can be heavily contended on a busy system
+  *
+  * 2. Copy the record to the reserved WAL space. This involves finding the
+  *    correct WAL buffer containing the reserved space, and copying the
+  *    record in place. This can be done concurrently in multiple processes.
+  *
+  * To allow as much parallelism as possible for step 2, we try hard to avoid
+  * lock contention in that code path. Each insertion is assigned its own
+  * "XLog insertion slot", which is used to advertise the position the backend
+  * is writing to. The slot is marked as in-use in step 1, while holding
+  * insertpos_lck, by setting the position field in the slot. When the backend
+  * is finished with the insertion, it clears its slot. Each slot is protected
+  * by a separate spinlock, to keep contention minimal.
+  *
+  * The insertion slots also provide a mechanism to wait for an insertion to
+  * finish. This is important when an XLOG page is written out - any
+  * in-progress insertions must finish copying data to the page first, or the
+  * on-disk copy will be incomplete. Waiting is done by the
+  * WaitXLogInsertionsToFinish() function. It adds the current process to the
+  * waiting queue in the slot it needs to wait for, and when that insertion
+  * finishes (or proceeds to the next page, at least), the inserter wakes up
+  * the process.
+  *
+  * The insertion slots form a ring. Insert->nextslot points to the next free
+  * slot, and Insert->lastslot points to the last slot that's still in use.
+  * lastslot can lag behind reality by any number of slots, as long as nextslot
+  * doesn't catch up with it. lastslot is advanced by
+  * WaitXLogInsertionsToFinish(), and is protected by WALInsertTailLock.
+  * nextslot is advanced in ReserveXLogInsertLocation() and is protected by
+  * insertpos_lck. Both slot variables are 32-bit integers, so that they can
+  * be read atomically without holding a lock. nextslot == lastslot means that
+  * all the slots are empty.
+  *
+  * Whenever the ring fills up, ie. when nextslot wraps around and catches up
+  * with lastslot, ReserveXLogInsertLocation() has to wait for the oldest
+  * insertion to finish and advance lastslot, to make room for the new
+  * insertion. This is also handled by WaitXLogInsertionsToFinish().
+  *
+  *
+  * Deadlock analysis
+  * -----------------
+  *
+  * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+  * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+  * finish (or at least advance to next uninitialized page), while you're
+  * holding WALWriteLock. That would be bad, because the backend you're waiting
+  * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+  * you'd get deadlock.
+  *
+  * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+  * it's called with a location that's known to be already allocated in the WAL
+  * buffers. Calling it with the position of a record you've already inserted
+  * satisfies that condition, so the common pattern:
+  *
+  *   recptr = XLogInsert(...)
+  *   XLogFlush(recptr)
+  *
+  * is safe. It can't get stuck, because an insertion to a WAL page that's
+  * already initialized in cache can always proceed without waiting on a lock.
+  *
   *----------
   */
  
***************
*** 335,344 **** typedef struct XLogwrtResult
   */
  typedef struct XLogCtlInsert
  {
! 	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
! 	int			curridx;		/* current block index in cache */
! 	XLogPageHeader currpage;	/* points to header of block in cache */
! 	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
--- 414,435 ----
   */
  typedef struct XLogCtlInsert
  {
! 	slock_t		insertpos_lck;	/* protects all the fields in this struct
! 								 * (except lastslot). */
! 
! 	int32		nextslot;		/* next insertion slot to use */
! 	int32		lastslot;		/* last in-use insertion slot (protected by
! 								 * WALInsertTailLock) */
! 
! 	/*
! 	 * CurrPos is the very tip of the reserved WAL space at the moment.
! 	 * The next record will be inserted there (or somewhere after it if
! 	 * there's not enough space on the current page).  PrevRecord points to
! 	 * the beginning of the last record already reserved.  It might not be
! 	 * fully copied into place yet, but we know its exact location already.
! 	 */
! 	XLogRecPtr	CurrPos;
! 	XLogRecPtr	PrevRecord;
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
***************
*** 364,369 **** typedef struct XLogCtlInsert
--- 455,471 ----
  } XLogCtlInsert;
  
  /*
+  * We force XLogCtlInsert to be aligned at a 64-byte boundary, to ensure
+  * that the insertpos_lck spinlock and the fields it protects are all on the
+  * same cache line (assuming the cache line size is at least 64 bytes).
+  * Testing shows that that makes a big difference in performance. If the
+  * struct grows larger than 64 bytes, this needs to be enlarged, too, but
+  * then it won't fit on a single cache line on systems with smaller cache
+  * line size anyway.
+  */
+ #define XLOGCTLINSERT_ALIGNMENT (64)
+ 
+ /*
   * Shared state data for XLogWrite/XLogFlush.
   */
  typedef struct XLogCtlWrite
***************
*** 372,387 **** typedef struct XLogCtlWrite
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
--- 474,512 ----
  	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
  } XLogCtlWrite;
  
+ 
+ /*
+  * Slots for in-progress WAL insertions.
+  */
+ typedef struct
+ {
+ 	slock_t		lck;
+ 	bool		finished;
+ 	XLogRecPtr	CurrPos;	/* current position this process is inserting to */
+ 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
+ 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
+ } XLogInsertSlot;
+ 
+ #define NumXLogInsertSlots	512
+ 
  /*
   * Total shared-memory state for XLOG.
   */
  typedef struct XLogCtlData
  {
! 	/*
! 	 * Note: Insert must be the first field in the struct or it won't be
! 	 * aligned to a cache-line boundary like we want it to be.
! 	 *
! 	 * Protected by insertpos_lck.
! 	 */
  	XLogCtlInsert Insert;
  
+ 	XLogInsertSlot XLogInsertSlots[NumXLogInsertSlots];
+ 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
+ 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
***************
*** 398,406 **** typedef struct XLogCtlData
  	XLogwrtResult LogwrtResult;
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
--- 523,540 ----
  	XLogwrtResult LogwrtResult;
  
  	/*
+ 	 * To change curridx and the identity of a buffer, you need to hold
+ 	 * WALBufMappingLock.  To change the identity of a buffer that's still
+ 	 * dirty, the old page needs to be written out first, and for that you
+ 	 * need WALWriteLock, and you need to ensure that there's no in-progress
+ 	 * insertions to the page by calling WaitXLogInsertionsToFinish().
+ 	 */
+ 	int			curridx;		/* latest initialized block index in cache */
+ 
+ 	/*
  	 * These values do not change after startup, although the pointed-to pages
  	 * and xlblocks values certainly do.  Permission to read/write the pages
! 	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
***************
*** 478,505 **** static XLogCtlData *XLogCtl = NULL;
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Macros for managing XLogInsert state.  In most cases, the calling routine
!  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
!  * so these are passed as parameters instead of being fetched via XLogCtl.
   */
  
! /* Free space remaining in the current xlog page buffer */
! #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
! /* Construct XLogRecPtr value for current insertion point */
! #define INSERT_RECPTR(recptr,Insert,curridx)  \
! 	( \
! 	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
! 	  (recptr).xrecoff = \
! 		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
! 	)
  
! #define PrevBufIdx(idx)		\
! 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
  
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
--- 612,660 ----
  static ControlFileData *ControlFile = NULL;
  
  /*
!  * Calculate the amount of space left on the page after 'endptr'.
!  * Beware multiple evaluation!
   */
+ #define INSERT_FREESPACE(endptr)	\
+ 	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
  
! /*
!  * Macros to advance to next buffer index and insertion slot.
!  */
! #define NextBufIdx(idx)		\
! 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
  
! #define NextSlotNo(idx)		(((idx) + 1) % NumXLogInsertSlots)
  
! /*
!  * The mapping from a slot number to an element in the XLogInsertSlots array
!  * is not a simple linear mapping as you would expect. We want to spread out
!  * the accesses to the slots, to reduce the contention on the cache lines of
!  * logically adjacent slots.
!  *
!  * To do that, we swap the four low-order bits with the next four bits
!  * (bit positions written out, most significant first):
!  * 123456789 ->  167892345
!  */
! static inline int
! SlotNoToIdx(int slotno)
! {
! 	return (slotno & 0xffffff00) | (slotno & 0xf0) >> 4 | (slotno & 0x0f) << 4;
! }
  
! 
! /*
!  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
!  * would hold if it was in cache, the page containing 'recptr'.
!  *
!  * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
!  * page is taken to mean the previous page.
!  */
! #define XLogRecPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
! 
! #define XLogRecEndPtrToBufIdx(recptr)	\
! 	((((((uint64) (recptr).xlogid * (uint64) XLogFileSize) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
  
  /*
   * Private, possibly out-of-date copy of shared LogwrtResult.
***************
*** 625,633 **** static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static bool AdvanceXLInsertBuffer(bool new_segment);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
--- 780,788 ----
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
! static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
  static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
! static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
***************
*** 674,679 **** static bool read_backup_label(XLogRecPtr *checkPointLoc,
--- 829,849 ----
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
+ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+ 				  XLogRecord *rechdr,
+ 				  XLogRecData *rdata, pg_crc32 rdata_crc,
+ 				  int slotno,
+ 				  XLogRecPtr StartPos, XLogRecPtr EndPos);
+ static bool ReserveXLogInsertLocation(int size,
+ 						  bool isLogSwitch,
+ 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
+ 						  XLogRecPtr *EndPos_p,
+ 						  int *myslotno_p, bool *updrqst_p);
+ static void UpdateSlotCurrPos(int myslotno, XLogRecPtr CurrPos, bool finished);
+ static void	ReuseOldSlots(void);
+ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+ static char *GetXLogBuffer(int slotno, XLogRecPtr ptr);
+ 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 693,705 **** static int	get_sync_bit(int method);
  XLogRecPtr
  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  {
- 	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecord *record;
- 	XLogContRecord *contrecord;
- 	XLogRecPtr	RecPtr;
- 	XLogRecPtr	WriteRqst;
- 	uint32		freespace;
- 	int			curridx;
  	XLogRecData *rdt;
  	XLogRecData *rdt_lastnormal;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
--- 863,868 ----
***************
*** 714,722 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  				write_len;
  	unsigned	i;
  	bool		updrqst;
- 	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
--- 877,889 ----
  				write_len;
  	unsigned	i;
  	bool		updrqst;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
  	uint8		info_orig = info;
+ 	XLogRecord	rechdr;
+ 	XLogRecPtr	PrevRecord;
+ 	XLogRecPtr	StartPos;
+ 	XLogRecPtr	EndPos;
+ 	int			myslotno;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 734,742 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	 */
  	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		RecPtr.xlogid = 0;
! 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return RecPtr;
  	}
  
  	/*
--- 901,909 ----
  	 */
  	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
  	{
! 		EndPos.xlogid = 0;
! 		EndPos.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
! 		return EndPos;
  	}
  
  	/*
***************
*** 760,773 **** begin:;
  		dtbuf_bkp[i] = false;
  	}
  
- 	/*
- 	 * Decide if we need to do full-page writes in this XLOG record: true if
- 	 * full_page_writes is on or we have a PITR request for it.  Since we
- 	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
- 	 * could change under us, but we'll recheck them once we have the lock.
- 	 */
- 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
- 
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 927,932 ----
***************
*** 903,1035 **** begin:;
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	START_CRIT_SECTION();
  
! 	/* Now wait to get insert lock */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
! 	 * back and recompute everything.  This can only happen just after a
! 	 * checkpoint, so it's better to be slow in this case and fast otherwise.
! 	 *
! 	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
! 	 * affect the contents of the XLOG record, so we'll update our local copy
! 	 * but not force a recomputation.
  	 */
! 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
  	{
! 		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
! 		RedoRecPtr = Insert->RedoRecPtr;
  
! 		if (doPageWrites)
  		{
! 			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 			{
! 				if (dtbuf[i] == InvalidBuffer)
! 					continue;
! 				if (dtbuf_bkp[i] == false &&
! 					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
! 				{
! 					/*
! 					 * Oops, this buffer now needs to be backed up, but we
! 					 * didn't think so above.  Start over.
! 					 */
! 					LWLockRelease(WALInsertLock);
! 					END_CRIT_SECTION();
! 					rdt_lastnormal->next = NULL;
! 					info = info_orig;
! 					goto begin;
! 				}
! 			}
  		}
  	}
! 
! 	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
! 	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
! 		/* Oops, must redo it with full-page data. */
! 		LWLockRelease(WALInsertLock);
! 		END_CRIT_SECTION();
! 		rdt_lastnormal->next = NULL;
! 		info = info_orig;
! 		goto begin;
  	}
  
  	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
  	 */
! 	updrqst = false;
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
  	{
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
  
! 	/* Compute record's XLOG location */
! 	curridx = Insert->curridx;
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
! 	 * segment, we need not insert it (and don't want to because we'd like
! 	 * consecutive switch requests to be no-ops).  Instead, make sure
! 	 * everything is written and flushed through the end of the prior segment,
! 	 * and return the prior segment's end address.
  	 */
! 	if (isLogSwitch &&
! 		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  	{
! 		/* We can release insert lock immediately */
! 		LWLockRelease(WALInsertLock);
! 
! 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
! 		if (RecPtr.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			RecPtr.xlogid -= 1;
! 			RecPtr.xrecoff = XLogFileSize;
! 		}
! 
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! 		LogwrtResult = XLogCtl->LogwrtResult;
! 		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
! 		{
! 			XLogwrtRqst FlushRqst;
! 
! 			FlushRqst.Write = RecPtr;
! 			FlushRqst.Flush = RecPtr;
! 			XLogWrite(FlushRqst, false, false);
! 		}
! 		LWLockRelease(WALWriteLock);
  
! 		END_CRIT_SECTION();
  
! 		return RecPtr;
  	}
  
! 	/* Insert record header */
! 
! 	record = (XLogRecord *) Insert->currpos;
! 	record->xl_prev = Insert->PrevRecord;
! 	record->xl_xid = GetCurrentTransactionIdIfAny();
! 	record->xl_tot_len = SizeOfXLogRecord + write_len;
! 	record->xl_len = len;		/* doesn't include backup blocks */
! 	record->xl_info = info;
! 	record->xl_rmid = rmid;
! 
! 	/* Now we can finish computing the record's CRC */
! 	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
--- 1062,1167 ----
  	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
  		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
  
! 	/* Construct record header. */
! 	MemSet(&rechdr, 0, sizeof(rechdr));
! 	/* rechdr.xl_prev is set later */
! 	rechdr.xl_xid = GetCurrentTransactionIdIfAny();
! 	rechdr.xl_tot_len = SizeOfXLogRecord + write_len;
! 	rechdr.xl_len = len;		/* doesn't include backup blocks */
! 	rechdr.xl_info = info;
! 	rechdr.xl_rmid = rmid;
  
! 	START_CRIT_SECTION();
  
  	/*
! 	 * Try to reserve space for the record from the WAL.
  	 */
! 	if (!ReserveXLogInsertLocation(write_len, isLogSwitch,
! 								   &PrevRecord, &StartPos, &EndPos,
! 								   &myslotno, &updrqst))
  	{
! 		/*
! 		 * Reservation failed. This could be because the record was an
! 		 * XLOG_SWITCH, and we're exactly at the start of a segment. In that
! 		 * case we need not insert it (and don't want to because we'd like
! 		 * consecutive switch requests to be no-ops).  Instead, make sure
! 		 * everything is written and flushed through the end of the prior
! 		 * segment, and return the prior segment's end address.
! 		 *
! 		 * The other reason for failure is that someone changed RedoRecPtr
! 		 * or forcePageWrites after we had constructed our WAL record. In
! 		 * that case we need to redo it with full-page data.
! 		 */
! 		END_CRIT_SECTION();
  
! 		if (isLogSwitch && !XLogRecPtrIsInvalid(EndPos))
  		{
! 			XLogFlush(EndPos);
! 			return EndPos;
! 		}
! 		else
! 		{
! 			rdt_lastnormal->next = NULL;
! 			info = info_orig;
! 			goto begin;
  		}
  	}
! 	else
  	{
! 		/*
! 		 * Reservation succeeded.  Finish the record header by setting
! 		 * prev-link (now that we know it), and finish computing the record's
! 		 * CRC (in CopyXLogRecordToWAL).  Then copy the record to the space
! 		 * we reserved.
! 		 */
! 		rechdr.xl_prev = PrevRecord;
! 		CopyXLogRecordToWAL(write_len, isLogSwitch, &rechdr,
! 							rdata, rdata_crc, myslotno, StartPos, EndPos);
  	}
+ 	END_CRIT_SECTION();
  
  	/*
! 	 * Update shared LogwrtRqst.Write, if we crossed a page boundary.
  	 */
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
! 			xlogctl->LogwrtRqst.Write = EndPos;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
  	/*
! 	 * If this was an XLOG_SWITCH record, flush the record and the empty
! 	 * padding space that fills the rest of the segment, and perform
! 	 * end-of-segment actions (eg, notifying archiver).
  	 */
! 	if (isLogSwitch)
  	{
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		XLogFlush(EndPos);
  
! 		/*
! 		 * Even though we reserved the rest of the segment for ourselves, which is
! 		 * reflected in EndPos, we return a pointer to just the end of the
! 		 * xlog-switch record.
! 		 */
! 		EndPos.xlogid = StartPos.xlogid;
! 		EndPos.xrecoff = StartPos.xrecoff + SizeOfXLogRecord;
  	}
  
! 	/*
! 	 * Update our global variables
! 	 */
! 	ProcLastRecPtr = StartPos;
! 	XactLastRecEnd = EndPos;
  
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
***************
*** 1038,1219 **** begin:;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 RecPtr.xlogid, RecPtr.xrecoff);
! 		xlog_outrec(&buf, record);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
- 	/* Record begin of record in appropriate places */
- 	ProcLastRecPtr = RecPtr;
- 	Insert->PrevRecord = RecPtr;
- 
- 	Insert->currpos += SizeOfXLogRecord;
- 	freespace -= SizeOfXLogRecord;
- 
  	/*
! 	 * Append the data, including backup blocks if any
  	 */
! 	while (write_len)
! 	{
! 		while (rdata->data == NULL)
! 			rdata = rdata->next;
  
! 		if (freespace > 0)
  		{
! 			if (rdata->len > freespace)
  			{
! 				memcpy(Insert->currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				write_len -= freespace;
! 			}
! 			else
! 			{
! 				memcpy(Insert->currpos, rdata->data, rdata->len);
! 				freespace -= rdata->len;
! 				write_len -= rdata->len;
! 				Insert->currpos += rdata->len;
! 				rdata = rdata->next;
! 				continue;
  			}
  		}
  
! 		/* Use next buffer */
! 		updrqst = AdvanceXLInsertBuffer(false);
! 		curridx = Insert->curridx;
! 		/* Insert cont-record header */
! 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 		contrecord = (XLogContRecord *) Insert->currpos;
! 		contrecord->xl_rem_len = write_len;
! 		Insert->currpos += SizeOfXLogContRecord;
! 		freespace = INSERT_FREESPACE(Insert);
  	}
  
! 	/* Ensure next record will be properly aligned */
! 	Insert->currpos = (char *) Insert->currpage +
! 		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
! 	freespace = INSERT_FREESPACE(Insert);
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will be
! 	 * stored as LSN for changed data pages...
  	 */
! 	INSERT_RECPTR(RecPtr, Insert, curridx);
  
  	/*
! 	 * If the record is an XLOG_SWITCH, we must now write and flush all the
! 	 * existing data, and then forcibly advance to the start of the next
! 	 * segment.  It's not good to do this I/O while holding the insert lock,
! 	 * but there seems too much risk of confusion if we try to release the
! 	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
! 	 * operation anyway...
  	 */
! 	if (isLogSwitch)
! 	{
! 		XLogwrtRqst FlushRqst;
! 		XLogRecPtr	OldSegEnd;
  
! 		TRACE_POSTGRESQL_XLOG_SWITCH();
  
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  
  		/*
! 		 * Flush through the end of the page containing XLOG_SWITCH, and
! 		 * perform end-of-segment actions (eg, notifying archiver).
  		 */
! 		WriteRqst = XLogCtl->xlblocks[curridx];
! 		FlushRqst.Write = WriteRqst;
! 		FlushRqst.Flush = WriteRqst;
! 		XLogWrite(FlushRqst, false, true);
! 
! 		/* Set up the next buffer as first page of next segment */
! 		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
! 		(void) AdvanceXLInsertBuffer(true);
! 
! 		/* There should be no unwritten data */
! 		curridx = Insert->curridx;
! 		Assert(curridx == XLogCtl->Write.curridx);
! 
! 		/* Compute end address of old segment */
! 		OldSegEnd = XLogCtl->xlblocks[curridx];
! 		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
! 		if (OldSegEnd.xrecoff == 0)
! 		{
! 			/* crossing a logid boundary */
! 			OldSegEnd.xlogid -= 1;
! 			OldSegEnd.xrecoff = XLogFileSize;
! 		}
  
! 		/* Make it look like we've written and synced all of old segment */
! 		LogwrtResult.Write = OldSegEnd;
! 		LogwrtResult.Flush = OldSegEnd;
  
  		/*
! 		 * Update shared-memory status --- this code should match XLogWrite
  		 */
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			SpinLockAcquire(&xlogctl->info_lck);
! 			xlogctl->LogwrtResult = LogwrtResult;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
! 				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
! 			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
! 				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
! 			SpinLockRelease(&xlogctl->info_lck);
! 		}
  
! 		LWLockRelease(WALWriteLock);
  
! 		updrqst = false;		/* done already */
  	}
  	else
  	{
! 		/* normal case, ie not xlog switch */
  
! 		/* Need to update shared LogwrtRqst if some block was filled up */
! 		if (freespace < SizeOfXLogRecord)
  		{
! 			/* curridx is filled and available for writing out */
  			updrqst = true;
  		}
! 		else
! 		{
! 			/* if updrqst already set, write through end of previous buf */
! 			curridx = PrevBufIdx(curridx);
! 		}
! 		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
! 	LWLockRelease(WALInsertLock);
  
! 	if (updrqst)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		/* advance global request to include new block(s) */
! 		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
! 			xlogctl->LogwrtRqst.Write = WriteRqst;
! 		/* update local result copy while I have the chance */
! 		LogwrtResult = xlogctl->LogwrtResult;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
! 	XactLastRecEnd = RecPtr;
  
! 	END_CRIT_SECTION();
  
! 	return RecPtr;
  }
  
  /*
--- 1170,1954 ----
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
! 						 EndPos.xlogid, EndPos.xrecoff);
! 		xlog_outrec(&buf, &rechdr);
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rmid].rm_desc(&buf, rechdr.xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
  	/*
! 	 * The recptr I return is the beginning of the *next* record. This will
! 	 * be stored as LSN for changed data pages...
  	 */
! 	return EndPos;
! }
! 
! /*
!  * Subroutine of XLogInsert.  Copies a WAL record to an already-reserved
!  * area in the WAL.
!  */
! static void
! CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr,
! 					XLogRecData *rdata, pg_crc32 rdata_crc,
! 					int myslotno,
! 					XLogRecPtr StartPos, XLogRecPtr EndPos)
! {
! 	char	   *currpos;
! 	XLogRecord *record;
! 	int			freespace;
! 	int			written;
! 	XLogRecPtr	CurrPos;
! 
! 	/* Get the right WAL page to start inserting to */
! 	CurrPos = StartPos;
! 	currpos = GetXLogBuffer(myslotno, CurrPos);
! 
! 	/* Copy the record header in place, and finish calculating CRC */
! 	record = (XLogRecord *) currpos;
! 	memcpy(record, rechdr, sizeof(XLogRecord));
! 	COMP_CRC32(rdata_crc, currpos + sizeof(pg_crc32),
! 			   SizeOfXLogRecord - sizeof(pg_crc32));
! 	FIN_CRC32(rdata_crc);
! 	record->xl_crc = rdata_crc;
! 
! 	currpos += SizeOfXLogRecord;
! 	XLByteAdvance(CurrPos, SizeOfXLogRecord);
  
! 	freespace = INSERT_FREESPACE(CurrPos);
! 
! 	if (!isLogSwitch)
! 	{
! 		/* Copy record data */
! 		written = 0;
! 		while (rdata != NULL)
  		{
! 			while (rdata->len > freespace)
  			{
! 				/*
! 				 * Write what fits on this page, then write the continuation
! 				 * record, and continue on the next page.
! 				 */
! 				XLogContRecord *contrecord;
! 
! 				memcpy(currpos, rdata->data, freespace);
  				rdata->data += freespace;
  				rdata->len -= freespace;
! 				written += freespace;
! 				XLByteAdvance(CurrPos, freespace);
! 
! 				/*
! 				 * Get pointer to beginning of next page, and set the
! 				 * XLP_FIRST_IS_CONTRECORD flag in the page header.
! 				 *
! 				 * It's safe to set the contrecord flag without a lock on the
! 				 * page. All the other flags are set in AdvanceXLInsertBuffer,
! 				 * and we're the only backend that needs to set the contrecord
! 				 * flag.
! 				 */
! 				currpos = GetXLogBuffer(myslotno, CurrPos);
! 				((XLogPageHeader) currpos)->xlp_info |= XLP_FIRST_IS_CONTRECORD;
! 
! 				/* skip over the page header, and write continuation record */
! 				if (CurrPos.xrecoff % XLogSegSize == 0)
! 				{
! 					CurrPos.xrecoff += SizeOfXLogLongPHD;
! 					currpos += SizeOfXLogLongPHD;
! 				}
! 				else
! 				{
! 					CurrPos.xrecoff += SizeOfXLogShortPHD;
! 					currpos += SizeOfXLogShortPHD;
! 				}
! 				contrecord = (XLogContRecord *) currpos;
! 				contrecord->xl_rem_len = write_len - written;
! 
! 				currpos += SizeOfXLogContRecord;
! 				CurrPos.xrecoff += SizeOfXLogContRecord;
! 
! 				freespace = INSERT_FREESPACE(CurrPos);
  			}
+ 
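+ 			/* The rest of this rdata entry fits on the current page. */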
+ 			memcpy(currpos, rdata->data, rdata->len);
+ 			currpos += rdata->len;
+ 			XLByteAdvance(CurrPos, rdata->len);
+ 			freespace -= rdata->len;
+ 			written += rdata->len;
+ 
+ 			rdata = rdata->next;
  		}
+ 		Assert(written == write_len);
  
! 		/* Align the end position, so that the next record starts aligned */
! 		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
! 		if (CurrPos.xrecoff >= XLogFileSize)
! 		{
! 			/* crossed a logid boundary */
! 			CurrPos.xlogid += 1;
! 			CurrPos.xrecoff = 0;
! 		}
! 
! 		if (!XLByteEQ(CurrPos, EndPos))
! 			elog(PANIC, "space reserved for WAL record does not match what was written");
  	}
+ 	else
+ 	{
+ 		/* An xlog-switch record doesn't contain any data besides the header */
+ 		Assert(write_len == 0);
+ 
+ 		/*
+ 		 * An xlog-switch record consumes all the remaining space on the
+ 		 * WAL segment. We have already reserved it for ourselves, but we
+ 		 * still need to make sure it's been allocated and zeroed in the WAL
+ 		 * buffers so that when the caller (or someone else) does XLogWrite(),
+ 		 * it can really write out all the zeros.
+ 		 *
+ 		 * We do this one page at a time, to make sure we don't deadlock
+ 		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
+ 		 */
+ 		Assert(EndPos.xrecoff % XLogSegSize == 0);
+ 
+ 		/* Use up all the remaining space on the first page */
+ 		XLByteAdvance(CurrPos, freespace);
  
! 		while (XLByteLT(CurrPos, EndPos))
! 		{
! 			/*
! 			 * Like in the non-xlog-switch codepath, let others know that
! 			 * we're done writing up to the end of this page.
! 			 */
! 			UpdateSlotCurrPos(myslotno, CurrPos, false);
! 			/* initialize the next page (if not initialized already) */
! 			AdvanceXLInsertBuffer(CurrPos, false);
! 			XLByteAdvance(CurrPos, XLOG_BLCKSZ);
! 		}
! 	}
  
  	/*
! 	 * Done! Clear CurrPos in our slot to let others know that we're
! 	 * finished.
  	 */
! 	UpdateSlotCurrPos(myslotno, InvalidXLogRecPtr, true);
  
  	/*
! 	 * When we run out of insertion slots, the next inserter has to grab the
! 	 * WALInsertTailLock to clean up some old slots.  That stalls all new
! 	 * insertions. The WAL writer process cleans up old slots periodically,
! 	 * but on a busy system that might not be enough. So we try to clean up
! 	 * old ones every time we've gone through 1/4 of all the slots.
  	 */
! 	if (myslotno % (NumXLogInsertSlots / 4) == 0)
! 		ReuseOldSlots();
! }
  
! /*
!  * Reserves the right amount of space for a record of given size from the WAL.
!  * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
!  * its end+1, and *PrevRecord_p to the beginning of the previous record to set
!  * to the prev-link of the record header.
!  *
!  * A log-switch record is handled slightly differently. The rest of the
!  * segment will be reserved for this insertion, as indicated by the returned
!  * *EndPos_p value. However, if we are already at the beginning of the current
!  * segment, *EndPos_p is set to the current location without reserving
!  * any space, and the function returns false.
!  *
!  * *updrqst_p is set to true if this record ends on a different page than
!  * the previous one - the caller should update the shared LogwrtRqst value
!  * after it's done inserting the record in that case, so that the WAL page
!  * that filled up gets written out at the next convenient moment.
!  *
!  * While holding insertpos_lck, sets myslot->CurrPos to the starting
!  * position (or the end of the previous record, to be exact) to let others
!  * know that we're busy inserting to the reserved area. The caller must
!  * clear it when the insertion is finished.
!  *
!  * Returns true on success, or false if RedoRecPtr or forcePageWrites was
!  * changed. On failure, the shared state is not modified.
!  *
!  * This is the performance-critical part of XLogInsert that must be serialized
!  * across backends. The rest can happen mostly in parallel.
!  *
!  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
!  * where we actually copy the record to the reserved space.
!  */
! static bool
! ReserveXLogInsertLocation(int size,
! 						  bool isLogSwitch,
! 						  XLogRecPtr *PrevRecord_p, XLogRecPtr *StartPos_p,
! 						  XLogRecPtr *EndPos_p,
! 						  int *myslotno_p, bool *updrqst_p)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			freespace;
! 	XLogRecPtr	ptr;
! 	XLogRecPtr	StartPos;
! 	int32		nextslot;
! 	int32		lastslot;
! 	bool		updrqst = false;
! 	bool		didPageWrites = doPageWrites;
! 
! 	/* log-switch records should contain no data */
! 	Assert(!isLogSwitch || size == 0);
  
! 	size = SizeOfXLogRecord + size;
! 
! retry:
! 	SpinLockAcquire(&Insert->insertpos_lck);
  
+ 	doPageWrites = Insert->forcePageWrites || Insert->fullPageWrites;
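+ 
+ 	/*
+ 	 * Re-check the prerequisites for this record now that we hold the
+ 	 * spinlock: has a checkpoint moved RedoRecPtr, or have full-page writes
+ 	 * been turned on, since we constructed the rdata chain?
+ 	 */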
+ 	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr) ||
+ 		(!didPageWrites && doPageWrites))
+ 	{
  		/*
! 		 * Oops, a checkpoint just happened, or forcePageWrites was just
! 		 * turned on. Start XLogInsert() all over, because we might have to
! 		 * include more full-page images in the record.
  		 */
! 		RedoRecPtr = Insert->RedoRecPtr;
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		*EndPos_p = InvalidXLogRecPtr;
! 		return false;
! 	}
! 
! 	/*
! 	 * Reserve the next insertion slot for us.
! 	 *
! 	 * First check that the slot is not still in use. Modifications to
! 	 * lastslot are protected by WALInsertTailLock, but here we assume that
! 	 * reading an int32 is atomic. Another process might advance lastslot at
! 	 * the same time, but not past nextslot.
! 	 */
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	if (NextSlotNo(nextslot) == lastslot)
! 	{
! 		/*
! 		 * Oops, we've "caught our tail" and the oldest slot is still in use.
! 		 * Have to wait for it to become vacant, and retry.
! 		 */
! 		SpinLockRelease(&Insert->insertpos_lck);
! 		WaitXLogInsertionsToFinish(InvalidXLogRecPtr);
! 		goto retry;
! 	}
! 
! 	/*
! 	 * Got the slot. Now reserve the right amount of space from the WAL for
! 	 * our record.
! 	 */
! 	ptr = Insert->CurrPos;
! 	*PrevRecord_p = Insert->PrevRecord;
! 
! 	/*
! 	 * If there isn't enough space on the current XLOG page for a record
! 	 * header, advance to the next page (leaving the unused space as zeroes).
! 	 */
! 	freespace = INSERT_FREESPACE(ptr);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		XLByteAdvance(ptr, freespace);
  
! 		if (ptr.xrecoff % XLogSegSize == 0)
! 			ptr.xrecoff += SizeOfXLogLongPHD;
! 		else
! 			ptr.xrecoff += SizeOfXLogShortPHD;
! 		freespace = INSERT_FREESPACE(ptr);
! 		updrqst = true;
! 	}
  
+ 	/*
+ 	 * We are now at the starting position of our record. Now figure out how
+ 	 * the data will be split across the WAL pages, to calculate where the
+ 	 * record ends.
+ 	 */
+ 	StartPos = ptr;
+ 
+ 	if (isLogSwitch)
+ 	{
  		/*
! 		 * If the record is an XLOG_SWITCH, and we are exactly at the start
! 		 * of a segment, we need not insert it (and don't want to because
! 		 * we'd like consecutive switch requests to be no-ops). Otherwise the
! 		 * XLOG_SWITCH record should consume all the remaining space on the
! 		 * current segment.
  		 */
+ 		if ((ptr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
  		{
! 			/* We can release insert lock immediately */
! 			SpinLockRelease(&Insert->insertpos_lck);
  
! 			ptr.xrecoff -= SizeOfXLogLongPHD;
! 			if (ptr.xrecoff == 0)
! 			{
! 				/* crossing a logid boundary */
! 				ptr.xlogid -= 1;
! 				ptr.xrecoff = XLogFileSize;
! 			}
  
! 			*EndPos_p = ptr;
! 			*StartPos_p = ptr;
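! 			/* no insertion slot was reserved; the caller won't use this value */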
! 			*myslotno_p = 0;
  
! 			return false;
! 		}
! 		else
! 		{
! 			if (ptr.xrecoff % XLOG_SEG_SIZE != 0)
! 			{
! 				int segleft = XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE);
! 				ptr.xrecoff += segleft;
! 			}
! 			updrqst = true;
! 		}
  	}
  	else
  	{
! 		/*
! 		 * A normal record, ie. not xlog-switch.
! 		 *
! 		 * Calculate how the record will be laid out across WAL pages. The
! 		 * straightforward way to do this would be a loop that fills in the
! 		 * WAL pages one at a time, tracking how much of the size is still
! 		 * left.  That's how CopyXLogRecordToWAL() works when actually
! 		 * copying the data. However, we want to avoid looping, to keep this
! 		 * spinlock-protected block as short as possible even when the record
! 		 * spans many pages.
! 		 */
! 		int		sizeleft = size;
  
! 		if (sizeleft > freespace)
  		{
! 			int		pagesneeded;
! 			int		pagesleftonseg;
! 			int		fullpages;
! 
! 			/* First fill the first page with as much data as fits. */
! 			sizeleft -= freespace;
! 			ptr.xrecoff += freespace;
! 
! 			/* We're now positioned at the beginning of the next page */
! 			Assert(ptr.xrecoff % XLOG_BLCKSZ == 0);
! 			do
! 			{
! 				if (ptr.xrecoff >= XLogFileSize)
! 				{
! 					/* crossing a logid boundary */
! 					ptr.xlogid++;
! 					ptr.xrecoff = 0;
! 				}
! 
! 				/*
! 				 * If we're positioned at the beginning of a segment, take
! 				 * into account that the first page needs a long header.
! 				 */
! 				if (ptr.xrecoff % XLOG_SEG_SIZE == 0)
! 					sizeleft += (SizeOfXLogLongPHD - SizeOfXLogShortPHD);
! 
! 				/*
! 				 * Calculate the number of extra pages we need.  Each page
! 				 * will have a continuation record at the beginning.
! 				 *
! 				 * We do the calculation assuming that all the pages have a
! 				 * short header.  We don't know whether we have to cross to
! 				 * the next segment until we've calculated how many pages we
! 				 * need. If it turns out that we do, we'll fill up the current
! 				 * segment, and loop back to add the long page header to
! 				 * sizeleft, and continue calculation from there.
! 				 */
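! 				/*
! 				 * For illustration only, with hypothetical sizes: on 8 kB WAL
! 				 * pages with, say, 24 bytes of short page header plus
! 				 * continuation record, SpaceOnXLogPage would be 8168 bytes,
! 				 * so sizeleft = 20000 needs ceil(20000 / 8168) = 3 pages.
! 				 */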
! #define SpaceOnXLogPage	(XLOG_BLCKSZ - SizeOfXLogShortPHD - SizeOfXLogContRecord)
! 				pagesneeded = (sizeleft + SpaceOnXLogPage - 1) / SpaceOnXLogPage;
! 
! 				pagesleftonseg = (XLOG_SEG_SIZE - (ptr.xrecoff % XLOG_SEG_SIZE)) / XLOG_BLCKSZ;
! 
! 				if (pagesneeded <= pagesleftonseg)
! 				{
! 					/*
! 					 * Fits in this segment. Skip over all the full pages, to
! 					 * the last page that will (possibly) be only partially
! 					 * filled.
! 					 */
! 					fullpages = pagesneeded - 1;
! 				}
! 				else
! 				{
! 					/*
! 					 * Doesn't fit in this segment. Fit as much as does, and
! 					 * continue from next segment.
! 					 */
! 					fullpages = pagesleftonseg;
! 				}
! 
! 				sizeleft -= fullpages * SpaceOnXLogPage;
! 				ptr.xrecoff += fullpages * XLOG_BLCKSZ;
! 			} while (pagesneeded > pagesleftonseg);
! 
! 			/*
! 			 * We're now positioned at the beginning of the last page this
! 			 * record spans.  The rest should fit on this page.
! 			 *
! 			 * Note: We already took into account the long header above.
! 			 */
! 			ptr.xrecoff += SizeOfXLogShortPHD;
! 			ptr.xrecoff += SizeOfXLogContRecord;
! 
! 			Assert(sizeleft <= INSERT_FREESPACE(ptr) && sizeleft > 0);
! 
  			updrqst = true;
  		}
! 
! 		/*
! 		 * The rest fits on this page. Note that we mustn't use XLByteAdvance
! 		 * here, because if this record just fills up a logical log file,
! 		 * we want ptr.xlogid to point to the log file that was filled, not the
! 		 * next one. XLogWrite gets upset otherwise.
! 		 */
! 		ptr.xrecoff += sizeleft;
! 
! 		/* Align the end position, so that the next record starts aligned */
! 		ptr.xrecoff = MAXALIGN(ptr.xrecoff);
  	}
  
! 	/* Update the shared state before releasing the lock */
! 	Insert->CurrPos = ptr;
! 	Insert->PrevRecord = StartPos;
! 	Insert->nextslot = NextSlotNo(nextslot);
  
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! #ifdef RESERVE_XLOGINSERT_LOCATION_DEBUG
! 	elog(LOG, "reserved xlog: prev %X/%X, start %X/%X, end %X/%X (len %d)",
! 		 PrevRecord_p->xlogid, PrevRecord_p->xrecoff,
! 		 StartPos.xlogid, StartPos.xrecoff,
! 		 ptr.xlogid, ptr.xrecoff,
! 		 size);
! #endif
! 
! 	/*
! 	 * XLogWrite gets upset if given a pointer to the beginning of a logical
! 	 * log file. If the record ends exactly at the end of one, we must return
! 	 * the last byte of the previous log file + 1 as our ending position, not
! 	 * the first byte on the next log file.
! 	 */
! 	Assert(ptr.xrecoff != 0);
! 
! 	*EndPos_p = ptr;
! 	*StartPos_p = StartPos;
! 	*myslotno_p = nextslot;
! 	*updrqst_p = updrqst;
! 
! 	return true;
! }
! 
! /*
!  * Update the slot's CurrPos variable, and wake up anyone waiting on it.
!  */
! static void
! UpdateSlotCurrPos(int myslotno, XLogRecPtr CurrPos, bool finished)
! {
! 	volatile XLogInsertSlot *myslot = &XLogCtl->XLogInsertSlots[SlotNoToIdx(myslotno)];
! 	PGPROC	   *head;
! 
! 	/* Must not move backwards. */
! 	Assert(finished || XLByteLE(myslot->CurrPos, CurrPos));
! 
! 	/*
! 	 * The write-barrier ensures that the changes we made to the WAL pages
! 	 * are visible to everyone before the update of CurrPos.
! 	 *
! 	 * XXX: I'm not sure if this is necessary. Doesn't the spinlock
! 	 * acquire/release act as an implicit barrier?
! 	 */
! 	pg_write_barrier();
! 
! 	SpinLockAcquire(&myslot->lck);
! 	myslot->finished = finished;
! 	myslot->CurrPos = CurrPos;
! 	head = myslot->head;
! 	myslot->head = myslot->tail = NULL;
! 	SpinLockRelease(&myslot->lck);
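! 	/*
! 	 * Wake up everyone that was waiting on this slot, now that we've
! 	 * released the spinlock.
! 	 */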
! 	while (head != NULL)
  	{
! 		PGPROC *proc = head;
! 		head = proc->lwWaitLink;
! 		proc->lwWaitLink = NULL;
! 		proc->lwWaiting = false;
! 		PGSemaphoreUnlock(&proc->sem);
! 	}
! }
  
! /*
!  * Get a pointer to the right location in the WAL buffer containing the
!  * given XLogRecPtr.
!  *
!  * If the page is not initialized yet, it is initialized. That might require
!  * evicting an old dirty buffer from the buffer cache, which means I/O.
!  *
!  * The caller must ensure that the page containing the requested location
!  * isn't evicted yet, and won't be evicted, by holding onto an XLOG insertion
!  * slot with CurrPos set to something <= ptr. This function can update the
!  * advertised CurrPos up to 'ptr', so you are not allowed to access anything
!  * older than 'ptr' after calling this function.
!  */
! static char *
! GetXLogBuffer(int myslotno, XLogRecPtr ptr)
! {
! 	int			idx;
! 	XLogRecPtr	endptr;
! 	static uint32 cachedXlogid = 0;
! 	static uint32 cachedPage = 0;
! 	static char *cachedPos = NULL;
! 	XLogRecPtr	expectedEndPtr;
! 
! 	/*
! 	 * Fast path for the common case that we need to access again the same
! 	 * page as last time.
! 	 */
! 	if (ptr.xlogid == cachedXlogid && ptr.xrecoff / XLOG_BLCKSZ == cachedPage)
! 		return cachedPos + ptr.xrecoff % XLOG_BLCKSZ;
! 
! 	cachedXlogid = ptr.xlogid;
! 	cachedPage = ptr.xrecoff / XLOG_BLCKSZ;
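! 	/* (cachedPos itself is updated below, once the page is verified) */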
! 
! 	/*
! 	 * The XLog buffer cache is organized so that a page must always be
! 	 * loaded into a particular buffer.  That way we can easily calculate,
! 	 * from the XLogRecPtr alone, the buffer a given page must be loaded into.
! 	 */
! 	idx = XLogRecPtrToBufIdx(ptr);
! 
! 	/*
! 	 * See what page is loaded in the buffer at the moment. It could be the
! 	 * page we're looking for, or something older. It can't be anything
! 	 * newer - that would imply the page we're looking for has already
! 	 * been written out to disk, which shouldn't happen as long as the caller
! 	 * has set its slot's CurrPos correctly.
! 	 *
! 	 * However, we don't hold a lock while we read the value. If someone has
! 	 * just initialized the page, it's possible that we get a "torn read" of
! 	 * the XLogRecPtr, and see a bogus value. That's OK; we'll grab the
! 	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
! 	 * other than the page we're looking for. But it means that when we do this
! 	 * unlocked read, we might see a value that appears to be ahead of the
! 	 * page we're looking for. Don't PANIC on that until we've verified the
! 	 * value while holding the lock.
! 	 */
! 	expectedEndPtr.xlogid = ptr.xlogid;
! 	expectedEndPtr.xrecoff = ptr.xrecoff - ptr.xrecoff % XLOG_BLCKSZ + XLOG_BLCKSZ;
! 
! 	endptr = XLogCtl->xlblocks[idx];
! 	if (!XLByteEQ(expectedEndPtr, endptr))
! 	{
! 		/*
! 		 * Before we try to initialize the buffer for this page, let others
! 		 * know how far we've inserted, by updating the CurrPos field in our
! 		 * slot. This is important because AdvanceXLInsertBuffer() might need
! 		 * to wait for some insertions to finish so that it can write out the
! 		 * old page from the buffer.  Updating our slot before waiting for a
! 		 * new buffer ensures that we don't deadlock with ourselves if we run
! 		 * out of clean buffers.
! 		 *
! 		 * Note that we must not advance CurrPos past the page header yet.
! 		 * Otherwise someone might try to flush up to that point, which would
! 		 * fail if the next page was not initialized yet.
! 		 */
! 		XLogRecPtr pagebeginptr = ptr;
! 		pagebeginptr.xrecoff -= ptr.xrecoff % XLOG_BLCKSZ;
! 		UpdateSlotCurrPos(myslotno, pagebeginptr, false);
! 
! 		AdvanceXLInsertBuffer(ptr, false);
! 		endptr = XLogCtl->xlblocks[idx];
! 
! 		if (!XLByteEQ(expectedEndPtr, endptr))
! 			elog(PANIC, "could not find WAL buffer for %X/%X",
! 				 ptr.xlogid, ptr.xrecoff);
  	}
  
! 	/*
! 	 * Found the buffer holding this page. Return a pointer to the right
! 	 * offset within the page.
! 	 */
! 	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
! 	return XLogCtl->pages + idx * (Size) XLOG_BLCKSZ +
! 		ptr.xrecoff % XLOG_BLCKSZ;
! }
  
! /*
!  * Try to mark old insertion slots as free for reuse.
!  */
! static void
! ReuseOldSlots(void)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
  
! 	/* Give up if someone else is already doing this */
! 	if (!LWLockConditionalAcquire(WALInsertTailLock, LW_EXCLUSIVE))
! 		return;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	while (lastslot != nextslot)
! 	{
! 		/*
! 		 * Check if the oldest slot is still in use. We don't do any locking
! 		 * here, we just give up as soon as we find a slot that's still in
! 		 * use.
! 		 */
! 		volatile XLogInsertSlot *slot;
! 		slot = &XLogCtl->XLogInsertSlots[SlotNoToIdx(lastslot)];
! 
! 		if (!slot->finished)
! 			break;
! 
! 		/* Reinitialize the slot for reuse */
! 		slot->finished = false;
! 
! 		lastslot = NextSlotNo(lastslot);
! 	}
! 
! 	/*
! 	 * Update lastslot before we release the lock. (We don't need to grab
! 	 * insertpos_lck here, on the assumption that writing an int32 is atomic)
! 	 */
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
! }
! 
! /*
!  * Wait for any insertions < upto to finish. If upto is invalid, we wait until
!  * at least one slot is available for insertion.
!  *
!  * Returns a value >= upto, which indicates the oldest in-progress insertion
!  * that we saw in the array (or, if there are none in progress, the next insert
!  * position).
!  */
! static XLogRecPtr
! WaitXLogInsertionsToFinish(XLogRecPtr upto)
! {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			lastslot;
! 	int			nextslot;
! 	XLogRecPtr	LastPos;
! 	int			extraWaits = 0;
! 
! 	if (MyProc == NULL)
! 		elog(PANIC, "cannot wait without a PGPROC structure");
! 
! retry:
! 	/* Only allow one backend to advance lastslot at a time */
! 	LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * Read lastslot, nextslot, and current end of reserved XLOG.
! 	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	LastPos = Insert->CurrPos;
! 	lastslot = Insert->lastslot;
! 	nextslot = Insert->nextslot;
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	while (lastslot != nextslot)
! 	{
! 		/*
! 		 * Examine the oldest slot still in use.
! 		 */
! 		volatile XLogInsertSlot *slot;
! 		XLogRecPtr	slotptr;
! 
! 		slot = &XLogCtl->XLogInsertSlots[SlotNoToIdx(lastslot)];
! 
! 		/* First, a quick check without the lock. */
! 		if (slot->finished)
! 		{
! 			lastslot = NextSlotNo(lastslot);
! 			/* Reinitialize the slot for reuse */
! 			slot->finished = false;
! 			continue;
! 		}
! 
! 		SpinLockAcquire(&slot->lck);
! 		slotptr = slot->CurrPos;
! 
! 		if (slot->finished)
! 		{
! 			/*
! 			 * The insertion has already finished; we just need to advance
! 			 * lastslot to make the slot available for reuse.
! 			 */
! 			/* Reinitialize the slot for reuse */
! 			slot->finished = false;
! 			SpinLockRelease(&slot->lck);
! 			lastslot = NextSlotNo(lastslot);
! 			continue;
! 		}
! 		else
! 		{
! 			/*
! 			 * The insertion is still in progress. If we just needed any
! 			 * slot to become available and there is at least one slot
! 			 * free now, or if this slot's CurrPos >= upto, we can
! 			 * stop here. Otherwise we have to wait for it to finish.
! 			 */
! 			if ((XLogRecPtrIsInvalid(upto) && NextSlotNo(nextslot) != lastslot)
! 				|| (!XLogRecPtrIsInvalid(upto) && XLByteLE(upto, slotptr)))
! 			{
! 				SpinLockRelease(&slot->lck);
! 				LastPos = slotptr;
! 				break;
! 			}
! 			else
! 			{
! 				/* Wait for this insertion to finish. */
! 				MyProc->lwWaiting = true;
! 				MyProc->lwWaitMode = 0; /* doesn't matter */
! 				MyProc->lwWaitLink = NULL;
! 				if (slot->head == NULL)
! 					slot->head = MyProc;
! 				else
! 					slot->tail->lwWaitLink = MyProc;
! 				slot->tail = MyProc;
! 				SpinLockRelease(&slot->lck);
! 
! 				Insert->lastslot = lastslot;
! 				LWLockRelease(WALInsertTailLock);
! 				for (;;)
! 				{
! 					PGSemaphoreLock(&MyProc->sem, false);
! 					if (!MyProc->lwWaiting)
! 						break;
! 					extraWaits++;
! 				}
! 
! 				/*
! 				 * The insertion has now finished. Start all over. While we
! 				 * were not holding the tail-lock, someone might've filled up
! 				 * all slots again.
! 				 */
! 				goto retry;
! 			}
! 		}
! 	}
! 
! 	/*
! 	 * Update lastslot before we release the lock. (We don't need to grab
! 	 * insertpos_lck here, on the assumption that writing an int32 is atomic)
! 	 */
! 	Insert->lastslot = lastslot;
! 	LWLockRelease(WALInsertTailLock);
! 
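! 	/*
! 	 * Fix the process wait semaphore's count for any wakeups we absorbed
! 	 * while waiting, as LWLockAcquire does.
! 	 */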
! 	while (extraWaits-- > 0)
! 		PGSemaphoreUnlock(&MyProc->sem);
! 
! 	return LastPos;
  }
  
  /*
***************
*** 1440,1469 **** XLogArchiveCleanup(const char *xlog)
  }
  
  /*
!  * Advance the Insert state to the next buffer page, writing out the next
!  * buffer if it still contains unwritten data.
!  *
!  * If new_segment is TRUE then we set up the next buffer page as the first
!  * page of the next xlog segment file, possibly but not usually the next
!  * consecutive file page.
!  *
!  * The global LogwrtRqst.Write pointer needs to be advanced to include the
!  * just-filled page.  If we can do this for free (without an extra lock),
!  * we do so here.  Otherwise the caller must do it.  We return TRUE if the
!  * request update still needs to be done, FALSE if we did it internally.
!  *
!  * Must be called with WALInsertLock held.
   */
! static bool
! AdvanceXLInsertBuffer(bool new_segment)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx = NextBufIdx(Insert->curridx);
! 	bool		update_needed = true;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr;
  	XLogPageHeader NewPage;
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
--- 2175,2207 ----
  }
  
  /*
!  * Initialize XLOG buffers, writing out old buffers if they still contain
!  * unwritten data, up to the page containing 'upto'. Or, if 'opportunistic'
!  * is true, initialize as many pages as we can without having to write out
!  * unwritten data. Any new pages are initialized to zeros, with page headers
!  * initialized properly.
   */
! static void
! AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
! 	int			nextidx;
  	XLogRecPtr	OldPageRqstPtr;
  	XLogwrtRqst WriteRqst;
! 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
  	XLogPageHeader NewPage;
+ 	int			npages = 0;
+ 
+ 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Now that we have the lock, check if someone initialized the page
+ 	 * already.
+ 	 */
+ /* XXX: fix indentation before commit */
+ while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
+ {
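+ 	/*
+ 	 * Loop until the page containing 'upto' has been initialized. In
+ 	 * opportunistic mode, keep going until we'd have to write out dirty
+ 	 * data to make progress.
+ 	 */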
+ 	nextidx = NextBufIdx(XLogCtl->curridx);
  
  	/*
  	 * Get ending-offset of the buffer page we need to replace (this may be
***************
*** 1473,1482 **** AdvanceXLInsertBuffer(bool new_segment)
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  	{
! 		/* nope, got work to do... */
! 		XLogRecPtr	FinishedPageRqstPtr;
! 
! 		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
--- 2211,2222 ----
  	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
  	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  	{
! 		/*
! 		 * Nope, got work to do. If we just want to pre-initialize as much as
! 		 * we can without flushing, give up now.
! 		 */
! 		if (opportunistic)
! 			break;
  
  		/* Before waiting, get info_lck and update LogwrtResult */
  		{
***************
*** 1484,1504 **** AdvanceXLInsertBuffer(bool new_segment)
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
- 		update_needed = false;	/* Did the shared-request update */
- 
  		/*
  		 * Now that we have an up-to-date LogwrtResult value, see if we still
  		 * need to write it or if someone else already did.
  		 */
  		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/* Must acquire write lock */
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
--- 2224,2250 ----
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
  			SpinLockAcquire(&xlogctl->info_lck);
! 			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
! 				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
  			LogwrtResult = xlogctl->LogwrtResult;
  			SpinLockRelease(&xlogctl->info_lck);
  		}
  
  		/*
  		 * Now that we have an up-to-date LogwrtResult value, see if we still
  		 * need to write it or if someone else already did.
  		 */
  		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
  		{
! 			/*
! 			 * Must acquire write lock. Release WALBufMappingLock first, to
! 			 * make sure that all insertions that we need to wait for can
! 			 * finish (up to this same position). Otherwise we risk deadlock.
! 			 */
! 			LWLockRelease(WALBufMappingLock);
! 
! 			WaitXLogInsertionsToFinish(OldPageRqstPtr);
! 
  			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  			LogwrtResult = XLogCtl->LogwrtResult;
  			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
***************
*** 1508,1525 **** AdvanceXLInsertBuffer(bool new_segment)
  			}
  			else
  			{
! 				/*
! 				 * Have to write buffers while holding insert lock. This is
! 				 * not good, so only write as much as we absolutely must.
! 				 */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
  		}
  	}
  
--- 2254,2271 ----
  			}
  			else
  			{
! 				/* Have to write it ourselves */
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
  				WriteRqst.Write = OldPageRqstPtr;
  				WriteRqst.Flush.xlogid = 0;
  				WriteRqst.Flush.xrecoff = 0;
! 				XLogWrite(WriteRqst, false);
  				LWLockRelease(WALWriteLock);
  				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
  			}
+ 			/* Re-acquire WALBufMappingLock and retry */
+ 			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+ 			continue;
  		}
  	}
  
***************
*** 1527,1540 **** AdvanceXLInsertBuffer(bool new_segment)
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
! 
! 	if (new_segment)
! 	{
! 		/* force it to a segment start point */
! 		NewPageEndPtr.xrecoff += XLogSegSize - 1;
! 		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
! 	}
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
--- 2273,2279 ----
  	 * Now the next buffer slot is free and we can set it up to be the next
  	 * output page.
  	 */
! 	NewPageEndPtr = XLogCtl->xlblocks[XLogCtl->curridx];
  
  	if (NewPageEndPtr.xrecoff >= XLogFileSize)
  	{
***************
*** 1544,1556 **** AdvanceXLInsertBuffer(bool new_segment)
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
! 
! 	Insert->curridx = nextidx;
! 	Insert->currpage = NewPage;
  
! 	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
--- 2283,2292 ----
  	}
  	else
  		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
! 	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
! 	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
  
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
  
  	/*
  	 * Be sure to re-zero the buffer so that bytes beyond what we've written
***************
*** 1594,1604 **** AdvanceXLInsertBuffer(bool new_segment)
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
- 
- 		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
  	}
  
! 	return update_needed;
  }
  
  /*
--- 2330,2357 ----
  		NewLongPage->xlp_seg_size = XLogSegSize;
  		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
  		NewPage   ->xlp_info |= XLP_LONG_HEADER;
  	}
  
! 	/*
! 	 * Make sure the initialization of the page becomes visible to others
! 	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
! 	 * holding a lock.
! 	 */
! 	pg_write_barrier();
! 
! 	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
! 
! 	XLogCtl->curridx = nextidx;
! 
! 	npages++;
! }
! 	LWLockRelease(WALBufMappingLock);
! 
! #ifdef WAL_DEBUG
! 	if (npages > 0)
! 		elog(DEBUG1, "initialized %d pages, up to %X/%X",
! 			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
! #endif
  }
  
  /*
***************
*** 1643,1658 **** XLogCheckpointNeeded(uint32 logid, uint32 logseg)
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * If xlog_switch == TRUE, we are intending an xlog segment switch, so
!  * perform end-of-segment actions after writing the last page, even if
!  * it's not physically the end of its segment.  (NB: this will work properly
!  * only if caller specifies WriteRqst == page-end and flexible == false,
!  * and there is some data to write.)
!  *
!  * Must be called with WALWriteLock held.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
--- 2396,2407 ----
   * This option allows us to avoid uselessly issuing multiple writes when a
   * single one would do.
   *
!  * Must be called with WALWriteLock held. The caller must also have called
!  * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock, to make sure
!  * the data is ready to write.
   */
  static void
! XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
***************
*** 1701,1714 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[curridx].xlogid,
! 				 XLogCtl->xlblocks[curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 2450,2463 ----
  		 * if we're passed a bogus WriteRqst.Write that is past the end of the
  		 * last page that's been initialized by AdvanceXLInsertBuffer.
  		 */
! 		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
! 		if (!XLByteLT(LogwrtResult.Write, EndPtr))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 EndPtr.xlogid, EndPtr.xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = EndPtr;
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1805,1820 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
- 			 * We also do this if this is the last page written for an xlog
- 			 * switch.
- 			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg || (xlog_switch && last_iteration))
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
--- 2554,2566 ----
  			 * later. Doing it here ensures that one and only one backend will
  			 * perform this fsync.
  			 *
  			 * This is also the right place to notify the Archiver that the
  			 * segment is ready to copy to archival storage, and to update the
  			 * timer for archive_timeout, and to signal for a checkpoint if
  			 * too many logfile segments have been used since the last
  			 * checkpoint.
  			 */
! 			if (finishing_seg)
  			{
  				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
  				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
***************
*** 2068,2073 **** XLogFlush(XLogRecPtr record)
--- 2814,2820 ----
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		XLogRecPtr	insertpos;
  
  		/* read LogwrtResult and update local state */
  		SpinLockAcquire(&xlogctl->info_lck);
***************
*** 2081,2086 **** XLogFlush(XLogRecPtr record)
--- 2828,2839 ----
  			break;
  
  		/*
+ 		 * Before actually performing the write, wait for any in-flight
+ 		 * insertions into the pages we're about to write out to finish.
+ 		 */
+ 		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+ 
+ 		/*
  		 * Try to get the write lock. If we can't get it immediately, wait
  		 * until it's released, and recheck if we still need to do the flush
  		 * or if the backend that held the lock did it for us already. This
***************
*** 2100,2128 **** XLogFlush(XLogRecPtr record)
  		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
  
! 				if (freespace < SizeOfXLogRecord)		/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
! 			XLogWrite(WriteRqst, false, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
--- 2853,2862 ----
  		LogwrtResult = XLogCtl->LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = insertpos;
! 			WriteRqst.Flush = insertpos;
  
! 			XLogWrite(WriteRqst, false);
  		}
  		LWLockRelease(WALWriteLock);
  		/* done */
***************
*** 2237,2243 **** XLogBackgroundFlush(void)
  
  	START_CRIT_SECTION();
  
! 	/* now wait for the write lock */
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
--- 2971,2978 ----
  
  	START_CRIT_SECTION();
  
! 	/* now wait for any in-progress insertions to finish and get write lock */
! 	WaitXLogInsertionsToFinish(WriteRqstPtr);
  	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  	LogwrtResult = XLogCtl->LogwrtResult;
  	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
***************
*** 2246,2256 **** XLogBackgroundFlush(void)
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible, false);
  	}
- 	LWLockRelease(WALWriteLock);
  
  	END_CRIT_SECTION();
  }
  
  /*
--- 2981,2998 ----
  
  		WriteRqst.Write = WriteRqstPtr;
  		WriteRqst.Flush = WriteRqstPtr;
! 		XLogWrite(WriteRqst, flexible);
  	}
  
  	END_CRIT_SECTION();
+ 
+ 	LWLockRelease(WALWriteLock);
+ 
+ 	/*
+ 	 * Great, done. To take some work off the critical path, try to initialize
+ 	 * as many of the no-longer-needed WAL buffers for future use as we can.
+ 	 */
+ 	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
  }
  
  /*
***************
*** 5037,5042 **** XLOGShmemSize(void)
--- 5779,5791 ----
  
  	/* XLogCtl */
  	size = sizeof(XLogCtlData);
+ 
+ 	/*
+ 	 * We force XLogCtl to be suitably aligned to put insertpos_lck and
+ 	 * the fields it protects on the same cache line.
+ 	 */
+ 	size += XLOGCTLINSERT_ALIGNMENT;
+ 
  	/* xlblocks array */
  	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
  	/* extra alignment padding for XLOG I/O buffers */
***************
*** 5059,5069 **** XLOGShmemInit(void)
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
! 	XLogCtl = (XLogCtlData *)
! 		ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &foundXLog);
  
  	if (foundCFile || foundXLog)
  	{
--- 5808,5818 ----
  	bool		foundCFile,
  				foundXLog;
  	char	   *allocptr;
+ 	int			i;
  
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
! 	allocptr = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &foundXLog);
  
  	if (foundCFile || foundXLog)
  	{
***************
*** 5072,5077 **** XLOGShmemInit(void)
--- 5821,5833 ----
  		return;
  	}
  
+ 	/*
+ 	 * Ensure desired alignment for XLogCtlInsert, so that insertpos_lck and
+ 	 * the fields it protects fall on the same cache line.
+ 	 */
+ 	allocptr += XLOGCTLINSERT_ALIGNMENT - ((uintptr_t) allocptr) % XLOGCTLINSERT_ALIGNMENT;
+ 	XLogCtl = (XLogCtlData *) allocptr;
+ 
  	memset(XLogCtl, 0, sizeof(XLogCtlData));
  
  	/*
***************
*** 5084,5089 **** XLOGShmemInit(void)
--- 5840,5857 ----
  	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
  	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
  
+ 	/* Initialize insertion slots */
+ 	for (i = 0; i < NumXLogInsertSlots; i++)
+ 	{
+ 		XLogInsertSlot *slot = &XLogCtl->XLogInsertSlots[i];
+ 		slot->CurrPos = InvalidXLogRecPtr;
+ 		slot->finished = false;
+ 		slot->head = slot->tail = NULL;
+ 		SpinLockInit(&slot->lck);
+ 	}
+ 	XLogCtl->Insert.nextslot = 0;
+ 	XLogCtl->Insert.lastslot = 0;
+ 
  	/*
  	 * Align the start of the page buffers to an ALIGNOF_XLOG_BUFFER boundary.
  	 */
***************
*** 5098,5104 **** XLOGShmemInit(void)
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
! 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
--- 5866,5872 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->SharedRecoveryInProgress = true;
  	XLogCtl->SharedHotStandbyActive = false;
! 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
  	SpinLockInit(&XLogCtl->info_lck);
  	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
  	InitSharedLatch(&XLogCtl->WALWriterLatch);
***************
*** 5980,5985 **** StartupXLOG(void)
--- 6748,6754 ----
  	bool		backupEndRequired = false;
  	bool		backupFromStandby = false;
  	DBState		dbstate_at_startup;
+ 	int			firstIdx;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6232,6238 **** StartupXLOG(void)
  
  	lastFullPageWrites = checkPoint.fullPageWrites;
  
! 	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
  		ereport(PANIC,
--- 7001,7007 ----
  
  	lastFullPageWrites = checkPoint.fullPageWrites;
  
! 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
  		ereport(PANIC,
***************
*** 6786,6793 **** StartupXLOG(void)
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 	XLogCtl->xlblocks[0].xlogid = openLogId;
! 	XLogCtl->xlblocks[0].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
--- 7555,7566 ----
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
  	Insert->PrevRecord = LastRec;
! 
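! 	/*
! 	 * A WAL page must live in the buffer slot determined by its address, so
! 	 * place the last read page in the slot that EndOfLog maps to.
! 	 */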
! 	firstIdx = XLogRecEndPtrToBufIdx(EndOfLog);
! 	XLogCtl->curridx = firstIdx;
! 
! 	XLogCtl->xlblocks[firstIdx].xlogid = openLogId;
! 	XLogCtl->xlblocks[firstIdx].xrecoff =
  		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
  
  	/*
***************
*** 6795,6804 **** StartupXLOG(void)
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
! 	Insert->currpos = (char *) Insert->currpage +
! 		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
--- 7568,7576 ----
  	 * record spans, not the one it starts in.	The last block is indeed the
  	 * one we want to use.
  	 */
! 	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
! 	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
! 	Insert->CurrPos = EndOfLog;
  
  	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
  
***************
*** 6807,6818 **** StartupXLOG(void)
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(Insert->currpos, 0, freespace);
! 		XLogCtl->Write.curridx = 0;
  	}
  	else
  	{
--- 7579,7590 ----
  	XLogCtl->LogwrtRqst.Write = EndOfLog;
  	XLogCtl->LogwrtRqst.Flush = EndOfLog;
  
! 	freespace = INSERT_FREESPACE(EndOfLog);
  	if (freespace > 0)
  	{
  		/* Make sure rest of page is zero */
! 		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndOfLog.xrecoff % XLOG_BLCKSZ, 0, freespace);
! 		XLogCtl->Write.curridx = firstIdx;
  	}
  	else
  	{
***************
*** 6824,6830 **** StartupXLOG(void)
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(0);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
--- 7596,7602 ----
  		 * this is sufficient.	The first actual attempt to insert a log
  		 * record will advance the insert state.
  		 */
! 		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
  	}
  
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
***************
*** 6835,6841 **** StartupXLOG(void)
  	 * XLOG_FPW_CHANGE record before resource manager writes cleanup
  	 * WAL records or checkpoint record is written.
  	 */
! 	Insert->fullPageWrites = lastFullPageWrites;
  	LocalSetXLogInsertAllowed();
  	UpdateFullPageWrites();
  	LocalXLogInsertAllowed = -1;
--- 7607,7613 ----
  	 * XLOG_FPW_CHANGE record before resource manager writes cleanup
  	 * WAL records or checkpoint record is written.
  	 */
! 	Insert->fullPageWrites = doPageWrites = lastFullPageWrites;
  	LocalSetXLogInsertAllowed();
  	UpdateFullPageWrites();
  	LocalXLogInsertAllowed = -1;
***************
*** 7307,7327 **** InitXLOGAccess(void)
  }
  
  /*
!  * Once spawned, a backend may update its local RedoRecPtr from
!  * XLogCtl->Insert.RedoRecPtr; it must hold the insert lock or info_lck
!  * to do so.  This is done in XLogInsert() or GetRedoRecPtr().
   */
  XLogRecPtr
  GetRedoRecPtr(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
  
  	SpinLockAcquire(&xlogctl->info_lck);
! 	Assert(XLByteLE(RedoRecPtr, xlogctl->Insert.RedoRecPtr));
! 	RedoRecPtr = xlogctl->Insert.RedoRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	return RedoRecPtr;
  }
  
--- 8079,8107 ----
  }
  
  /*
!  * Return the current Redo pointer from shared memory.
!  *
!  * As a side-effect, the local RedoRecPtr copy is updated.
   */
  XLogRecPtr
  GetRedoRecPtr(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogRecPtr ptr;
  
+ 	/*
+ 	 * The possibly out-of-date copy in XLogCtl is enough. Even if we
+ 	 * grabbed insertpos_lck to read the master copy, someone might update
+ 	 * it just after we've released the lock.
+ 	 */
  	SpinLockAcquire(&xlogctl->info_lck);
! 	ptr = xlogctl->RedoRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
+ 	if (XLByteLT(RedoRecPtr, ptr))
+ 		RedoRecPtr = ptr;		/* i.e. the value we read under the spinlock */
+ 
  	return RedoRecPtr;
  }
  
***************
*** 7330,7336 **** GetRedoRecPtr(void)
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire WALInsertLock which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
--- 8110,8116 ----
   *
   * NOTE: The value *actually* returned is the position of the last full
   * xlog page. It lags behind the real insert position by at most 1 page.
!  * For that, we don't need to acquire insertpos_lck which can be quite
   * heavily contended, and an approximation is enough for the current
   * usage of this function.
   */
***************
*** 7592,7597 **** LogCheckpointEnd(bool restartpoint)
--- 8372,8379 ----
  void
  CreateCheckPoint(int flags)
  {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
  	bool		shutdown;
  	CheckPoint	checkPoint;
  	XLogRecPtr	recptr;
***************
*** 7606,7611 **** CreateCheckPoint(int flags)
--- 8388,8394 ----
  	uint32		insert_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	XLogRecPtr	curInsert;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
***************
*** 7674,7683 **** CreateCheckPoint(int flags)
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
  	 * the checkpoint REDO pointer.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
--- 8457,8467 ----
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
  	/*
! 	 * We must hold insertpos_lck while examining insert state to determine
  	 * the checkpoint REDO pointer.
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	curInsert = Insert->CurrPos;
  
  	/*
  	 * If this isn't a shutdown or forced checkpoint, and we have not switched
***************
*** 7689,7695 **** CreateCheckPoint(int flags)
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding the WALInsertLock we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
--- 8473,8479 ----
  	 * (Perhaps it'd make even more sense to checkpoint only when the previous
  	 * checkpoint record is in a different xlog page?)
  	 *
! 	 * While holding insertpos_lck we find the current WAL insertion point
  	 * and compare that with the starting point of the last checkpoint, which
  	 * is the redo pointer. We use the redo pointer because the start and end
  	 * points of a checkpoint can be hundreds of files apart on large systems
***************
*** 7698,7712 **** CreateCheckPoint(int flags)
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
- 		XLogRecPtr	curInsert;
- 
- 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
--- 8482,8493 ----
  	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  				  CHECKPOINT_FORCE)) == 0)
  	{
  		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
  		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
  		if (insert_logId == redo_logId &&
  			insert_logSeg == redo_logSeg)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
  			return;
***************
*** 7733,7750 **** CreateCheckPoint(int flags)
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(Insert);
  	if (freespace < SizeOfXLogRecord)
  	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
  	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock AND the info_lck.
  	 *
  	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
  	 * pointing past where it really needs to point.  This is okay; the only
--- 8514,8533 ----
  	 * the buffer flush work.  Those XLOG records are logically after the
  	 * checkpoint, even though physically before it.  Got that?
  	 */
! 	freespace = INSERT_FREESPACE(curInsert);
  	if (freespace < SizeOfXLogRecord)
  	{
! 		XLByteAdvance(curInsert, freespace);
! 		if (curInsert.xrecoff % XLogSegSize == 0)
! 			curInsert.xrecoff += SizeOfXLogLongPHD;
! 		else
! 			curInsert.xrecoff += SizeOfXLogShortPHD;
  	}
! 	checkPoint.redo = curInsert;
  
  	/*
  	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock.
  	 *
  	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
  	 * pointing past where it really needs to point.  This is okay; the only
***************
*** 7753,7772 **** CreateCheckPoint(int flags)
  	 * XLogInserts that happen while we are dumping buffers must assume that
  	 * their buffer changes are not included in the checkpoint.
  	 */
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! 		SpinLockRelease(&xlogctl->info_lck);
! 	}
  
  	/*
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
--- 8536,8553 ----
  	 * XLogInserts that happen while we are dumping buffers must assume that
  	 * their buffer changes are not included in the checkpoint.
  	 */
! 	RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
  
  	/*
  	 * Now we can release WAL insert lock, allowing other xacts to proceed
  	 * while we are flushing disk buffers.
  	 */
! 	SpinLockRelease(&Insert->insertpos_lck);
! 
! 	/* Update the info_lck-protected copy of RedoRecPtr as well */
! 	SpinLockAcquire(&xlogctl->info_lck);
! 	xlogctl->RedoRecPtr = checkPoint.redo;
! 	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
***************
*** 7786,7792 **** CreateCheckPoint(int flags)
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released WALInsertLock, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
--- 8567,8573 ----
  	 * we wait till he's out of his commit critical section before proceeding.
  	 * See notes in RecordTransactionCommit().
  	 *
! 	 * Because we've already released insertpos_lck, this test is a bit fuzzy:
  	 * it is possible that we will wait for xacts we didn't really need to
  	 * wait for.  But the delay should be short and it seems better to make
  	 * checkpoint take a bit longer than to hold locks longer than necessary.
***************
*** 8153,8167 **** CreateRestartPoint(int flags)
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * You need to hold WALInsertLock and info_lck to update it, although
! 	 * during recovery acquiring WALInsertLock is just pro forma, because
! 	 * there is no other processes updating Insert.RedoRecPtr.
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	SpinLockAcquire(&xlogctl->info_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
- 	LWLockRelease(WALInsertLock);
  
  	/*
  	 * Prepare to accumulate statistics.
--- 8934,8951 ----
  	 * the number of segments replayed since last restartpoint, and request a
  	 * restartpoint if it exceeds checkpoint_segments.
  	 *
! 	 * Like in CreateCheckPoint(), hold insertpos_lck to update it, although
! 	 * during recovery acquiring insertpos_lck is just pro forma, because no
! 	 * WAL insertions are happening.
  	 */
! 	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
  	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
+ 	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
+ 
+ 	/* Also update the info_lck-protected copy */
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	xlogctl->RedoRecPtr = lastCheckPoint.redo;
  	SpinLockRelease(&xlogctl->info_lck);
  
  	/*
  	 * Prepare to accumulate statistics.
***************
*** 8448,8454 **** XLogReportParameters(void)
  void
  UpdateFullPageWrites(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
--- 9232,9238 ----
  void
  UpdateFullPageWrites(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  
  	/*
  	 * Do nothing if full_page_writes has not been changed.
***************
*** 8471,8479 **** UpdateFullPageWrites(void)
  	 */
  	if (fullPageWrites)
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  		Insert->fullPageWrites = true;
! 		LWLockRelease(WALInsertLock);
  	}
  
  	/*
--- 9255,9263 ----
  	 */
  	if (fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
  		Insert->fullPageWrites = true;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
  
  	/*
***************
*** 8494,8502 **** UpdateFullPageWrites(void)
  
  	if (!fullPageWrites)
  	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  		Insert->fullPageWrites = false;
! 		LWLockRelease(WALInsertLock);
  	}
  	END_CRIT_SECTION();
  }
--- 9278,9286 ----
  
  	if (!fullPageWrites)
  	{
! 		SpinLockAcquire(&Insert->insertpos_lck);
  		Insert->fullPageWrites = false;
! 		SpinLockRelease(&Insert->insertpos_lck);
  	}
  	END_CRIT_SECTION();
  }
***************
*** 9024,9029 **** issue_xlog_fsync(int fd, uint32 log, uint32 seg)
--- 9808,9814 ----
  XLogRecPtr
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	checkpointloc;
***************
*** 9086,9111 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
  		{
! 			LWLockRelease(WALInsertLock);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
  	else
! 		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
! 	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
--- 9871,9896 ----
  	 * Note that forcePageWrites has no effect during an online backup from
  	 * the standby.
  	 *
! 	 * We must hold insertpos_lck to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		if (Insert->exclusiveBackup)
  		{
! 			SpinLockRelease(&Insert->insertpos_lck);
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  					 errmsg("a backup is already in progress"),
  					 errhint("Run pg_stop_backup() and try again.")));
  		}
! 		Insert->exclusiveBackup = true;
  	}
  	else
! 		Insert->nonExclusiveBackups++;
! 	Insert->forcePageWrites = true;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
***************
*** 9218,9230 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			LWLockAcquire(WALInsertLock, LW_SHARED);
! 			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
  			{
! 				XLogCtl->Insert.lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			LWLockRelease(WALInsertLock);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
--- 10003,10015 ----
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
  			 */
! 			SpinLockAcquire(&Insert->insertpos_lck);
! 			if (XLByteLT(Insert->lastBackupStart, startpoint))
  			{
! 				Insert->lastBackupStart = startpoint;
  				gotUniqueStartpoint = true;
  			}
! 			SpinLockRelease(&Insert->insertpos_lck);
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9308,9334 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
  	{
! 		Assert(XLogCtl->Insert.exclusiveBackup);
! 		XLogCtl->Insert.exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10093,10120 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = DatumGetBool(arg);
  
  	/* Update backup counters and forcePageWrites on failure */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
  	{
! 		Assert(Insert->exclusiveBackup);
! 		Insert->exclusiveBackup = false;
  	}
  	else
  	{
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9341,9346 **** pg_start_backup_callback(int code, Datum arg)
--- 10127,10133 ----
  XLogRecPtr
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
+ 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	bool		exclusive = (labelfile == NULL);
  	bool		backup_started_in_recovery = false;
  	XLogRecPtr	startpoint;
***************
*** 9394,9402 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
  	else
  	{
  		/*
--- 10181,10189 ----
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
! 	SpinLockAcquire(&Insert->insertpos_lck);
  	if (exclusive)
! 		Insert->exclusiveBackup = false;
  	else
  	{
  		/*
***************
*** 9405,9420 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
  	}
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  
  	if (exclusive)
  	{
--- 10192,10207 ----
  		 * backups, it is expected that each do_pg_start_backup() call is
  		 * matched by exactly one do_pg_stop_backup() call.
  		 */
! 		Assert(Insert->nonExclusiveBackups > 0);
! 		Insert->nonExclusiveBackups--;
  	}
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	if (exclusive)
  	{
***************
*** 9692,9707 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  void
  do_pg_abort_backup(void)
  {
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
! 		XLogCtl->Insert.forcePageWrites = false;
  	}
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
--- 10479,10496 ----
  void
  do_pg_abort_backup(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
! 
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	Assert(Insert->nonExclusiveBackups > 0);
! 	Insert->nonExclusiveBackups--;
  
! 	if (!Insert->exclusiveBackup &&
! 		Insert->nonExclusiveBackups == 0)
  	{
! 		Insert->forcePageWrites = false;
  	}
! 	SpinLockRelease(&Insert->insertpos_lck);
  }
  
  /*
***************
*** 9755,9766 **** GetStandbyFlushRecPtr(void)
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	LWLockAcquire(WALInsertLock, LW_SHARED);
! 	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
! 	LWLockRelease(WALInsertLock);
  
  	return current_recptr;
  }
--- 10544,10555 ----
  XLogRecPtr
  GetXLogInsertRecPtr(void)
  {
! 	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecPtr	current_recptr;
  
! 	SpinLockAcquire(&Insert->insertpos_lck);
! 	current_recptr = Insert->CurrPos;
! 	SpinLockRelease(&Insert->insertpos_lck);
  
  	return current_recptr;
  }
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1753,1761 **** GetOldestActiveTransactionId(void)
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
!  * it's already inserted its commit record.  Whether it takes a little while
!  * for clearing of inCommit to propagate is unimportant for correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
--- 1753,1762 ----
   * the result is somewhat indeterminate, but we don't really care.  Even in
   * a multiprocessor with delayed writes to shared memory, it should be certain
   * that setting of inCommit will propagate to shared memory when the backend
!  * takes a lock to write the WAL record, so we cannot fail to see an xact as
!  * inCommit if it's already inserted its commit record.  Whether it takes a
!  * little while for clearing of inCommit to propagate is unimportant for
!  * correctness.
   */
  int
  GetTransactionsInCommit(TransactionId **xids_p)
*** a/src/backend/storage/lmgr/spin.c
--- b/src/backend/storage/lmgr/spin.c
***************
*** 56,61 **** SpinlockSemas(void)
--- 56,64 ----
  	 *
  	 * For now, though, we just need a few spinlocks (10 should be plenty)
  	 * plus one for each LWLock and one for each buffer header.
+ 	 *
+ 	 * XXX: remember to adjust this for the number of spinlocks needed by the
+ 	 * xlog.c changes before committing!
  	 */
  	return NumLWLocks() + NBuffers + 10;
  }
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 53,59 **** typedef enum LWLockId
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALInsertLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
--- 53,59 ----
  	ProcArrayLock,
  	SInvalReadLock,
  	SInvalWriteLock,
! 	WALBufMappingLock,
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
  	SerializablePredicateLockListLock,
  	OldSerXidLock,
  	SyncRepLock,
+ 	WALInsertTailLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#79Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#78)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Thu, Mar 15, 2012 at 5:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

When all those changes are put together, the patched version now beats or
matches the current code in the RAM drive tests, except that the
single-client case is still about 10% slower. I added the new test results
at http://community.enterprisedb.com/xloginsert-scale-tests/, and a new
version of the patch is attached.

When I ran pgbench with the v18 patch, I encountered the following PANIC error:

PANIC: space reserved for WAL record does not match what was written

To investigate the cause, I applied the following change and ran pgbench again:

------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bfc7421..2cef0bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1294,7 +1294,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr,
 		}
 		if (!XLByteEQ(CurrPos, EndPos))
-			elog(PANIC, "space reserved for WAL record does not match what was written");
+			elog(PANIC, "space reserved for WAL record does not match what was written, CurrPos: %X/%X, EndPos: %X/%X", CurrPos.xlogid, CurrPos.xrecoff, EndPos.xlogid, EndPos.xrecoff);
 	}
 	else
 	{
------------------------

then I got the following:

PANIC: space reserved for WAL record does not match what was written, CurrPos: C/0, EndPos: B/FF000000

So I think the patch has a bug in how it handles the WAL file boundary.
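
For reference: with the pre-9.3 XLogRecPtr representation, xrecoff is
only valid below XLogFileSize (0xFF000000 with the default 16MB
segments), so "B/FF000000" is an unnormalized spelling of "C/0", and
advancing a position across the log-file boundary has to carry into
xlogid. A minimal sketch of that carry (illustration only; the function
name is made up, and it just spells out the same carry the patch's
XLByteAdvance usage has to perform):

static void
AdvanceRecPtr(XLogRecPtr *ptr, uint32 nbytes)
{
	/* carry into xlogid when xrecoff would reach XLogFileSize */
	if (ptr->xrecoff + nbytes >= XLogFileSize)
	{
		ptr->xlogid += 1;
		ptr->xrecoff = ptr->xrecoff + nbytes - XLogFileSize;
	}
	else
		ptr->xrecoff += nbytes;
}

A reserved end position left at xrecoff == XLogFileSize, like the
EndPos above, compares unequal to the normalized C/0, which would
explain the failed XLByteEQ check.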

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#80Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#79)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On 21.03.2012 13:14, Fujii Masao wrote:

PANIC: space reserved for WAL record does not match what was written, CurrPos: C/0, EndPos: B/FF000000

So I think the patch has a bug in how it handles the WAL file boundary.

Thanks for the testing! These WAL boundary issues are really tricky: you
found bugs in that area before, I found and fixed one before posting the
last version, and apparently there's still at least one left.

Overall, what do you (and others) think about the state of this patch?
I'm starting to feel that this needs to be pushed to 9.3. That bug might
not be very hard to fix, but the fact that these bugs are still cropping
up at this late stage makes me uneasy. That boundary calculation in
particular is surprisingly tricky, and I think it could be made less
tricky with some refactoring of the WAL-logging code, replacing
XLogRecPtr with uint64s, like Peter (IIRC) suggested a while ago. And
that seems like 9.3 material. Also, there are still these two known issues:

1. It slows down the WAL insertion in a single backend by about 10%
2. With lots of backends inserting tiny records concurrently, you get
spinlock contention, which consumes a lot of CPU. Without the patch, you
get lwlock contention and bandwidth isn't any better, but you sleep
rather than spin.

I'm afraid those issues aren't easily fixable. I haven't been able to
identify the source of the slowdown in the single-backend case; it seems
to be simply the distributed cost of the extra bookkeeping. That might
be acceptable: a 10% slowdown of raw WAL insertion speed is not good,
but WAL insertion accounts for only a fraction of the total CPU usage in
any real workload, so I believe the slowdown of a real application would
be more like 1-3% at worst. But I would feel more comfortable if we had
more time to test that.

The spinlock contention issue might be acceptable too. I think it would
be hard to run into it in a real application, and even then, the
benchmarks show that although you spend a lot of CPU time spinning, you
get at least the same overall bandwidth with the patch, which is what
really matters. It could also be alleviated by reducing the time the
spinlock is held, which could be done by making the space reservation
calculation simpler. If we got rid of the limitation that the WAL record
header is never split across WAL pages, and always stored the
continuation record header on all WAL pages, the space reservation
calculation could be reduced to essentially "currentpos += size" (see
the sketch below). But that again seems like 9.3 material.
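
To sketch what I mean (hypothetical code: it assumes the uint64-based
WAL position discussed above, and the names CurrBytePos and
ReserveXLogSpace are made up for illustration):

static uint64
ReserveXLogSpace(volatile XLogCtlInsert *Insert, uint32 size)
{
	uint64		startbytepos;

	/*
	 * With record headers allowed to split across page boundaries,
	 * reserving space for a record is a single addition while
	 * holding the spinlock.
	 */
	SpinLockAcquire(&Insert->insertpos_lck);
	startbytepos = Insert->CurrBytePos;
	Insert->CurrBytePos = startbytepos + size;
	SpinLockRelease(&Insert->insertpos_lck);

	return startbytepos;	/* byte position where the record starts */
}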

So, although none of the issues alone is a show-stopper, considering
all these things together I'm starting to feel that this needs to be
pushed to 9.3. Thoughts?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#81Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#80)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Wed, Mar 21, 2012 at 7:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

So, although none of the issues alone is a show-stopper, considering all
these things together I'm starting to feel that this needs to be pushed to
9.3. Thoughts?

I think I agree. I like the refactoring ideas that you're proposing,
but I don't really think we should be starting on that in mid-March.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#80)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

... although none of the issues alone is a show-stopper, considering
all these things together I'm starting to feel that this needs to be
pushed to 9.3. Thoughts?

Agreed. In particular, I think you are right that it'd be prudent to
simplify the WAL-location arithmetic and then rebase this code onto
that. And since no code at all has been written for the arithmetic
change, I think we have to consider that it's not 9.2 material.

regards, tom lane

#83Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#82)
Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

On Wed, Mar 21, 2012 at 10:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

... although none of the issues alone is a show-stopper, considering
all these things together I'm starting to feel that this needs to be
pushed to 9.3. Thoughts?

Agreed.  In particular, I think you are right that it'd be prudent to
simplify the WAL-location arithmetic and then rebase this code onto
that.  And since no code at all has been written for the arithmetic
change, I think we have to consider that it's not 9.2 material.

Agreed.

BTW, the patch changes some functions so that they use a volatile
pointer, as follows:

@@ -8448,7 +9232,7 @@ XLogReportParameters(void)
 void
 UpdateFullPageWrites(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
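
The point of the volatile qualifier, as I understand it, is that
SpinLockAcquire/SpinLockRelease are not compiler barriers on every
platform, so accesses to the protected struct have to go through a
volatile-qualified pointer, or the compiler may cache or reorder them
across the lock calls. A minimal sketch of the pattern (the function
name is made up):

static void
SetFullPageWrites(bool newval)
{
	/* volatile keeps the store from being moved outside the lock */
	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;

	SpinLockAcquire(&Insert->insertpos_lck);
	Insert->fullPageWrites = newval;
	SpinLockRelease(&Insert->insertpos_lck);
}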

Should these changes be applied anyway?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center