XLogInsert
In XLogInsert (src/backend/access/transam/xlog.c), the part that adds backup
blocks to the rdata chain carries this comment:
/*
* Make additional rdata chain entries for the backup blocks, so that we
* don't need to special-case them in the write loop. Note that we have
* now irrevocably changed the input rdata chain.
If I read the code correctly, the only thing that is irrevocable is that it
writes into rdt->next; if it saved an old copy of rdt first, it could revoke
the changes just by doing rdt_old->next = NULL. If that were done, then I
think this code could be moved out of the section holding the WALInsertLock.
It could probably be folded into the part of the code that computes the CRC.
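In pseudo-C, the idea would look something like this (a sketch only; rdt_old
is a hypothetical saved pointer, not a variable in today's code):

/* Minimal sketch of the revocation idea (illustrative only). */
XLogRecData *rdt_old = rdt;     /* rdt points at the last entry of the
                                 * caller-supplied rdata chain */

rdt->next = &(dtbuf_rdt1[i]);   /* append the backup-block entries... */
/* ... build the rest of the backup-block sub-chain ... */

/* Later, if the REDO-race check says we must start over: */
rdt_old->next = NULL;           /* detach the appended entries, restoring
                                 * the input chain to its original state */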
I don't think this would be a big performance win, as that part of the code is
pretty fast, but every bit helps in a highly contended lock, and I think the
code would be simpler as well. Do you think it would be worth me giving this
a try?
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
If I read the code correctly, the only thing that is irrevocable is that it
writes into rdt->next; if it saved an old copy of rdt first, it could revoke
the changes just by doing rdt_old->next = NULL. If that were done, then I
think this code could be moved out of the section holding the WALInsertLock.
Hmm, I recall that the changes are ... or were ... more complex.
The tricky case I think is where we have to go back and redo the
block-backup decisions after discovering that the checkpoint REDO
pointer has just moved.
If you can get the work out of the WALInsertLock section for just a
few more instructions, it would definitely be worth doing.
regards, tom lane
On Wed, Aug 19, 2009 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
If I read the code correctly, the only thing that is irrevocable is that it
writes into rdt->next; if it saved an old copy of rdt first, it could revoke
the changes just by doing rdt_old->next = NULL. If that were done, then I
think this code could be moved out of the section holding the WALInsertLock.
Hmm, I recall that the changes are ... or were ... more complex.
The tricky case I think is where we have to go back and redo the
block-backup decisions after discovering that the checkpoint REDO
pointer has just moved.
If you can get the work out of the WALInsertLock section for just a
few more instructions, it would definitely be worth doing.
I've attached a patch which removes the iteration over the blocks to be
backed-up from the critical section of XLogInsert. Now those blocks are
only looped over in one piece of code which both computes the CRC and builds
the linked list, rather than having parallel loops.
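The shape of the change is a dummy-head list splice; condensed to a sketch
(mine, not the patch text; the real loop also fills in data/len and folds the
bytes into the CRC as it goes):

/* Condensed sketch of the chain-stitching pattern used by the patch. */
XLogRecData  dummy_node;            /* local head for the private chain */
XLogRecData *rdt2 = &dummy_node;

dummy_node.next = NULL;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
    if (!dtbuf_bkp[i])
        continue;
    rdt2->next = &(dtbuf_rdt1[i]);  /* grow the private chain only */
    rdt2 = rdt2->next;
    /* ... CRC the header and page bytes, set data/len, etc. ... */
}

/* Only after the REDO-race check has passed, while holding the lock: */
rdt->next = dummy_node.next;        /* one pointer store stitches the
                                     * private chain onto the caller's */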
I've used an elog statement (not shown in the patch) to demonstrate that the
"goto begin;" after detecting the REDO race actually does get executed under a
standard workload (pgbench -c10). Two to four out of ten backends execute
that code path for each checkpoint on my single-CPU machine. By doing a
kill -9 on a process, to simulate a crash, during the period after the goto
begin is exercised but before the precipitating checkpoint completes, I can
force it to use the written WAL records in recovery. The database
automatically recovers and the results are self-consistent.
I cannot imagine any other races, rare events, or action at a distance that
could come into play with this code change, so I cannot think of anything
else to test at the moment.
I could not detect a speed difference with pgbench, but as I cannot get
pgbench to be XLogInsert-bound, that is not surprising. Using the only
XLogInsert-bound test case I know of, parallel COPY into a skinny, unindexed
table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync
turned off (to approximately simulate SSD, which I do not have), I get a
speed improvement of 2-4% with the patch over unpatched head. Maybe with
more CPUs the benefit would be greater.
That small improvement is probably not very attractive, however I think the
patch makes the overall code a bit cleaner, so it may be warranted on that
ground. Indeed, my motivation for working on this is that I kept beating my
head against the complexity of the old code, and thought that simplifying it
would make future work easier.
Cheers,
Jeff
Attachments:
XLogInsert_rdt2_v1.patch (text/x-diff; charset=US-ASCII)
Index: xlog.c
===================================================================
RCS file: /home/jjanes/pgrepo/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.352
diff -c -r1.352 xlog.c
*** xlog.c 10 Sep 2009 09:42:10 -0000 1.352
--- xlog.c 10 Sep 2009 19:27:08 -0000
***************
*** 540,548 ****
bool dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
BkpBlock dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
XLogRecPtr dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];
! XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];
! XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];
pg_crc32 rdata_crc;
uint32 len,
write_len;
--- 540,550 ----
bool dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
BkpBlock dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
XLogRecPtr dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS]; /*xlog header of backed up block*/
! XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS]; /*part of block before the hole*/
! XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS]; /*part of block after the hole*/
! XLogRecData dummy_node; /* head node for back-up block chain*/
! XLogRecData *rdt2; /* tail pointer for back-up block chain*/
pg_crc32 rdata_crc;
uint32 len,
write_len;
***************
*** 663,696 ****
/*
* Now add the backup block headers and data into the CRC
*/
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! if (dtbuf_bkp[i])
! {
! BkpBlock *bkpb = &(dtbuf_xlg[i]);
! char *page;
! COMP_CRC32(rdata_crc,
! (char *) bkpb,
! sizeof(BkpBlock));
! page = (char *) BufferGetBlock(dtbuf[i]);
! if (bkpb->hole_length == 0)
! {
! COMP_CRC32(rdata_crc,
! page,
! BLCKSZ);
! }
! else
! {
! /* must skip the hole */
! COMP_CRC32(rdata_crc,
! page,
! bkpb->hole_offset);
COMP_CRC32(rdata_crc,
! page + (bkpb->hole_offset + bkpb->hole_length),
! BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! }
}
}
--- 665,740 ----
/*
* Now add the backup block headers and data into the CRC
+ * Also make a separate chain of entries for the backup blocks.
+ * Once we know we do not need to repeat the process due to races,
+ * the two chains are stitched together so that we don't need
+ * to special-case them in the write loop. At the exit of this
+ * loop, write_len includes the backup block data.
+ *
+ * Also set the appropriate info bits to show which buffers were backed
+ * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
+ * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
+ * Doing this here is safe, for the same reason setting rdt->data = NULL
+ * is safe
*/
+ rdt2 = &dummy_node;
+ dummy_node.next=NULL;
+ write_len = len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! BkpBlock *bkpb;
! char *page;
! if (!dtbuf_bkp[i]) continue;
!
! info |= XLR_SET_BKP_BLOCK(i);
! bkpb = &(dtbuf_xlg[i]);
!
! COMP_CRC32(rdata_crc,
! (char *) bkpb,
! sizeof(BkpBlock));
! rdt2->next = &(dtbuf_rdt1[i]);
! rdt2 = rdt2->next;
!
! rdt2->data = (char *) bkpb;
! rdt2->len = sizeof(BkpBlock);
! write_len += sizeof(BkpBlock);
!
! rdt2->next = &(dtbuf_rdt2[i]);
! rdt2 = rdt2->next;
!
! page = (char *) BufferGetBlock(dtbuf[i]);
!
! if (bkpb->hole_length == 0)
! {
COMP_CRC32(rdata_crc,
! page,
! BLCKSZ);
! rdt2->data = page;
! rdt2->len = BLCKSZ;
! write_len += BLCKSZ;
! rdt2->next = NULL;
! }
! else
! {
! /* must skip the hole */
! COMP_CRC32(rdata_crc,
! page,
! bkpb->hole_offset);
! COMP_CRC32(rdata_crc,
! page + (bkpb->hole_offset + bkpb->hole_length),
! BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! rdt2->data = page;
! rdt2->len = bkpb->hole_offset;
! write_len += bkpb->hole_offset;
!
! rdt2->next = &(dtbuf_rdt3[i]);
! rdt2 = rdt2->next;
!
! rdt2->data = page + (bkpb->hole_offset + bkpb->hole_length);
! rdt2->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! write_len += rdt2->len;
! rdt2->next = NULL;
}
}
***************
*** 758,820 ****
goto begin;
}
! /*
! * Make additional rdata chain entries for the backup blocks, so that we
! * don't need to special-case them in the write loop. Note that we have
! * now irrevocably changed the input rdata chain. At the exit of this
! * loop, write_len includes the backup block data.
! *
! * Also set the appropriate info bits to show which buffers were backed
! * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
! * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
! */
! write_len = len;
! for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! {
! BkpBlock *bkpb;
! char *page;
!
! if (!dtbuf_bkp[i])
! continue;
!
! info |= XLR_SET_BKP_BLOCK(i);
!
! bkpb = &(dtbuf_xlg[i]);
! page = (char *) BufferGetBlock(dtbuf[i]);
!
! rdt->next = &(dtbuf_rdt1[i]);
! rdt = rdt->next;
!
! rdt->data = (char *) bkpb;
! rdt->len = sizeof(BkpBlock);
! write_len += sizeof(BkpBlock);
!
! rdt->next = &(dtbuf_rdt2[i]);
! rdt = rdt->next;
!
! if (bkpb->hole_length == 0)
! {
! rdt->data = page;
! rdt->len = BLCKSZ;
! write_len += BLCKSZ;
! rdt->next = NULL;
! }
! else
! {
! /* must skip the hole */
! rdt->data = page;
! rdt->len = bkpb->hole_offset;
! write_len += bkpb->hole_offset;
!
! rdt->next = &(dtbuf_rdt3[i]);
! rdt = rdt->next;
!
! rdt->data = page + (bkpb->hole_offset + bkpb->hole_length);
! rdt->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! write_len += rdt->len;
! rdt->next = NULL;
! }
! }
/*
* If we backed up any full blocks and online backup is not in progress,
--- 802,809 ----
goto begin;
}
! /* Stitch the two chains together. We have now irrevocably changed the rdata chain */
! rdt->next = dummy_node.next;
/*
* If we backed up any full blocks and online backup is not in progress,
A little bit of a reply to Jeff's email about WALInsertLock.
This patch instruments LWLocks, it is controlled with the following
#define's in lwlock.c:
LWLOCK_STATS
LWLOCK_TIMING_STATS
It is an upgrade of current lwlocks stats.
When active, at backend exit, it will display stats as shown below (here,
we have a parallel COPY with 4 concurrent processes into the same table,
on a 4 core machine).
If the (rather wide) sample output is mangled in your mail client, I've
attached it as a separate text file.
-------- Lock stats for PID 22403
    PID  Lock      ShAcq  ShWait  ShWaitT  ShHeldT      ExAcq  ExWait  ExWaitT            ExHeldT           Name
  22403     7          0       0     0.00     0.00    2500002  730338  24.02 ( 53.49 %)   7.25 ( 16.14 %)   WALInsert
  22403     8          0       0     0.00     0.00      19501      73   3.48 (  7.75 %)   0.40 (  0.88 %)   WALWrite
-------- Lock stats for PID 22404
    PID  Lock      ShAcq  ShWait  ShWaitT  ShHeldT      ExAcq  ExWait  ExWaitT            ExHeldT           Name
  22404     7          0       0     0.00     0.00    2500002  724683  23.34 ( 51.59 %)   8.24 ( 18.20 %)   WALInsert
  22404     8          0       0     0.00     0.00      19418      90   4.37 (  9.67 %)   0.44 (  0.97 %)   WALWrite
-------- Lock stats for PID 22402
    PID  Lock      ShAcq  ShWait  ShWaitT  ShHeldT      ExAcq  ExWait  ExWaitT            ExHeldT           Name
  22402     7          0       0     0.00     0.00    2500002  735958  24.06 ( 52.73 %)   8.05 ( 17.63 %)   WALInsert
  22402     8          0       0     0.00     0.00      19154      97   4.21 (  9.22 %)   0.39 (  0.85 %)   WALWrite
-------- Lock stats for PID 22400
    PID  Lock      ShAcq  ShWait  ShWaitT  ShHeldT      ExAcq  ExWait  ExWaitT            ExHeldT           Name
  22400     7          0       0     0.00     0.00    2500002  736265  25.50 ( 55.59 %)   6.74 ( 14.70 %)   WALInsert
  22400     8          0       0     0.00     0.00      19391      66   2.95 (  6.42 %)   0.39 (  0.85 %)   WALWrite
Here we see that PID 22400 spent :
25.50 s waiting to get exclusive on WALInsert
6.74 s while holding exclusive on WALInsert
The percentages represent the fraction of time relative to the backend
process' lifetime.
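(Sanity-checking those numbers with my own arithmetic, not the report's:
25.50 s of exclusive wait at 55.59 % implies a process lifetime of roughly
25.50 / 0.5559 ≈ 45.9 s, so waiting on plus holding WALInsert together
account for about 70 % of that backend's life.)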
Here, I've exited the processes right after committing the transactions,
but if you use psql and want accurate percentages, you'll need to exit
quickly after the query you want to benchmark.
Here, for example, backends spend more than 50% of their time waiting on
WALInsert...
Attachments:
lwlock_instrumentation.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f2ccbe1..729c6d8 100644
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
***************
*** 29,34 ****
--- 29,55 ----
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/spin.h"
+ #include "portability/instr_time.h"
+
+ /* displays counters (number of lock acquires, etc)
+ * on process exit */
+ #define LWLOCK_STATS
+
+ #ifdef LWLOCK_STATS
+
+ /* also displays cumulative timings of time spent
+ * waiting and holding locks
+ * (depends on LWLOCK_STATS)
+ */
+ #define LWLOCK_TIMING_STATS
+
+ /* displays only locks which were waited on more than
+ * this amount of seconds (set to 0 to display all)
+ * (depends on LWLOCK_TIMING_STATS)
+ */
+ #define LWLOCKS_STATS_TIME_DISPLAY_THRESHOLD 0.1
+ #endif /* LWLOCK_STATS */
+
/* We use the ShmemLock spinlock to protect LWLockAssign */
*************** static LWLockId held_lwlocks[MAX_SIMUL_L
*** 88,100 ****
static int lock_addin_request = 0;
static bool lock_addin_request_allowed = true;
- #ifdef LWLOCK_STATS
- static int counts_for_pid = 0;
- static int *sh_acquire_counts;
- static int *ex_acquire_counts;
- static int *block_counts;
- #endif
-
#ifdef LOCK_DEBUG
bool Trace_lwlocks = false;
--- 109,114 ----
*************** LOG_LWDEBUG(const char *where, LWLockId
*** 121,146 ****
#ifdef LWLOCK_STATS
static void
print_lwlock_stats(int code, Datum arg)
{
! int i;
int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
int numLocks = LWLockCounter[1];
/* Grab an LWLock to keep different backends from mixing reports */
LWLockAcquire(0, LW_EXCLUSIVE);
for (i = 0; i < numLocks; i++)
{
! if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
! fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
! MyProcPid, i, sh_acquire_counts[i], ex_acquire_counts[i],
! block_counts[i]);
}
LWLockRelease(0);
}
#endif /* LWLOCK_STATS */
--- 135,282 ----
#ifdef LWLOCK_STATS
+ typedef struct
+ {
+ /* Process-local Counters for the number of times the lock has been ... */
+ int count_acquired_share; /* acquired in SHARE mode */
+ int count_waited_share; /* and it needed to wait */
+
+ int count_acquired_exclusive; /* acquired in EXCLUSIVE mode */
+ int count_waited_exclusive; /* and it needed to wait */
+
+ #ifdef LWLOCK_TIMING_STATS
+ /* Process local sum of time during which ... */
+ instr_time time_waited_share; /* we waited to get a SHARE lock */
+ instr_time time_held_share; /* we held the SHARE lock */
+ instr_time time_waited_exclusive; /* we waited to get an EXCLUSIVE lock */
+ instr_time time_held_exclusive; /* we held the EXCLUSIVE lock */
+
+ instr_time ta, th; /* temp variables */
+ #endif
+ } LWLockStatsData;
+
+ static int counts_for_pid = 0;
+ static LWLockStatsData *lwlock_stats;
+ #ifdef LWLOCK_TIMING_STATS
+ static instr_time lwlock_stats_begin_time;
+ #endif
+
+ static const char* lwlock_stats_lock_name( LWLockId id )
+ {
+ static const char * names[] = {
+ "BufFreelist",
+ "ShmemIndex",
+ "OidGen",
+ "XidGen",
+ "ProcArray",
+ "SInvalRead",
+ "SInvalWrite",
+ "WALInsert",
+ "WALWrite",
+ "ControlFile",
+ "Checkpoint",
+ "CLogControl",
+ "SubtransControl",
+ "MultiXactGen",
+ "MultiXactOffsetControl",
+ "MultiXactMemberControl",
+ "RelCacheInit",
+ "BgWriterComm",
+ "TwoPhaseState",
+ "TablespaceCreate",
+ "BtreeVacuum",
+ "AddinShmemInit",
+ "Autovacuum",
+ "AutovacuumSchedule",
+ "SyncScan" };
+
+ if( id >= 0 && id < (sizeof( names )/sizeof( names[0] )))
+ return names[id];
+ else
+ return "";
+ }
+
static void
print_lwlock_stats(int code, Datum arg)
{
! int i, lines=0;
int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
int numLocks = LWLockCounter[1];
+ #ifdef LWLOCK_TIMING_STATS
+ instr_time end_time;
+ double total_percents;
+ INSTR_TIME_SET_CURRENT( end_time );
+ INSTR_TIME_SUBTRACT( end_time, lwlock_stats_begin_time );
+ total_percents = 100.0 / INSTR_TIME_GET_DOUBLE( end_time );
+ #endif
+ fprintf( stderr, "\n-------- Lock stats for PID %d\n", MyProcPid );
+
/* Grab an LWLock to keep different backends from mixing reports */
LWLockAcquire(0, LW_EXCLUSIVE);
for (i = 0; i < numLocks; i++)
{
! LWLockStatsData *p = &lwlock_stats[i];
! #ifdef LWLOCK_TIMING_STATS
! double time_waited_share = INSTR_TIME_GET_DOUBLE(p->time_waited_share );
! double time_held_share = INSTR_TIME_GET_DOUBLE(p->time_held_share );
! double time_waited_exclusive = INSTR_TIME_GET_DOUBLE(p->time_waited_exclusive);
! double time_held_exclusive = INSTR_TIME_GET_DOUBLE(p->time_held_exclusive );
! #endif
!
! if( !( p->count_waited_share || p->count_waited_exclusive ))
! continue;
!
! #ifdef LWLOCK_TIMING_STATS
! if( time_waited_share < LWLOCKS_STATS_TIME_DISPLAY_THRESHOLD
! && time_waited_exclusive < LWLOCKS_STATS_TIME_DISPLAY_THRESHOLD )
! continue;
! #endif
!
! #ifdef LWLOCK_TIMING_STATS
! if( !(lines++ & 15) )
! fprintf( stderr, " PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq "
! "ExWait ExWaitT ExHeldT Name\n" );
!
! fprintf( stderr,
! "%7d %7d %10d %10d %11.02f %11.02f %10d %10d %11.02f (%6.02f %%) %11.02f (%6.02f %%) %s\n",
! MyProcPid, i,
! p->count_acquired_share, p->count_waited_share, time_waited_share, time_held_share,
! p->count_acquired_exclusive, p->count_waited_exclusive,
! time_waited_exclusive, time_waited_exclusive * total_percents,
! time_held_exclusive, time_held_exclusive * total_percents,
! lwlock_stats_lock_name( i ));
! #else
! if( !(lines++ & 15) )
! fprintf( stderr, " PID Lock ShAcq ShWait ExAcq ExWait Name\n" );
!
! fprintf( stderr,
! "%7d %7d %10d %10d %10d %10d %s\n",
! MyProcPid, i,
! p->count_acquired_share, p->count_waited_share,
! p->count_acquired_exclusive, p->count_waited_exclusive,
! lwlock_stats_lock_name( i ));
! #endif
}
LWLockRelease(0);
}
+
+ static void lock_stats_setup( void )
+ {
+ int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
+ int numLocks = LWLockCounter[1];
+
+ lwlock_stats = calloc(numLocks, sizeof(LWLockStatsData));
+ counts_for_pid = MyProcPid;
+
+ #ifdef LWLOCK_TIMING_STATS
+ INSTR_TIME_SET_CURRENT( lwlock_stats_begin_time );
+ #endif
+
+ on_shmem_exit(print_lwlock_stats, 0);
+ }
#endif /* LWLOCK_STATS */
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 321,341 ****
#ifdef LWLOCK_STATS
/* Set up local count state first time through in a given process */
if (counts_for_pid != MyProcPid)
! {
! int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
! int numLocks = LWLockCounter[1];
!
! sh_acquire_counts = calloc(numLocks, sizeof(int));
! ex_acquire_counts = calloc(numLocks, sizeof(int));
! block_counts = calloc(numLocks, sizeof(int));
! counts_for_pid = MyProcPid;
! on_shmem_exit(print_lwlock_stats, 0);
! }
/* Count lock acquisition attempts */
! if (mode == LW_EXCLUSIVE)
! ex_acquire_counts[lockid]++;
! else
! sh_acquire_counts[lockid]++;
#endif /* LWLOCK_STATS */
/*
--- 457,471 ----
#ifdef LWLOCK_STATS
/* Set up local count state first time through in a given process */
if (counts_for_pid != MyProcPid)
! lock_stats_setup();
!
/* Count lock acquisition attempts */
! if (mode == LW_EXCLUSIVE) lwlock_stats[lockid].count_acquired_exclusive++;
! else lwlock_stats[lockid].count_acquired_share++;
! #ifdef LWLOCK_TIMING_STATS
! /* record start of wait time and lock mode */
! INSTR_TIME_SET_CURRENT( lwlock_stats[lockid].ta );
! #endif
#endif /* LWLOCK_STATS */
/*
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 445,451 ****
LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
#ifdef LWLOCK_STATS
! block_counts[lockid]++;
#endif
TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
--- 575,583 ----
LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
#ifdef LWLOCK_STATS
! /* count lock waits */
! if (mode == LW_EXCLUSIVE) lwlock_stats[lockid].count_waited_exclusive++;
! else lwlock_stats[lockid].count_waited_share++;
#endif
TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 480,485 ****
--- 612,625 ----
*/
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
+
+ #ifdef LWLOCK_TIMING_STATS
+ /* record end of wait time and start of hold time */
+ INSTR_TIME_SET_CURRENT( lwlock_stats[lockid].th );
+
+ if (mode == LW_EXCLUSIVE) INSTR_TIME_ACCUM_DIFF( lwlock_stats[lockid].time_waited_exclusive, lwlock_stats[lockid].th, lwlock_stats[lockid].ta );
+ else INSTR_TIME_ACCUM_DIFF( lwlock_stats[lockid].time_waited_share, lwlock_stats[lockid].th, lwlock_stats[lockid].ta );
+ #endif
}
/*
*************** LWLockConditionalAcquire(LWLockId lockid
*** 548,553 ****
--- 688,707 ----
/* Add lock to list of locks held by this backend */
held_lwlocks[num_held_lwlocks++] = lockid;
TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
+
+ #ifdef LWLOCK_STATS
+ /* Set up local count state first time through in a given process */
+ if (counts_for_pid != MyProcPid)
+ lock_stats_setup();
+
+ /* Count lock acquisition attempts */
+ if (mode == LW_EXCLUSIVE) lwlock_stats[lockid].count_acquired_exclusive++;
+ else lwlock_stats[lockid].count_acquired_share++;
+ #ifdef LWLOCK_TIMING_STATS
+ /* record start of hold time */
+ INSTR_TIME_SET_CURRENT( lwlock_stats[lockid].th );
+ #endif
+ #endif /* LWLOCK_STATS */
}
return !mustwait;
*************** LWLockRelease(LWLockId lockid)
*** 563,568 ****
--- 717,725 ----
PGPROC *head;
PGPROC *proc;
int i;
+ #ifdef LWLOCK_TIMING_STATS
+ bool was_exclusive = false;
+ #endif
PRINT_LWDEBUG("LWLockRelease", lockid, lock);
*************** LWLockRelease(LWLockId lockid)
*** 586,592 ****
--- 743,754 ----
/* Release my hold on lock */
if (lock->exclusive > 0)
+ {
lock->exclusive--;
+ #ifdef LWLOCK_TIMING_STATS
+ was_exclusive = true;
+ #endif
+ }
else
{
Assert(lock->shared > 0);
*************** LWLockRelease(LWLockId lockid)
*** 651,656 ****
--- 813,825 ----
* Now okay to allow cancel/die interrupts.
*/
RESUME_INTERRUPTS();
+ #ifdef LWLOCK_TIMING_STATS
+ /* record end of hold time */
+ INSTR_TIME_SET_CURRENT( lwlock_stats[lockid].ta );
+
+ if (was_exclusive) INSTR_TIME_ACCUM_DIFF( lwlock_stats[lockid].time_held_exclusive, lwlock_stats[lockid].ta, lwlock_stats[lockid].th );
+ else INSTR_TIME_ACCUM_DIFF( lwlock_stats[lockid].time_held_share, lwlock_stats[lockid].ta, lwlock_stats[lockid].th );
+ #endif
}
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>
A little bit of a reply to Jeff's email about WALInsertLock.
This patch instruments LWLocks, it is controlled with the following
#define's in lwlock.c: LWLOCK_STATS
LWLOCK_TIMING_STATS
It is an upgrade of current lwlocks stats.
Hi Pierre,
Have you looked at the total execution time with and without the
LWLOCK_TIMING_STATS?
I've implemented something similar to this myself (only without attempting
to make it portable and otherwise worthy of submitting as a general-interest
patch). What I found is that attempting to time every "hold"
substantially increased the overall run time (which I would worry distorts
the reported times; cue bad Heisenberg analogies). The problem is that
gettimeofday is slow, and on some multi-processor systems it is a global
point of serialization, making it even slower. I decided to time only the
time spent waiting on a block, and not the time spent holding the lock.
This way you only call gettimeofday twice if you actually need to block, and
not at all if you immediately get the lock. This had a much smaller effect
on runtime, and the info produced was sufficient for my purposes.
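In outline, that cheaper scheme looks like this (a sketch using the same
instr_time macros as Pierre's patch; wait_time[] is a hypothetical per-lock
accumulator, and mustwait is the existing flag in LWLockAcquire):

/* Sketch: clock calls only on the slow path. */
if (mustwait)
{
    instr_time  start_t,
                end_t;

    INSTR_TIME_SET_CURRENT(start_t);    /* first gettimeofday */
    /* ... sleep on the semaphore until the lock is granted ... */
    INSTR_TIME_SET_CURRENT(end_t);      /* second gettimeofday */
    INSTR_TIME_ACCUM_DIFF(wait_time[lockid], end_t, start_t);
}
/* the uncontended path never touches the clock */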
Not that this changes your conclusion. With or without that distortion I
completely believe that WALInsertLock is the bottleneck of parallel bulk
copy into unindexed tables; I just can't find anything else for which it is
the primary bottleneck. I think the only real solution for bulk copy is to
call XLogInsert less often. For example, it could build blocks in local
memory, then when done copy them into the shared buffers and toss the entire
block into WAL in one call. Easier said than implemented, of course.
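Very roughly, and only as a sketch of the shape (fill_page_with_tuples() is
an invented placeholder, and rel/blkno are assumed to be in scope), something
akin to the log_newpage() path that index builds already use:

/* Hypothetical sketch: batch tuples into a backend-local page, then
 * WAL-log each finished page in a single record. */
Page        local_page = (Page) palloc(BLCKSZ);

PageInit(local_page, BLCKSZ, 0);
while (fill_page_with_tuples(local_page))    /* invented helper */
{
    /* ... copy the finished page into shared buffers ... */
    log_newpage(&rel->rd_node, MAIN_FORKNUM, blkno++, local_page);
    PageInit(local_page, BLCKSZ, 0);         /* start the next page */
}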
Cheers,
Jeff
Have you looked at the total execution time with and without the
LWLOCK_TIMING_STATS?
It didn't show any significant overhead on the little COPY test I made.
On selects, it probably does (just like EXPLAIN ANALYZE), but I didn't
test.
It is not meant to be always active; it's a #define, so I guess it would
be OK though.
I'm going to modify it according to your suggestions and repost it (why
didn't I do that first?...)
Not that this changes your conclusion. With or without that distortion I
completely believe that WALInsertLock is the bottleneck of parallel bulk
copy into unindexed tables; I just can't find anything else for which it is
the primary bottleneck. I think the only real solution for bulk copy is to
call XLogInsert less often. For example, it could build blocks in local
memory, then when done copy them into the shared buffers and toss the entire
block into WAL in one call. Easier said than implemented, of course.
Actually,
http://archives.postgresql.org/pgsql-hackers/2009-09/msg00806.php
On Sun, Sep 13, 2009 at 10:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
I've attached a patch which removes the iteration over the blocks to be
backed-up from the critical section of XLogInsert. Now those blocks are
only looped over in one piece of code which both computes the CRC and builds
the linked list, rather than having parallel loops.
OK, I started to review this one...
What the patch is doing seems very obvious and doesn't break the
original logic AFAIUI (I could be wrong, this seems like Chinese to me
;)
I made some tests archiving WAL and recovering... crashing the server
and restarting... every test run included the regression tests.
Don't know what else to do; ideas?
I haven't made any performance tests, but it should gain something :),
maybe someone can make those tests?
If not, I will mark the patch as ready for committer some time in the
next hours...
--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
I haven't made any performance tests, but it should gain something :),
maybe someone can make those tests?
The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.
regards, tom lane
Tom Lane wrote:
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
I haven't made any performance tests, but it should gain something :),
maybe someone can make those tests?
The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.
I've done a little review of this myself, and I'm not quite happy with how this patch was delivered to us. The bar for committing something that touches the WAL is really high--it needs to be an unquestionable win to touch that code. The justification of "the patch makes the overall code a bit cleaner" is a hard sell on something that's hard to debug (triggering bad WAL situations at will isn't easy) and critical to the system. If there's a clear performance improvement, that helps justify why it's worth working on. Here's the original performance justification:
"Using the only XLogInsert-bound test case I know of, parallel COPY into a skinny, unindexed table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync turned off (to approxiamately simulate SSD, which I do not have), I get a speed improvement of 2-4% with the patch over unpatched head."
That makes sense, and using this spec I could probably come up with the test program to reproduce this. But I'm getting tired of doing that. It's hard enough to reproduce performance changes when someone gives the exact configuration and test program they used. If we're working with a verbal spec for how to reproduce the issues, forget about it--that's more than we can expect a reviewer to handle, and the odds of that whole thing ending well are low.
Jeff: before we do anything else with your patch, I'd like to see a script of some sort that runs the test you describe above, everything changed in the postgresql.conf from the defaults, and the resulting raw numbers that come from it on your system that prove an improvement there--not just a summary of the change. That's really mandatory for a performance patch. If any reviewer who's interested can't just run something and get a report suggesting whether the patch helped or harmed results in five minutes, unless we really, really want your patch it's just going to stall at that point. And unfortunately, in the case of something that touches the WAL path, we really don't want to change anything unless there's a quite good reason to do so.
I've also realized that "Patch LWlocks instrumentation" at http://archives.postgresql.org/message-id/op.uz8sfkxycke6l8@soyouz should have been evaluated as its own patch altogether. I think that the test program you're suggesting also proves its utility though, so for now I'll keep them roped together.
Sorry this ended up so late in this CommitFest, just a series of unexpected stuff rippled down to you. On the bright side, had you submitted this before the whole organized CF process started, you could have waited months longer to get the same feedback.
--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
so I'd like some independent confirmation that it does.
what kind of tests could show that? or simply running pgbench several
times for 15 minutes each run could show any benefit?
--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova escribió:
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
so I'd like some independent confirmation that it does.
what kind of tests could show that? or simply running pgbench several
times for 15 minutes each run could show any benefit?
Isn't the supposed performance improvement in this patch linked strongly
to backup blocks? If that's really the case, I think it would show more
prominently if you had very frequent checkpoints.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Thu, Dec 10, 2009 at 3:58 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Jaime Casanova escribió:
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
so I'd like some independent confirmation that it does.
what kind of tests could show that? or simply running pgbench several
times for 15 minutes each run could show any benefit?
Isn't the supposed performance improvement in this patch linked strongly
to backup blocks? If that's really the case, I think it would show more
prominently if you had very frequent checkpoints.
Ah! OK, I was only following the logic that it was eliminating the
need to execute a loop twice...
But you are right: while the loop always executes, it only does something
meaningful after a checkpoint, and the for statement makes at most 3
iterations, because XLR_MAX_BKP_BLOCKS is defined as 3 in
src/include/access/xlog.h.
Looked at that way, it seems like the benefit could be only marginal.
To prove that, I compiled with and without the patch, and changed
checkpoint_segments = 1 and checkpoint_timeout = 1min to force
frequent checkpoints (actually they occurred a few seconds apart).
I initialized the pgbench database in each installation with:
pgbench -i -s 200 -F 90 test
and executed 6 times with:
pgbench -n -c 50 -j 5 -l -T 900 test
Results are (tps):

                 Including connections establishing    Excluding it
Min  Unpatched   133.046069                            133.085274
     Patched     139.567284                            139.591229
Max  Unpatched   147.082181                            147.108752
     Patched     151.276416                            151.311745
Avg  Unpatched   140.750998                            140.790336
     Patched     146.383735                            146.411039
So in this extreme case the average is just about 6 tps better.
PS: I'm attaching the files I used for the tests.
--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova wrote:
So in this extreme case avg tps is just 6 transactions better
Great job trying to find the spot where the code worked better. I'm not
so sure I trust pgbench results where the TPS was so low, though. Which
leads us right back to exactly how Jeff measured his original results.
As I said already, I think we need more insight into Jeff's performance
report, a way to replicate that test, to look a bit at the latency as
reported by the updated LWLock patch that Pierre submitted. Tweaking
your test to give more useful results is a nice second opinion on top of
that. But we're out of time for now, so this patch is getting returned
with feedback. I encourage Jeff to resubmit the same patch or a better
one with a little more data on performance measurements to our final 8.5
CommitFest, in hopes we can confirm this is an improvement worth committing.
--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
2009/12/15 Greg Smith <greg@2ndquadrant.com>
Jaime Casanova wrote:
So in this extreme case the average is just about 6 tps better.
Great job trying to find the spot where the code worked better. I'm not so
sure I trust pgbench results where the TPS was so low, though. Which leads
us right back to exactly how Jeff measured his original results.
As I said already, I think we need more insight into Jeff's performance
report, a way to replicate that test, to look a bit at the latency as
reported by the updated LWLock patch that Pierre submitted. Tweaking your
test to give more useful results is a nice second opinion on top of that.
But we're out of time for now, so this patch is getting returned with
feedback. I encourage Jeff to resubmit the same patch or a better one with
a little more data on performance measurements to our final 8.5 CommitFest,
in hopes we can confirm this is an improvement worth committing.
Last week I worked on a FUSE-based filesystem, which I call BlackholeFS. It's
similar to /dev/null, but for directories: it simply returns success for all
writes, but doesn't actually write anything to the files under it.
Would moving pg_xlog/ (and possibly the table data too) to such a filesystem
exercise this patch better?
Best regards,
--
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
Twitter: singh_gurjeet
Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Wednesday 16 December 2009 20:07:07 Gurjeet Singh wrote:
2009/12/15 Greg Smith <greg@2ndquadrant.com>
Jaime Casanova wrote:
So in this extreme case the average is just about 6 tps better.
Great job trying to find the spot where the code worked better. I'm not
so sure I trust pgbench results where the TPS was so low, though. Which
leads us right back to exactly how Jeff measured his original results.
As I said already, I think we need more insight into Jeff's performance
report, a way to replicate that test, to look a bit at the latency as
reported by the updated LWLock patch that Pierre submitted. Tweaking
your test to give more useful results is a nice second opinion on top of
that. But we're out of time for now, so this patch is getting returned
with feedback. I encourage Jeff to resubmit the same patch or a better
one with a little more data on performance measurements to our final 8.5
CommitFest, in hopes we can confirm this is an improvement worth committing.
Last week I worked on a FUSE-based filesystem, which I call BlackholeFS.
It's similar to /dev/null, but for directories: it simply returns success
for all writes, but doesn't actually write anything to the files under it.
I doubt that it will be faster than a tmpfs - the additional context switches
et al will probably already hurt.
If you constrain checkpoint_segments to something sensible, it shouldn't use
too much memory.
Andres