pg_stat_lwlocks view - lwlocks statistics

Started by Satoshi Nagayasu over 13 years ago · 30 messages
#1Satoshi Nagayasu
snaga@uptime.jp
1 attachment(s)

Hi all,

I've been working on a new system view, pg_stat_lwlocks, to observe
LWLock, and just completed my 'proof-of-concept' code that can work
with version 9.1.

Now, I'd like to know the possibility of this feature for future
release.

With this patch, DBAs can easily identify bottlenecks around lwlocks.
--------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#
--------------------------------------------------

In this view,
'lwlockid' represents the LWLockId used in the backends.
'calls' represents how many times LWLockAcquire() was called.
'waits' represents how many times LWLockAcquire() had to wait
before acquiring the lock.
'time_ms' represents the total time, in milliseconds, that
LWLockAcquire() spent waiting on the lwlock.

And lwlocks that span a range of LWLockIds, such as the
BufMappingLock or LockMgrLock partitions, are grouped and summed
up into a single record. For example, lwlockid 49 in the above view
represents the LockMgrLock statistics.

Now, I know there are some considerations.

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio
was >99.9%.

(2) Memory space

In this patch, I added three new members, as uint64, to the LWLock
structure to collect statistics.

This means those members must be held in shared memory, but I'm
not sure whether that's appropriate.

I think another possible option is to hold those statistics values
in local (backend) process memory, and send them through the stats
collector process (like other statistics values).

(3) LWLock names (or labels)

Now, the pg_stat_lwlocks view shows the LWLockId itself, but a raw
LWLockId makes it hard for a DBA to determine the actual lock type.

So, I want to show LWLock names (or labels), like 'WALWriteLock'
or 'LockMgrLock', but how should I implement that?

Any comments?

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachments:

pg_stat_lwlocks_91.diff (text/plain; charset=Shift_JIS)
diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.bki postgresql-9.1.2/src/backend/catalog/postgres.bki
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.bki	2012-06-20 03:32:46.000000000 +0900
--- postgresql-9.1.2/src/backend/catalog/postgres.bki	2012-06-26 01:51:52.000000000 +0900
***************
*** 1553,1558 ****
--- 1553,1559 ----
  insert OID = 3071 ( pg_xlog_replay_pause 11 10 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_xlog_replay_pause _null_ _null_ _null_ )
  insert OID = 3072 ( pg_xlog_replay_resume 11 10 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_xlog_replay_resume _null_ _null_ _null_ )
  insert OID = 3073 ( pg_is_xlog_replay_paused 11 10 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_xlog_replay_paused _null_ _null_ _null_ )
+ insert OID = 3764 ( pg_stat_get_lwlocks 11 10 12 1 100 0 f f f f t s 0 0 2249 "" "{20,20,20,20}" "{o,o,o,o}" "{lwlockid,calls,waits,time_ms}" _null_ pg_stat_get_lwlocks _null_ _null_ _null_ )
  insert OID = 2621 ( pg_reload_conf 11 10 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ )
  insert OID = 2622 ( pg_rotate_logfile 11 10 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ )
  insert OID = 2623 ( pg_stat_file 11 10 12 1 0 0 f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ pg_stat_file _null_ _null_ _null_ )
diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.description postgresql-9.1.2/src/backend/catalog/postgres.description
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.description	2012-06-20 03:32:46.000000000 +0900
--- postgresql-9.1.2/src/backend/catalog/postgres.description	2012-06-26 01:51:52.000000000 +0900
***************
*** 947,952 ****
--- 947,953 ----
  3071	pg_proc	0	pause xlog replay
  3072	pg_proc	0	resume xlog replay, if it was paused
  3073	pg_proc	0	true if xlog replay is paused
+ 3764	pg_proc	0	light-weight lock statistics
  2621	pg_proc	0	reload configuration files
  2622	pg_proc	0	rotate log file
  2623	pg_proc	0	get information about file
diff -rc postgresql-9.1.2.orig/src/backend/catalog/system_views.sql postgresql-9.1.2/src/backend/catalog/system_views.sql
*** postgresql-9.1.2.orig/src/backend/catalog/system_views.sql	2012-06-20 03:32:46.000000000 +0900
--- postgresql-9.1.2/src/backend/catalog/system_views.sql	2012-06-26 03:35:59.000000000 +0900
***************
*** 594,599 ****
--- 594,607 ----
          pg_stat_get_buf_alloc() AS buffers_alloc,
          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
  
+ CREATE VIEW pg_stat_lwlocks AS
+     SELECT
+             S.lwlockid,
+             S.calls,
+             S.waits,
+             S.time_ms
+     FROM pg_stat_get_lwlocks() AS S;
+ 
  CREATE VIEW pg_user_mappings AS
      SELECT
          U.oid       AS umid,
diff -rc postgresql-9.1.2.orig/src/backend/storage/lmgr/lwlock.c postgresql-9.1.2/src/backend/storage/lmgr/lwlock.c
*** postgresql-9.1.2.orig/src/backend/storage/lmgr/lwlock.c	2012-06-20 03:32:46.000000000 +0900
--- postgresql-9.1.2/src/backend/storage/lmgr/lwlock.c	2012-06-26 01:57:28.000000000 +0900
***************
*** 32,37 ****
--- 32,38 ----
  #include "storage/proc.h"
  #include "storage/spin.h"
  
+ #include <sys/time.h>
  
  /* We use the ShmemLock spinlock to protect LWLockAssign */
  extern slock_t *ShmemLock;
***************
*** 46,51 ****
--- 47,57 ----
  	PGPROC	   *head;			/* head of list of waiting PGPROCs */
  	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
  	/* tail is undefined when head is NULL */
+ 
+ 	/* statistics stuff */
+ 	uint64		calls;
+ 	uint64		waits;
+ 	uint64		time_ms;
  } LWLock;
  
  /*
***************
*** 268,273 ****
--- 274,282 ----
  		lock->lock.shared = 0;
  		lock->lock.head = NULL;
  		lock->lock.tail = NULL;
+ 		lock->lock.calls = 0;
+ 		lock->lock.waits = 0;
+ 		lock->lock.time_ms = 0;
  	}
  
  	/*
***************
*** 323,330 ****
--- 332,341 ----
  	PGPROC	   *proc = MyProc;
  	bool		retry = false;
  	int			extraWaits = 0;
+ 	struct timeval		t1,t2;
  
  	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
+ 	lock->calls++;
  
  #ifdef LWLOCK_STATS
  	/* Set up local count state first time through in a given process */
***************
*** 457,462 ****
--- 468,475 ----
  #endif
  
  		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+ 		lock->waits++;
+ 		gettimeofday(&t1, NULL);
  
  		for (;;)
  		{
***************
*** 468,473 ****
--- 481,488 ----
  		}
  
  		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+ 		gettimeofday(&t2, NULL);
+ 		lock->time_ms += (t2.tv_sec-t1.tv_sec)*1000 + (t2.tv_usec-t1.tv_usec)/1000;
  
  		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
  
***************
*** 701,703 ****
--- 716,752 ----
  	}
  	return false;
  }
+ 
+ void
+ lwlock_get_stat(LWLockId lockid, uint64 *calls, uint64 *waits, uint64 *time_ms)
+ {
+ 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ 
+ 	SpinLockAcquire(&lock->mutex);
+ 	*calls   = lock->calls;
+ 	*waits   = lock->waits;
+ 	*time_ms = lock->time_ms;
+ 	SpinLockRelease(&lock->mutex);
+ }
+ 
+ void
+ lwlock_get_stat_sum(LWLockId start, LWLockId end, uint64 *calls, uint64 *waits, uint64 *time_ms)
+ {
+ 	LWLockId lockid;
+ 
+ 	*calls   = 0;
+ 	*waits   = 0;
+ 	*time_ms = 0;
+ 
+ 	for (lockid=start ; lockid<=end ; lockid++)
+ 	{
+ 		uint64 calls2, waits2, time_ms2;
+ 
+ 		lwlock_get_stat(lockid, &calls2, &waits2, &time_ms2);
+ 
+ 		*calls   += calls2;
+ 		*waits   += waits2;
+ 		*time_ms += time_ms2;
+ 	}
+ }
+ 
diff -rc postgresql-9.1.2.orig/src/backend/utils/adt/pgstatfuncs.c postgresql-9.1.2/src/backend/utils/adt/pgstatfuncs.c
*** postgresql-9.1.2.orig/src/backend/utils/adt/pgstatfuncs.c	2012-06-20 03:32:50.000000000 +0900
--- postgresql-9.1.2/src/backend/utils/adt/pgstatfuncs.c	2012-06-26 01:59:17.000000000 +0900
***************
*** 109,114 ****
--- 109,116 ----
  extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
  extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
  
+ extern Datum pg_stat_get_lwlocks(PG_FUNCTION_ARGS);
+ 
  /* Global bgwriter statistics, from bgwriter.c */
  extern PgStat_MsgBgWriter bgwriterStats;
  
***************
*** 1572,1574 ****
--- 1574,1670 ----
  
  	PG_RETURN_VOID();
  }
+ 
+ 
+ Datum
+ pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+ {
+ 	FuncCallContext *funcctx;
+ 
+ 	/* stuff done only on the first call of the function */
+ 	if (SRF_IS_FIRSTCALL())
+ 	{
+ 		MemoryContext oldcontext;
+ 		TupleDesc	tupdesc;
+ 
+ 		/* create a function context for cross-call persistence */
+ 		funcctx = SRF_FIRSTCALL_INIT();
+ 
+ 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ 
+ 		tupdesc = CreateTemplateTupleDesc(4, false);
+ 		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lockid",
+ 						   INT8OID, -1, 0);
+ 		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "calls",
+ 						   INT8OID, -1, 0);
+ 		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "waits",
+ 						   INT8OID, -1, 0);
+ 		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "time_ms",
+ 						   INT8OID, -1, 0);
+ 
+ 		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+ 		funcctx->max_calls = NumLWLocks();
+ 
+ 		MemoryContextSwitchTo(oldcontext);
+ 	}
+ 
+ 	/* stuff done on every call of the function */
+ 	funcctx = SRF_PERCALL_SETUP();
+ 
+ 	if (funcctx->call_cntr < funcctx->max_calls)
+ 	{
+ 		Datum		values[4];
+ 		bool		nulls[4];
+ 		HeapTuple	tuple;
+ 		LWLockId	lockid;
+ 		uint64		calls,waits,time_ms;
+ 
+ 		MemSet(values, 0, sizeof(values));
+ 		MemSet(nulls, 0, sizeof(nulls));
+ 
+ 		lockid = funcctx->call_cntr;
+ 
+ 		if ( lockid<FirstBufMappingLock )
+ 		{
+ 			lwlock_get_stat(lockid, &calls, &waits, &time_ms);
+ 		}
+ 		else if ( FirstBufMappingLock<=lockid && lockid<FirstLockMgrLock )
+ 		{
+ 			lwlock_get_stat_sum(FirstBufMappingLock, FirstLockMgrLock-1,
+ 					    &calls, &waits, &time_ms);
+ 			funcctx->call_cntr = FirstLockMgrLock - 1;
+ 		}
+ 		else if ( FirstLockMgrLock<=lockid && lockid<FirstPredicateLockMgrLock )
+ 		{
+ 			lwlock_get_stat_sum(FirstLockMgrLock, FirstPredicateLockMgrLock-1,
+ 					    &calls, &waits, &time_ms);
+ 			funcctx->call_cntr = FirstPredicateLockMgrLock - 1;
+ 		}
+ 		else if ( FirstPredicateLockMgrLock<=lockid && lockid<NumFixedLWLocks )
+ 		{
+ 			lwlock_get_stat_sum(FirstPredicateLockMgrLock, NumFixedLWLocks-1,
+ 					    &calls, &waits, &time_ms);
+ 			funcctx->call_cntr = NumFixedLWLocks - 1;
+ 		}
+ 		else if ( NumFixedLWLocks<=lockid && lockid<NumLWLocks() )
+ 		{
+ 			lwlock_get_stat_sum(NumFixedLWLocks, NumLWLocks()-1,
+ 					    &calls, &waits, &time_ms);
+ 			funcctx->call_cntr = NumLWLocks() - 1;
+ 		}
+ 
+ 		values[0] = Int64GetDatum(lockid);
+ 		values[1] = Int64GetDatum(calls);
+ 		values[2] = Int64GetDatum(waits);
+ 		values[3] = Int64GetDatum(time_ms);
+ 
+ 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ 
+ 		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ 	}
+ 	else
+ 	{
+ 		SRF_RETURN_DONE(funcctx);
+ 	}
+ }
+ 
diff -rc postgresql-9.1.2.orig/src/backend/utils/fmgroids.h postgresql-9.1.2/src/backend/utils/fmgroids.h
*** postgresql-9.1.2.orig/src/backend/utils/fmgroids.h	2012-06-20 03:32:50.000000000 +0900
--- postgresql-9.1.2/src/backend/utils/fmgroids.h	2012-06-26 01:51:52.000000000 +0900
***************
*** 1977,1982 ****
--- 1977,1983 ----
  #define F_GET_CURRENT_TS_CONFIG 3759
  #define F_TS_MATCH_TT 3760
  #define F_TS_MATCH_TQ 3761
+ #define F_PG_STAT_GET_LWLOCKS 3764
  #define F_PG_TS_TEMPLATE_IS_VISIBLE 3768
  #define F_REGDICTIONARYIN 3771
  #define F_REGDICTIONARYOUT 3772
diff -rc postgresql-9.1.2.orig/src/backend/utils/fmgrtab.c postgresql-9.1.2/src/backend/utils/fmgrtab.c
*** postgresql-9.1.2.orig/src/backend/utils/fmgrtab.c	2012-06-20 03:32:46.000000000 +0900
--- postgresql-9.1.2/src/backend/utils/fmgrtab.c	2012-06-26 01:51:52.000000000 +0900
***************
*** 1966,1971 ****
--- 1966,1972 ----
  extern Datum get_current_ts_config (PG_FUNCTION_ARGS);
  extern Datum ts_match_tt (PG_FUNCTION_ARGS);
  extern Datum ts_match_tq (PG_FUNCTION_ARGS);
+ extern Datum pg_stat_get_lwlocks (PG_FUNCTION_ARGS);
  extern Datum pg_ts_template_is_visible (PG_FUNCTION_ARGS);
  extern Datum regdictionaryin (PG_FUNCTION_ARGS);
  extern Datum regdictionaryout (PG_FUNCTION_ARGS);
***************
*** 4156,4161 ****
--- 4157,4163 ----
    { 3759, "get_current_ts_config", 0, true, false, get_current_ts_config },
    { 3760, "ts_match_tt", 2, true, false, ts_match_tt },
    { 3761, "ts_match_tq", 2, true, false, ts_match_tq },
+   { 3764, "pg_stat_get_lwlocks", 0, false, true, pg_stat_get_lwlocks },
    { 3768, "pg_ts_template_is_visible", 1, true, false, pg_ts_template_is_visible },
    { 3771, "regdictionaryin", 1, true, false, regdictionaryin },
    { 3772, "regdictionaryout", 1, true, false, regdictionaryout },
diff -rc postgresql-9.1.2.orig/src/include/catalog/pg_proc.h postgresql-9.1.2/src/include/catalog/pg_proc.h
*** postgresql-9.1.2.orig/src/include/catalog/pg_proc.h	2012-06-20 03:32:43.000000000 +0900
--- postgresql-9.1.2/src/include/catalog/pg_proc.h	2012-06-26 01:51:46.000000000 +0900
***************
*** 2881,2886 ****
--- 2881,2889 ----
  DATA(insert OID = 3073 ( pg_is_xlog_replay_paused	PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_xlog_replay_paused _null_ _null_ _null_ ));
  DESCR("true if xlog replay is paused");
  
+ DATA(insert OID = 3764 (  pg_stat_get_lwlocks	PGNSP PGUID 12 1 100 0 f f f f t s 0 0 2249 "" "{20,20,20,20}" "{o,o,o,o}" "{lwlockid,calls,waits,time_ms}" _null_ pg_stat_get_lwlocks _null_ _null_ _null_ ));
+ DESCR("light-weight lock statistics");
+ 
  DATA(insert OID = 2621 ( pg_reload_conf			PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ ));
  DESCR("reload configuration files");
  DATA(insert OID = 2622 ( pg_rotate_logfile		PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
diff -rc postgresql-9.1.2.orig/src/include/storage/lwlock.h postgresql-9.1.2/src/include/storage/lwlock.h
*** postgresql-9.1.2.orig/src/include/storage/lwlock.h	2012-06-20 03:32:43.000000000 +0900
--- postgresql-9.1.2/src/include/storage/lwlock.h	2012-06-26 01:57:22.000000000 +0900
***************
*** 115,118 ****
--- 115,121 ----
  
  extern void RequestAddinLWLocks(int n);
  
+ extern void lwlock_get_stat(LWLockId, uint64 *, uint64 *, uint64 *);
+ extern void lwlock_get_stat_sum(LWLockId, LWLockId, uint64 *, uint64 *, uint64 *);
+ 
  #endif   /* LWLOCK_H */
#2Josh Berkus
josh@agliodbs.com
In reply to: Satoshi Nagayasu (#1)
Re: pg_stat_lwlocks view - lwlocks statistics

On 6/25/12 1:29 PM, Satoshi Nagayasu wrote:

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

This would be my main concern with this patch; it's hard for me to
imagine that it has no performance impact *at all*, since trace_lwlocks
has quite a noticeable one in my experience. However, the answer to that
is to submit the patch and let people test.

I will remark that it would be far more useful to me if we could also
track lwlocks per session. Overall counts are somewhat useful, but more
granular counts are even more useful. What period of time does the
table cover? Since last reset?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#3Satoshi Nagayasu
snaga@uptime.jp
In reply to: Josh Berkus (#2)
Re: pg_stat_lwlocks view - lwlocks statistics

2012/06/26 6:44, Josh Berkus wrote:

On 6/25/12 1:29 PM, Satoshi Nagayasu wrote:

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

This would be my main concern with this patch; it's hard for me to
imagine that it has no performance impact *at all*, since trace_lwlocks
has quite a noticeable one in my experience. However, the answer to that
is to submit the patch and let people test.

Thanks. I will submit the patch to the CommitFest page with some fixes
so that it can work with the latest PostgreSQL from Git.

I will remark that it would be far more useful to me if we could also
track lwlocks per session. Overall counts are somewhat useful, but more
granular counts are even more useful. What period of time does the
table cover? Since last reset?

Yes, it has not been implemented yet, since this code is just a
proof-of-concept, but it is another design issue that needs to be discussed.

To implement it, a new array could be added in local process memory
to hold lwlock statistics, with the counters updated both in shared
memory and in local process memory at once. Then, the session could
retrieve 'per-session' statistics from local process memory
via some dedicated function.

Does it make sense? Any comments?

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#4Satoshi Nagayasu
snaga@uptime.jp
In reply to: Satoshi Nagayasu (#1)
Re: pg_stat_lwlocks view - lwlocks statistics

2012/06/26 7:04, Kevin Grittner wrote:

Josh Berkus<josh@agliodbs.com> wrote:

On 6/25/12 1:29 PM, Satoshi Nagayasu wrote:

(1) Performance

I've measured LWLock performance both with and without the
patch, and confirmed that this patch does not affect the LWLock
performance at all.

This would be my main concern with this patch; it's hard for me to
imagine that it has no performance impact *at all*, since
trace_lwlocks has quite a noticeable one in my experience.
However, the answer to that is to submit the patch and let people
test.

I think overhead is going to depend quite a bit on the
gettimeofday() implementation, since that is called twice per lock
wait.

Yes.
It's one of my concerns, and what I actually want hackers to test.

It looked to me like there was nothing to prevent concurrent updates
of the counts while gathering the accumulated values for display.
Won't this be a problem on 32-bit builds?

Actually, I'd like to know how I can improve my code on a 32-bit box.

However, unfortunately I don't have any 32-bit (physical) box right now,
so I'd like someone to test it if testing is needed.

Please add this to the Open CommitFest for a proper review:

https://commitfest.postgresql.org/action/commitfest_view/open

Will submit soon. Thanks.

-Kevin

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#5Satoshi Nagayasu
snaga@uptime.jp
In reply to: Satoshi Nagayasu (#1)
1 attachment(s)
Re: pg_stat_lwlocks view - lwlocks statistics

Hi all,

I've modified the pg_stat_lwlocks patch to be able to work with
the latest PostgreSQL Git code.

This patch provides:
  pg_stat_lwlocks           new system view to show lwlock statistics
  pg_stat_get_lwlocks()     new function to retrieve lwlock statistics
  pg_stat_reset_lwlocks()   new function to reset lwlock statistics

Please try it out.

Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote:

Hi all,

I've been working on a new system view, pg_stat_lwlocks, to observe
LWLock, and just completed my 'proof-of-concept' code that can work
with version 9.1.

Now, I'd like to know the possibility of this feature for future
release.

With this patch, DBAs can easily identify bottlenecks around lwlocks.
--------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#
--------------------------------------------------

In this view,
'lwlockid' represents the LWLockId used in the backends.
'calls' represents how many times LWLockAcquire() was called.
'waits' represents how many times LWLockAcquire() had to wait
before acquiring the lock.
'time_ms' represents the total time, in milliseconds, that
LWLockAcquire() spent waiting on the lwlock.

And lwlocks that use a LWLockId range, such as BufMappingLock or
LockMgrLock, would be grouped and summed up in a single record.
For example, lwlockid 49 in the above view represents LockMgrLock
statistics.

Now, I know there are some considerations.

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio
was >99.9%.

(2) Memory space

In this patch, I added three new members to LWLock structure
as uint64 to collect statistics.

It means that those members must be held in the shared memory,
but I'm not sure whether it's appropriate.

I think another possible option is holding those statistics
values in local (backend) process memory, and send them through
the stat collector process (like other statistics values).

(3) LWLock names (or labels)

Now, the pg_stat_lwlocks view shows the LWLockId itself, but a raw
LWLockId makes it hard for a DBA to determine the actual lock type.

So, I want to show LWLock names (or labels), like 'WALWriteLock'
or 'LockMgrLock', but how should I implement it?

Any comments?

Regards,

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachments:

pg_stat_lwlocks_20120626.diff (text/plain; charset=Shift_JIS)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7cc1d41..f832b45 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+            S.lwlockid,
+            S.calls,
+            S.waits,
+            S.time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 95d4b37..2a2c197 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#include <sys/time.h>
 
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -46,6 +47,11 @@ typedef struct LWLock
 	PGPROC	   *head;			/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
 	/* tail is undefined when head is NULL */
+
+	/* statistics stuff */
+	uint64		calls;
+	uint64		waits;
+	uint64		time_ms;
 } LWLock;
 
 /*
@@ -287,6 +293,9 @@ CreateLWLocks(void)
 		lock->lock.shared = 0;
 		lock->lock.head = NULL;
 		lock->lock.tail = NULL;
+		lock->lock.calls = 0;
+		lock->lock.waits = 0;
+		lock->lock.time_ms = 0;
 	}
 
 	/*
@@ -342,8 +351,10 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	int			extraWaits = 0;
+	struct timeval		wait_start,wait_done;
 
 	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
+	lock->calls++;
 
 #ifdef LWLOCK_STATS
 	/* Set up local count state first time through in a given process */
@@ -467,6 +478,8 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 #endif
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+		lock->waits++;
+		gettimeofday(&wait_start, NULL);
 
 		for (;;)
 		{
@@ -478,6 +491,8 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		}
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+		gettimeofday(&wait_done, NULL);
+		lock->time_ms += (wait_done.tv_sec-wait_start.tv_sec)*1000 + (wait_done.tv_usec-wait_start.tv_usec)/1000;
 
 		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
 
@@ -879,3 +894,48 @@ LWLockHeldByMe(LWLockId lockid)
 	}
 	return false;
 }
+
+void
+lwlock_get_stat(LWLockId lockid, uint64 *calls, uint64 *waits, uint64 *time_ms)
+{
+	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+
+	SpinLockAcquire(&lock->mutex);
+	*calls   = lock->calls;
+	*waits   = lock->waits;
+	*time_ms = lock->time_ms;
+	SpinLockRelease(&lock->mutex);
+}
+
+void
+lwlock_get_stat_sum(LWLockId start, LWLockId end, uint64 *calls, uint64 *waits, uint64 *time_ms)
+{
+	LWLockId lockid;
+
+	*calls   = 0;
+	*waits   = 0;
+	*time_ms = 0;
+
+	for (lockid=start ; lockid<=end ; lockid++)
+	{
+		uint64 calls2, waits2, time_ms2;
+
+		lwlock_get_stat(lockid, &calls2, &waits2, &time_ms2);
+
+		*calls   += calls2;
+		*waits   += waits2;
+		*time_ms += time_ms2;
+	}
+}
+
+void
+lwlock_reset_stat(LWLockId lockid)
+{
+	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+
+	SpinLockAcquire(&lock->mutex);
+	lock->calls   = 0;
+	lock->waits   = 0;
+	lock->time_ms = 0;
+	SpinLockRelease(&lock->mutex);
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7c0705a..b4e453f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -117,6 +117,8 @@ extern Datum pg_stat_reset_shared(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
 
+extern Datum pg_stat_get_lwlocks(PG_FUNCTION_ARGS);
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
 
@@ -1700,3 +1702,109 @@ pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+
+	/* stuff done only on the first call of the function */
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(4, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lockid",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "calls",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "waits",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "time_ms",
+						   INT8OID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+		funcctx->max_calls = NumLWLocks();
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	/* stuff done on every call of the function */
+	funcctx = SRF_PERCALL_SETUP();
+
+	if (funcctx->call_cntr < funcctx->max_calls)
+	{
+		Datum		values[4];
+		bool		nulls[4];
+		HeapTuple	tuple;
+		LWLockId	lockid;
+		uint64		calls,waits,time_ms;
+
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		lockid = funcctx->call_cntr;
+
+		if ( lockid<FirstBufMappingLock )
+		{
+			lwlock_get_stat(lockid, &calls, &waits, &time_ms);
+		}
+		else if ( FirstBufMappingLock<=lockid && lockid<FirstLockMgrLock )
+		{
+			lwlock_get_stat_sum(FirstBufMappingLock, FirstLockMgrLock-1,
+					    &calls, &waits, &time_ms);
+			funcctx->call_cntr = FirstLockMgrLock - 1;
+		}
+		else if ( FirstLockMgrLock<=lockid && lockid<FirstPredicateLockMgrLock )
+		{
+			lwlock_get_stat_sum(FirstLockMgrLock, FirstPredicateLockMgrLock-1,
+					    &calls, &waits, &time_ms);
+			funcctx->call_cntr = FirstPredicateLockMgrLock - 1;
+		}
+		else if ( FirstPredicateLockMgrLock<=lockid && lockid<NumFixedLWLocks )
+		{
+			lwlock_get_stat_sum(FirstPredicateLockMgrLock, NumFixedLWLocks-1,
+					    &calls, &waits, &time_ms);
+			funcctx->call_cntr = NumFixedLWLocks - 1;
+		}
+		else if ( NumFixedLWLocks<=lockid && lockid<NumLWLocks() )
+		{
+			lwlock_get_stat_sum(NumFixedLWLocks, NumLWLocks()-1,
+					    &calls, &waits, &time_ms);
+			funcctx->call_cntr = NumLWLocks() - 1;
+		}
+
+		values[0] = Int64GetDatum(lockid);
+		values[1] = Int64GetDatum(calls);
+		values[2] = Int64GetDatum(waits);
+		values[3] = Int64GetDatum(time_ms);
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+
+		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+	}
+	else
+	{
+		SRF_RETURN_DONE(funcctx);
+	}
+}
+
+Datum
+pg_stat_reset_lwlocks(PG_FUNCTION_ARGS)
+{
+	LWLockId	lockid;
+
+	for (lockid=0 ; lockid<NumLWLocks() ; lockid++)
+	{
+		lwlock_reset_stat(lockid);
+	}
+
+	PG_RETURN_VOID();
+}
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bee7154..85d2a7f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2602,6 +2602,10 @@ DATA(insert OID = 1936 (  pg_stat_get_backend_idset		PGNSP PGUID 12 1 100 0 0 f
 DESCR("statistics: currently active backend IDs");
 DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active backends");
+DATA(insert OID = 3764 (  pg_stat_get_lwlocks  PGNSP PGUID 12 1 100 0 0 f f f f f t s 0 0 2249 "" "{20,20,20,20}" "{o,o,o,o}" "{lwlockid,calls,waits,time_ms}" _null_ pg_stat_get_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: light-weight lock statistics");
+DATA(insert OID = 3765 (  pg_stat_reset_lwlocks  PGNSP PGUID 12 1 0 0 0 f f f f f f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_stat_reset_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: reset light-weight lock statistics");
 DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 82d8ec4..41c4259 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -119,4 +119,8 @@ extern void CreateLWLocks(void);
 
 extern void RequestAddinLWLocks(int n);
 
+extern void lwlock_get_stat(LWLockId, uint64 *, uint64 *, uint64 *);
+extern void lwlock_get_stat_sum(LWLockId, LWLockId, uint64 *, uint64 *, uint64 *);
+extern void lwlock_reset_stat(LWLockId);
+
 #endif   /* LWLOCK_H */
#6Josh Berkus
josh@agliodbs.com
In reply to: Satoshi Nagayasu (#3)
Re: pg_stat_lwlocks view - lwlocks statistics

To implement it, a new array can be added in local process memory
to hold lwlock statistics, and the counters can be updated both in
shared memory and in local process memory at once. Then, the session
can retrieve 'per-session' statistics from local process memory
via a dedicated function.

That would be way cool from a diagnostics perspective. Not sure what
impact it has, though.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#7Satoshi Nagayasu
snaga@uptime.jp
In reply to: Satoshi Nagayasu (#5)
1 attachment(s)
pg_stat_lwlocks view - lwlocks statistics, round 2

Hi all,

I have fixed my previous patch for the pg_stat_lwlocks view, and
as Josh suggested, it now supports both local and global (shared)
statistics in the same system view.

Local statistics mean the counters cover only the current session,
while shared statistics are accumulated across the entire cluster.

The global statistics are collected via the pgstat collector
process, as other statistics are.

The global statistics struct has now been split into two parts,
one for bgwriter stats and one for lwlock stats.

Therefore, calling pg_stat_reset_shared('bgwriter') or
pg_stat_reset_shared('lwlocks') resets only the dedicated struct,
not the entire PgStat_GlobalStats.

Comments and review are always welcome.

Regards,

------------------------------------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)

postgres=#
------------------------------------------------------------------------------

2012/06/26 21:11, Satoshi Nagayasu wrote:

Hi all,

I've modified the pg_stat_lwlocks patch to work with
the latest PostgreSQL Git code.

This patch provides:
pg_stat_lwlocks          New system view to show lwlock statistics.
pg_stat_get_lwlocks()    New function to retrieve lwlock statistics.
pg_stat_reset_lwlocks()  New function to reset lwlock statistics.

Please try it out.

Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote:

Hi all,

I've been working on a new system view, pg_stat_lwlocks, to observe
LWLock, and just completed my 'proof-of-concept' code that can work
with version 9.1.

Now, I'd like to know the possibility of this feature for future
release.

With this patch, DBA can easily determine a bottleneck around lwlocks.
--------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
lwlockid | calls | waits | time_ms
----------+--------+-------+---------
49 | 193326 | 32096 | 23688
8 | 3305 | 133 | 1335
2 | 21 | 0 | 0
4 | 135188 | 0 | 0
5 | 57935 | 0 | 0
6 | 141 | 0 | 0
7 | 24580 | 1 | 0
3 | 3282 | 0 | 0
1 | 41 | 0 | 0
9 | 3 | 0 | 0
(10 rows)

postgres=#
--------------------------------------------------

In this view,
'lwlockid' is the LWLockId used in the backends,
'calls' is the number of times LWLockAcquire() was called,
'waits' is the number of times LWLockAcquire() had to wait
before acquiring the lock, and
'time_ms' is the total time, in milliseconds, that LWLockAcquire()
spent waiting on the lwlock.

LWLocks that span an LWLockId range, such as BufMappingLock or
LockMgrLock, are grouped and summed into a single record.
For example, lwlockid 49 in the view above represents the
LockMgrLock statistics.

Now, I know there are some considerations.

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio
was >99.9%.

(2) Memory space

In this patch, I added three new uint64 members to the LWLock
structure to collect statistics.

This means those members must be held in shared memory,
but I'm not sure whether that's appropriate.

I think another possible option is holding those statistics
values in local (backend) process memory and sending them through
the stats collector process (like other statistics values).

(3) LWLock names (or labels)

Currently, the pg_stat_lwlocks view shows the raw LWLockId, which
makes it hard for a DBA to determine the actual lock type.

So, I want to show LWLock names (or labels), like 'WALWriteLock'
or 'LockMgrLock', but how should I implement it?

Any comments?

Regards,

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachments:

pg_stat_lwlocks_20121013.diff (text/plain; charset=Shift_JIS)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 607a72f..2f84940 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -671,6 +671,17 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+            S.lwlockid,
+            S.local_calls,
+            S.local_waits,
+            S.local_time_ms,
+            S.shared_calls,
+            S.shared_waits,
+            S.shared_time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8389d5c..970e8bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -282,6 +282,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
+static void pgstat_recv_lwlockstat(PgStat_MsgLWLockstat *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
@@ -1188,6 +1189,8 @@ pgstat_reset_shared_counters(const char *target)
 
 	if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "lwlocks") == 0)
+		msg.m_resettarget = RESET_LWLOCKSTAT;
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1344,6 +1347,72 @@ pgstat_report_recovery_conflict(int reason)
 }
 
 /* --------
+ * pgstat_report_lwlockstat() -
+ *
+ *	Tell the collector about lwlock statistics.
+ * --------
+ */
+void
+pgstat_report_lwlockstat(void)
+{
+	PgStat_MsgLWLockstat msg;
+
+	int32 lockid = 0;
+	int need_continue = 0;
+
+ report_continue:
+	memset(&msg, 0, sizeof(PgStat_MsgLWLockstat));
+
+	for ( ; lockid<NumFixedLWLocks+1 ; lockid++)
+	{
+		uint64 calls, waits, time_ms;
+
+		calls = waits = time_ms = 0;
+
+		calls   = lwlock_get_stat_calls_global(lockid);
+		waits   = lwlock_get_stat_waits_global(lockid);
+		time_ms = lwlock_get_stat_time_ms_global(lockid);
+
+		if ( calls>0 || waits>0 || time_ms>0 )
+		{
+			msg.m_entry[msg.m_nentries].lockid      = lockid;
+			msg.m_entry[msg.m_nentries].calls       = calls;
+			msg.m_entry[msg.m_nentries].waits       = waits;
+			msg.m_entry[msg.m_nentries].waited_time = time_ms;
+
+			msg.m_nentries++;
+
+			lwlock_reset_stat_global(lockid);
+
+			/*
+			 * Need to keep a message packet smaller than PGSTAT_MSG_PAYLOAD.
+			 * So, going to split a report into multiple messages.
+			 */
+			if ( msg.m_nentries>=MAX_LWLOCKSTAT_ENTRIES )
+			{
+				need_continue = 1;
+				break;
+			}
+		}
+	}
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_LWLOCKSTAT);
+	pgstat_send(&msg, sizeof(msg));
+
+	/*
+	 * Need to continue because of the larger report?
+	 */
+	if ( need_continue )
+	{
+		need_continue = 0;
+		goto report_continue;
+	}
+}
+
+/* --------
  * pgstat_report_deadlock() -
  *
  *	Tell the collector about a deadlock detected.
@@ -3219,6 +3288,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
 					break;
 
+				case PGSTAT_MTYPE_LWLOCKSTAT:
+					pgstat_recv_lwlockstat((PgStat_MsgLWLockstat *) &msg, len);
+					break;
+
 				case PGSTAT_MTYPE_DEADLOCK:
 					pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
 					break;
@@ -4379,8 +4452,15 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
 	if (msg->m_resettarget == RESET_BGWRITER)
 	{
 		/* Reset the global background writer statistics for the cluster. */
-		memset(&globalStats, 0, sizeof(globalStats));
-		globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+		memset(&globalStats.bgwriterstats, 0, sizeof(globalStats.bgwriterstats));
+		globalStats.bgwriterstats.reset_timestamp = GetCurrentTimestamp();
+	}
+
+	if (msg->m_resettarget == RESET_LWLOCKSTAT)
+	{
+		/* Reset the global lwlock statistics for the cluster. */
+		memset(&globalStats.lwlockstats, 0, sizeof(globalStats.lwlockstats));
+		globalStats.lwlockstats.reset_timestamp = GetCurrentTimestamp();
 	}
 
 	/*
@@ -4521,16 +4601,16 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-	globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-	globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-	globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-	globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-	globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-	globalStats.buf_written_clean += msg->m_buf_written_clean;
-	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-	globalStats.buf_written_backend += msg->m_buf_written_backend;
-	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.bgwriterstats.timed_checkpoints += msg->m_timed_checkpoints;
+	globalStats.bgwriterstats.requested_checkpoints += msg->m_requested_checkpoints;
+	globalStats.bgwriterstats.checkpoint_write_time += msg->m_checkpoint_write_time;
+	globalStats.bgwriterstats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
+	globalStats.bgwriterstats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
+	globalStats.bgwriterstats.buf_written_clean += msg->m_buf_written_clean;
+	globalStats.bgwriterstats.maxwritten_clean += msg->m_maxwritten_clean;
+	globalStats.bgwriterstats.buf_written_backend += msg->m_buf_written_backend;
+	globalStats.bgwriterstats.buf_fsync_backend += msg->m_buf_fsync_backend;
+	globalStats.bgwriterstats.buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -4574,6 +4654,27 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
 }
 
 /* ----------
+ * pgstat_recv_lwlockstat() -
+ *
+ *	Process a LWLockstat message.
+ * ----------
+ */
+static void
+pgstat_recv_lwlockstat(PgStat_MsgLWLockstat *msg, int len)
+{
+	int i;
+
+	for (i=0 ; i<msg->m_nentries ; i++)
+	{
+		int32 lockid = msg->m_entry[i].lockid;
+
+		globalStats.lwlockstats.lwlock_stat[lockid].calls       += msg->m_entry[i].calls;
+		globalStats.lwlockstats.lwlock_stat[lockid].waits       += msg->m_entry[i].waits;
+		globalStats.lwlockstats.lwlock_stat[lockid].waited_time += msg->m_entry[i].waited_time;
+	}
+}
+
+/* ----------
  * pgstat_recv_deadlock() -
  *
  *	Process a DEADLOCK message.
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 5e1ce17..402799d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#include <sys/time.h>
 
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -48,6 +49,25 @@ typedef struct LWLock
 	/* tail is undefined when head is NULL */
 } LWLock;
 
+typedef struct LWLockCounter2
+{
+	/* statistics stuff */
+	uint64		calls;
+	uint64		waits;
+	uint64		time_ms;
+} LWLockCounter2;
+
+/*
+ * LWLockCounterLocal has <NumFixedLWLocks> counters
+ * and one additional counter for dynamic LWLocks
+ * to hold lwlock statistic in the local session.
+ */
+LWLockCounter2 LWLockCounterLocal[NumFixedLWLocks+1];
+
+LWLockCounter2 LWLockCounterGlobal[NumFixedLWLocks+1];
+
+#define LWLockCounterId(X) ((X) < (NumFixedLWLocks+1) ? (X) : (NumFixedLWLocks+1))
+
 /*
  * All the LWLock structs are allocated as an array in shared memory.
  * (LWLockIds are indexes into the array.)	We force the array stride to
@@ -90,6 +110,8 @@ static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 static int	lock_addin_request = 0;
 static bool lock_addin_request_allowed = true;
 
+static void InitLWockCounter(void);
+
 #ifdef LWLOCK_STATS
 static int	counts_for_pid = 0;
 static int *sh_acquire_counts;
@@ -253,6 +275,26 @@ LWLockShmemSize(void)
 	return size;
 }
 
+/*
+ * Initialize local and global counters for lwlock statistics.
+ */
+static void
+InitLWockCounter(void)
+{
+	int i;
+
+	for (i=0 ; i<NumFixedLWLocks+1 ; i++)
+	{
+		LWLockCounterLocal[i].calls   = 0;
+		LWLockCounterLocal[i].waits   = 0;
+		LWLockCounterLocal[i].time_ms = 0;
+
+		LWLockCounterGlobal[i].calls   = 0;
+		LWLockCounterGlobal[i].waits   = 0;
+		LWLockCounterGlobal[i].time_ms = 0;
+	}
+}
+
 
 /*
  * Allocate shmem space for LWLocks and initialize the locks.
@@ -298,6 +340,8 @@ CreateLWLocks(void)
 	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
 	LWLockCounter[0] = (int) NumFixedLWLocks;
 	LWLockCounter[1] = numLocks;
+
+	InitLWockCounter();
 }
 
 
@@ -344,9 +388,13 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	int			extraWaits = 0;
+	struct timeval		wait_start,wait_done;
 
 	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
 
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].calls++;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls++;
+
 #ifdef LWLOCK_STATS
 	/* Set up local count state first time through in a given process */
 	if (counts_for_pid != MyProcPid)
@@ -395,6 +443,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	for (;;)
 	{
 		bool		mustwait;
+		uint64 waited;
 
 		/* Acquire mutex.  Time spent holding mutex should be short! */
 #ifdef LWLOCK_STATS
@@ -473,6 +522,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 #endif
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+		LWLockCounterLocal[ LWLockCounterId(lockid) ].waits++;
+		LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits++;
+		gettimeofday(&wait_start, NULL);
 
 		for (;;)
 		{
@@ -484,6 +536,20 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		}
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+		gettimeofday(&wait_done, NULL);
+
+		if ( wait_done.tv_usec >= wait_start.tv_usec )
+		{
+			waited = ( wait_done.tv_usec - wait_start.tv_usec ) / 1000 ;
+			waited += ( wait_done.tv_sec - wait_start.tv_sec ) * 1000 ;
+		}
+		else
+		{
+			waited = ( wait_done.tv_usec + 1000*1000 - wait_start.tv_usec ) / 1000 ;
+			waited += ( wait_done.tv_sec - 1 - wait_start.tv_sec ) * 1000 ;
+		}
+		LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms += waited;
+		LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms += waited;
 
 		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
 
@@ -885,3 +951,55 @@ LWLockHeldByMe(LWLockId lockid)
 	}
 	return false;
 }
+
+uint64
+lwlock_get_stat_calls_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].calls;
+}
+
+uint64
+lwlock_get_stat_waits_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].waits;
+}
+
+uint64
+lwlock_get_stat_time_ms_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms;
+}
+
+void
+lwlock_reset_stat_local(LWLockId lockid)
+{
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].calls   = 0;
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].waits   = 0;
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms = 0;
+}
+
+uint64
+lwlock_get_stat_calls_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls;
+}
+
+uint64
+lwlock_get_stat_waits_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits;
+}
+
+uint64
+lwlock_get_stat_time_ms_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms;
+}
+
+void
+lwlock_reset_stat_global(LWLockId lockid)
+{
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls   = 0;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits   = 0;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms = 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 585db1a..5ca2c6f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3919,6 +3919,8 @@ PostgresMain(int argc, char *argv[], const char *username)
 				pgstat_report_activity(STATE_IDLE, NULL);
 			}
 
+			pgstat_report_lwlockstat();
+
 			ReadyForQuery(whereToSendOutput);
 			send_ready_for_query = false;
 		}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7d4059f..e74042d 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -118,6 +118,8 @@ extern Datum pg_stat_reset_shared(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
 
+extern Datum pg_stat_get_lwlocks(PG_FUNCTION_ARGS);
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
 
@@ -1399,69 +1401,69 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
 	/* time is already in msec, just convert to double for presentation */
-	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->bgwriterstats.checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
 	/* time is already in msec, just convert to double for presentation */
-	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->bgwriterstats.checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+	PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->bgwriterstats.reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_alloc);
 }
 
 Datum
@@ -1701,3 +1703,162 @@ pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+
+	/* stuff done only on the first call of the function */
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(7, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lockid",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "local_calls",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "local_waits",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "local_time_ms",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "shared_calls",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "shared_waits",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 7, "shared_time_ms",
+						   INT8OID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+		funcctx->max_calls = NumFixedLWLocks + 1;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	/* stuff done on every call of the function */
+	funcctx = SRF_PERCALL_SETUP();
+
+	if (funcctx->call_cntr < funcctx->max_calls)
+	{
+		Datum		values[7];
+		bool		nulls[7];
+		HeapTuple	tuple;
+		LWLockId	lockid;
+		uint64		local_calls,local_waits,local_time_ms;
+		uint64		shared_calls,shared_waits,shared_time_ms;
+		int i;
+		PgStat_LWLockEntry *lwlock_stat = pgstat_fetch_global()->lwlockstats.lwlock_stat;
+
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		lockid = funcctx->call_cntr;
+
+		local_calls = local_waits = local_time_ms = 0;
+		shared_calls = shared_waits = shared_time_ms = 0;
+
+		/*
+		 * Partitioned locks need to be summed up by the lock group.
+		 */
+		if ( FirstBufMappingLock <= lockid && lockid < FirstLockMgrLock )
+		{
+			for (i=0 ; i<NUM_BUFFER_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   += lwlock_get_stat_calls_local(FirstBufMappingLock+i);
+				local_waits   += lwlock_get_stat_waits_local(FirstBufMappingLock+i);
+				local_time_ms += lwlock_get_stat_time_ms_local(FirstBufMappingLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstBufMappingLock+i].calls;
+				shared_waits   += lwlock_stat[FirstBufMappingLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstBufMappingLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_BUFFER_PARTITIONS;
+		}
+		else if ( FirstLockMgrLock <= lockid && lockid < FirstPredicateLockMgrLock )
+		{
+			for (i=0 ; i<NUM_LOCK_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   += lwlock_get_stat_calls_local(FirstLockMgrLock+i);
+				local_waits   += lwlock_get_stat_waits_local(FirstLockMgrLock+i);
+				local_time_ms += lwlock_get_stat_time_ms_local(FirstLockMgrLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstLockMgrLock+i].calls;
+				shared_waits   += lwlock_stat[FirstLockMgrLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstLockMgrLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_LOCK_PARTITIONS;
+		}
+		else if ( FirstPredicateLockMgrLock <= lockid && lockid < NumFixedLWLocks )
+		{
+			for (i=0 ; i<NUM_PREDICATELOCK_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   += lwlock_get_stat_calls_local(FirstPredicateLockMgrLock+i);
+				local_waits   += lwlock_get_stat_waits_local(FirstPredicateLockMgrLock+i);
+				local_time_ms += lwlock_get_stat_time_ms_local(FirstPredicateLockMgrLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstPredicateLockMgrLock+i].calls;
+				shared_waits   += lwlock_stat[FirstPredicateLockMgrLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstPredicateLockMgrLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_PREDICATELOCK_PARTITIONS;
+		}
+		else
+		{
+			/* local statistics */
+			local_calls   = lwlock_get_stat_calls_local(lockid);
+			local_waits   = lwlock_get_stat_waits_local(lockid);
+			local_time_ms = lwlock_get_stat_time_ms_local(lockid);
+
+			/* global statistics */
+			shared_calls   = lwlock_stat[lockid].calls;
+			shared_waits   = lwlock_stat[lockid].waits;
+			shared_time_ms = lwlock_stat[lockid].waited_time;
+		}
+
+		values[0] = Int64GetDatum(lockid);
+		values[1] = Int64GetDatum(local_calls);
+		values[2] = Int64GetDatum(local_waits);
+		values[3] = Int64GetDatum(local_time_ms);
+		values[4] = Int64GetDatum(shared_calls);
+		values[5] = Int64GetDatum(shared_waits);
+		values[6] = Int64GetDatum(shared_time_ms);
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+
+		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+	}
+	else
+	{
+		SRF_RETURN_DONE(funcctx);
+	}
+}
+
+Datum
+pg_stat_reset_lwlocks(PG_FUNCTION_ARGS)
+{
+	LWLockId	lockid;
+
+	for (lockid=0 ; lockid<NumLWLocks() ; lockid++)
+	{
+		lwlock_reset_stat_local(lockid);
+	}
+
+	PG_RETURN_VOID();
+}
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f935eb1..4582b12 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2612,6 +2612,10 @@ DATA(insert OID = 1936 (  pg_stat_get_backend_idset		PGNSP PGUID 12 1 100 0 0 f
 DESCR("statistics: currently active backend IDs");
 DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active backends");
+DATA(insert OID = 3764 (  pg_stat_get_lwlocks  PGNSP PGUID 12 1 100 0 0 f f f f f t s 0 0 2249 "" "{20,20,20,20,20,20,20}" "{o,o,o,o,o,o,o}" "{lwlockid,local_calls,local_waits,local_time_ms,shared_calls,shared_waits,shared_time_ms}" _null_ pg_stat_get_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: light-weight lock statistics");
+DATA(insert OID = 3765 (  pg_stat_reset_lwlocks  PGNSP PGUID 12 1 0 0 0 f f f f f f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_stat_reset_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: reset light-weight lock statistics");
 DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..0e52534 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -15,6 +15,7 @@
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
 #include "portability/instr_time.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -49,6 +50,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
 	PGSTAT_MTYPE_TEMPFILE,
+	PGSTAT_MTYPE_LWLOCKSTAT,
 	PGSTAT_MTYPE_DEADLOCK
 } StatMsgType;
 
@@ -102,7 +104,8 @@ typedef struct PgStat_TableCounts
 /* Possible targets for resetting cluster-wide shared values */
 typedef enum PgStat_Shared_Reset_Target
 {
-	RESET_BGWRITER
+	RESET_BGWRITER,
+	RESET_LWLOCKSTAT
 } PgStat_Shared_Reset_Target;
 
 /* Possible object types for resetting single counters */
@@ -605,13 +608,31 @@ typedef struct PgStat_StatFuncEntry
 	PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
+#define MAX_LWLOCKSTAT_ENTRIES 20
+
+typedef struct PgStat_LWLockEntry
+{
+	LWLockId       lockid;
+	PgStat_Counter calls;
+	PgStat_Counter waits;
+	PgStat_Counter waited_time;     /* time in milliseconds */
+} PgStat_LWLockEntry;
+
+typedef struct PgStat_MsgLWLockstat
+{
+	PgStat_MsgHdr m_hdr;
+	int			m_nentries;
+
+	/* Need to keep a msg smaller than PGSTAT_MSG_PAYLOAD */
+	PgStat_LWLockEntry m_entry[MAX_LWLOCKSTAT_ENTRIES];
+} PgStat_MsgLWLockstat;
+
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics for BgWriter
  */
-typedef struct PgStat_GlobalStats
+typedef struct PgStat_BgWriterGlobalStats
 {
-	TimestampTz stats_timestamp;	/* time of stats file update */
 	PgStat_Counter timed_checkpoints;
 	PgStat_Counter requested_checkpoints;
 	PgStat_Counter checkpoint_write_time;		/* times in milliseconds */
@@ -622,6 +643,28 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	TimestampTz reset_timestamp;
+} PgStat_BgWriterGlobalStats;
+
+/*
+ * Global statistics for LWLocks
+ */
+typedef struct PgStat_LWLockGlobalStats
+{
+	PgStat_LWLockEntry lwlock_stat[NumFixedLWLocks+1];
+	TimestampTz reset_timestamp;
+} PgStat_LWLockGlobalStats;
+
+/*
+ * Global statistics kept in the stats collector
+ */
+typedef struct PgStat_GlobalStats
+{
+	TimestampTz stats_timestamp;	/* time of stats file update */
+
+	PgStat_BgWriterGlobalStats bgwriterstats;
+	PgStat_LWLockGlobalStats lwlockstats;
+
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 82d8ec4..a101db5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -119,4 +119,14 @@ extern void CreateLWLocks(void);
 
 extern void RequestAddinLWLocks(int n);
 
+extern uint64 lwlock_get_stat_calls_local(LWLockId);
+extern uint64 lwlock_get_stat_waits_local(LWLockId);
+extern uint64 lwlock_get_stat_time_ms_local(LWLockId);
+extern void lwlock_reset_stat_local(LWLockId);
+
+extern uint64 lwlock_get_stat_calls_global(LWLockId);
+extern uint64 lwlock_get_stat_waits_global(LWLockId);
+extern uint64 lwlock_get_stat_time_ms_global(LWLockId);
+extern void lwlock_reset_stat_global(LWLockId);
+
 #endif   /* LWLOCK_H */
#8 Satoshi Nagayasu
snaga@uptime.jp
In reply to: Satoshi Nagayasu (#7)
1 attachment(s)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Hi,

2012/10/13 23:05, Satoshi Nagayasu wrote:

Hi all,

I have fixed my previous patch for pg_stat_lwlocks view, and
as Josh commented, it now supports local and global (shared)
statistics in the same system view.

Sorry, I found some mistakes. A new, fixed version is attached to this mail.

Regards,

Local statistics mean that the counters are effective only within
the current session, while shared statistics are accumulated across
the entire cluster.

The global statistics are collected via the pgstat collector
process, like other statistics.

Now, the global statistics struct has been split into two parts,
one for bgwriter stats and one for lwlock stats.

Therefore, calling pg_stat_reset_shared('bgwriter') or
pg_stat_reset_shared('lwlocks') resets only the dedicated struct,
not the entire PgStat_GlobalStats.

Comments and review are always welcome.

Regards,

------------------------------------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)

postgres=#
------------------------------------------------------------------------------

2012/06/26 21:11, Satoshi Nagayasu wrote:

Hi all,

I've modified the pg_stat_lwlocks patch to be able to work with
the latest PostgreSQL Git code.

This patch provides:
pg_stat_lwlocks New system view to show lwlock statistics.
pg_stat_get_lwlocks() New function to retrieve lwlock statistics.
pg_stat_reset_lwlocks() New function to reset lwlock statistics.

Please try it out.

Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote:

Hi all,

I've been working on a new system view, pg_stat_lwlocks, to observe
LWLock, and just completed my 'proof-of-concept' code that can work
with version 9.1.

Now, I'd like to hear about the possibility of including this
feature in a future release.

With this patch, a DBA can easily identify bottlenecks around lwlocks.
--------------------------------------------------
postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
lwlockid | calls | waits | time_ms
----------+--------+-------+---------
49 | 193326 | 32096 | 23688
8 | 3305 | 133 | 1335
2 | 21 | 0 | 0
4 | 135188 | 0 | 0
5 | 57935 | 0 | 0
6 | 141 | 0 | 0
7 | 24580 | 1 | 0
3 | 3282 | 0 | 0
1 | 41 | 0 | 0
9 | 3 | 0 | 0
(10 rows)

postgres=#
--------------------------------------------------

In this view,
'lwlockid' represents the LWLockId used in the backends.
'calls' represents how many times LWLockAcquire() was called.
'waits' represents how many times LWLockAcquire() had to wait
before acquiring the lock.
'time_ms' represents the total time, in milliseconds, that
LWLockAcquire() spent waiting on the lwlock.

LWLocks that occupy a range of LWLockIds, such as the BufMappingLock
or LockMgrLock partitions, are grouped and summed into a single record.
For example, lwlockid 49 in the above view represents the LockMgrLock
statistics.

Now, I know there are some considerations.

(1) Performance

I've measured LWLock performance both with and without the patch,
and confirmed that this patch does not affect the LWLock performance
at all.

pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio
was >99.9%.

(2) Memory space

In this patch, I added three new uint64 members to the LWLock
structure to collect statistics.

This means those members must be held in shared memory,
but I'm not sure whether that's appropriate.

I think another possible option is to hold those statistics
in local (backend) process memory, and send them through
the stats collector process (like other statistics values).

(3) LWLock names (or labels)

Now, the pg_stat_lwlocks view shows the LWLockId itself, but a raw
LWLockId is not easy for a DBA to map to an actual lock type.

So, I want to show LWLock names (or labels), like 'WALWriteLock'
or 'LockMgrLock', but how should I implement that?

Any comments?

Regards,

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachments:

pg_stat_lwlocks_20121013_2.diff (text/plain; charset=Shift_JIS)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 607a72f..2f84940 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -671,6 +671,17 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+            S.lwlockid,
+            S.local_calls,
+            S.local_waits,
+            S.local_time_ms,
+            S.shared_calls,
+            S.shared_waits,
+            S.shared_time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8389d5c..970e8bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -282,6 +282,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
+static void pgstat_recv_lwlockstat(PgStat_MsgLWLockstat *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
@@ -1188,6 +1189,8 @@ pgstat_reset_shared_counters(const char *target)
 
 	if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "lwlocks") == 0)
+		msg.m_resettarget = RESET_LWLOCKSTAT;
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1344,6 +1347,72 @@ pgstat_report_recovery_conflict(int reason)
 }
 
 /* --------
+ * pgstat_report_lwlockstat() -
+ *
+ *	Tell the collector about lwlock statistics.
+ * --------
+ */
+void
+pgstat_report_lwlockstat(void)
+{
+	PgStat_MsgLWLockstat msg;
+
+	int32 lockid = 0;
+	int need_continue = 0;
+
+ report_continue:
+	memset(&msg, 0, sizeof(PgStat_MsgLWLockstat));
+
+	for ( ; lockid<NumFixedLWLocks+1 ; lockid++)
+	{
+		uint64 calls, waits, time_ms;
+
+		calls = waits = time_ms = 0;
+
+		calls   = lwlock_get_stat_calls_global(lockid);
+		waits   = lwlock_get_stat_waits_global(lockid);
+		time_ms = lwlock_get_stat_time_ms_global(lockid);
+
+		if ( calls>0 || waits>0 || time_ms>0 )
+		{
+			msg.m_entry[msg.m_nentries].lockid      = lockid;
+			msg.m_entry[msg.m_nentries].calls       = calls;
+			msg.m_entry[msg.m_nentries].waits       = waits;
+			msg.m_entry[msg.m_nentries].waited_time = time_ms;
+
+			msg.m_nentries++;
+
+			lwlock_reset_stat_global(lockid);
+
+			/*
+			 * Need to keep a message packet smaller than PGSTAT_MSG_PAYLOAD.
+			 * So, going to split a report into multiple messages.
+			 */
+			if ( msg.m_nentries>=MAX_LWLOCKSTAT_ENTRIES )
+			{
+				need_continue = 1;
+				break;
+			}
+		}
+	}
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_LWLOCKSTAT);
+	pgstat_send(&msg, sizeof(msg));
+
+	/*
+	 * Need to continue because of the larger report?
+	 */
+	if ( need_continue )
+	{
+		need_continue = 0;
+		goto report_continue;
+	}
+}
+
+/* --------
  * pgstat_report_deadlock() -
  *
  *	Tell the collector about a deadlock detected.
@@ -3219,6 +3288,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
 					break;
 
+				case PGSTAT_MTYPE_LWLOCKSTAT:
+					pgstat_recv_lwlockstat((PgStat_MsgLWLockstat *) &msg, len);
+					break;
+
 				case PGSTAT_MTYPE_DEADLOCK:
 					pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
 					break;
@@ -4379,8 +4452,15 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
 	if (msg->m_resettarget == RESET_BGWRITER)
 	{
 		/* Reset the global background writer statistics for the cluster. */
-		memset(&globalStats, 0, sizeof(globalStats));
-		globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+		memset(&globalStats.bgwriterstats, 0, sizeof(globalStats.bgwriterstats));
+		globalStats.bgwriterstats.reset_timestamp = GetCurrentTimestamp();
+	}
+
+	if (msg->m_resettarget == RESET_LWLOCKSTAT)
+	{
+		/* Reset the global lwlock statistics for the cluster. */
+		memset(&globalStats.lwlockstats, 0, sizeof(globalStats.lwlockstats));
+		globalStats.lwlockstats.reset_timestamp = GetCurrentTimestamp();
 	}
 
 	/*
@@ -4521,16 +4601,16 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-	globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-	globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-	globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-	globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-	globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-	globalStats.buf_written_clean += msg->m_buf_written_clean;
-	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-	globalStats.buf_written_backend += msg->m_buf_written_backend;
-	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.bgwriterstats.timed_checkpoints += msg->m_timed_checkpoints;
+	globalStats.bgwriterstats.requested_checkpoints += msg->m_requested_checkpoints;
+	globalStats.bgwriterstats.checkpoint_write_time += msg->m_checkpoint_write_time;
+	globalStats.bgwriterstats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
+	globalStats.bgwriterstats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
+	globalStats.bgwriterstats.buf_written_clean += msg->m_buf_written_clean;
+	globalStats.bgwriterstats.maxwritten_clean += msg->m_maxwritten_clean;
+	globalStats.bgwriterstats.buf_written_backend += msg->m_buf_written_backend;
+	globalStats.bgwriterstats.buf_fsync_backend += msg->m_buf_fsync_backend;
+	globalStats.bgwriterstats.buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -4574,6 +4654,27 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
 }
 
 /* ----------
+ * pgstat_recv_lwlockstat() -
+ *
+ *	Process a LWLockstat message.
+ * ----------
+ */
+static void
+pgstat_recv_lwlockstat(PgStat_MsgLWLockstat *msg, int len)
+{
+	int i;
+
+	for (i=0 ; i<msg->m_nentries ; i++)
+	{
+		int32 lockid = msg->m_entry[i].lockid;
+
+		globalStats.lwlockstats.lwlock_stat[lockid].calls       += msg->m_entry[i].calls;
+		globalStats.lwlockstats.lwlock_stat[lockid].waits       += msg->m_entry[i].waits;
+		globalStats.lwlockstats.lwlock_stat[lockid].waited_time += msg->m_entry[i].waited_time;
+	}
+}
+
+/* ----------
  * pgstat_recv_deadlock() -
  *
  *	Process a DEADLOCK message.
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 5e1ce17..402799d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#include <sys/time.h>
 
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -48,6 +49,25 @@ typedef struct LWLock
 	/* tail is undefined when head is NULL */
 } LWLock;
 
+typedef struct LWLockCounter2
+{
+	/* statistics stuff */
+	uint64		calls;
+	uint64		waits;
+	uint64		time_ms;
+} LWLockCounter2;
+
+/*
+ * LWLockCounterLocal has <NumFixedLWLocks> counters
+ * and one additional counter for dynamic LWLocks
+ * to hold lwlock statistic in the local session.
+ */
+LWLockCounter2 LWLockCounterLocal[NumFixedLWLocks+1];
+
+LWLockCounter2 LWLockCounterGlobal[NumFixedLWLocks+1];
+
+#define LWLockCounterId(X) ((X) < (NumFixedLWLocks+1) ? (X) : (NumFixedLWLocks+1))
+
 /*
  * All the LWLock structs are allocated as an array in shared memory.
  * (LWLockIds are indexes into the array.)	We force the array stride to
@@ -90,6 +110,8 @@ static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 static int	lock_addin_request = 0;
 static bool lock_addin_request_allowed = true;
 
+static void InitLWockCounter(void);
+
 #ifdef LWLOCK_STATS
 static int	counts_for_pid = 0;
 static int *sh_acquire_counts;
@@ -253,6 +275,26 @@ LWLockShmemSize(void)
 	return size;
 }
 
+/*
+ * Initialize local and global counters for lwlock statistics.
+ */
+static void
+InitLWockCounter(void)
+{
+	int i;
+
+	for (i=0 ; i<NumFixedLWLocks+1 ; i++)
+	{
+		LWLockCounterLocal[i].calls   = 0;
+		LWLockCounterLocal[i].waits   = 0;
+		LWLockCounterLocal[i].time_ms = 0;
+
+		LWLockCounterGlobal[i].calls   = 0;
+		LWLockCounterGlobal[i].waits   = 0;
+		LWLockCounterGlobal[i].time_ms = 0;
+	}
+}
+
 
 /*
  * Allocate shmem space for LWLocks and initialize the locks.
@@ -298,6 +340,8 @@ CreateLWLocks(void)
 	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
 	LWLockCounter[0] = (int) NumFixedLWLocks;
 	LWLockCounter[1] = numLocks;
+
+	InitLWockCounter();
 }
 
 
@@ -344,9 +388,13 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	int			extraWaits = 0;
+	struct timeval		wait_start,wait_done;
 
 	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
 
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].calls++;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls++;
+
 #ifdef LWLOCK_STATS
 	/* Set up local count state first time through in a given process */
 	if (counts_for_pid != MyProcPid)
@@ -395,6 +443,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	for (;;)
 	{
 		bool		mustwait;
+		uint64 waited;
 
 		/* Acquire mutex.  Time spent holding mutex should be short! */
 #ifdef LWLOCK_STATS
@@ -473,6 +522,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 #endif
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+		LWLockCounterLocal[ LWLockCounterId(lockid) ].waits++;
+		LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits++;
+		gettimeofday(&wait_start, NULL);
 
 		for (;;)
 		{
@@ -484,6 +536,20 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		}
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+		gettimeofday(&wait_done, NULL);
+
+		if ( wait_done.tv_usec >= wait_start.tv_usec )
+		{
+			waited = ( wait_done.tv_usec - wait_start.tv_usec ) / 1000 ;
+			waited += ( wait_done.tv_sec - wait_start.tv_sec ) * 1000 ;
+		}
+		else
+		{
+			waited = ( wait_done.tv_usec + 1000*1000 - wait_start.tv_usec ) / 1000 ;
+			waited += ( wait_done.tv_sec - 1 - wait_start.tv_sec ) * 1000 ;
+		}
+		LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms += waited;
+		LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms += waited;
 
 		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
 
@@ -885,3 +951,55 @@ LWLockHeldByMe(LWLockId lockid)
 	}
 	return false;
 }
+
+uint64
+lwlock_get_stat_calls_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].calls;
+}
+
+uint64
+lwlock_get_stat_waits_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].waits;
+}
+
+uint64
+lwlock_get_stat_time_ms_local(LWLockId lockid)
+{
+	return LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms;
+}
+
+void
+lwlock_reset_stat_local(LWLockId lockid)
+{
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].calls   = 0;
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].waits   = 0;
+	LWLockCounterLocal[ LWLockCounterId(lockid) ].time_ms = 0;
+}
+
+uint64
+lwlock_get_stat_calls_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls;
+}
+
+uint64
+lwlock_get_stat_waits_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits;
+}
+
+uint64
+lwlock_get_stat_time_ms_global(LWLockId lockid)
+{
+	return LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms;
+}
+
+void
+lwlock_reset_stat_global(LWLockId lockid)
+{
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].calls   = 0;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].waits   = 0;
+	LWLockCounterGlobal[ LWLockCounterId(lockid) ].time_ms = 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 585db1a..5ca2c6f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3919,6 +3919,8 @@ PostgresMain(int argc, char *argv[], const char *username)
 				pgstat_report_activity(STATE_IDLE, NULL);
 			}
 
+			pgstat_report_lwlockstat();
+
 			ReadyForQuery(whereToSendOutput);
 			send_ready_for_query = false;
 		}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7d4059f..0a26626 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -118,6 +118,8 @@ extern Datum pg_stat_reset_shared(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
 
+extern Datum pg_stat_get_lwlocks(PG_FUNCTION_ARGS);
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
 
@@ -1399,69 +1401,69 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
 	/* time is already in msec, just convert to double for presentation */
-	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->bgwriterstats.checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
 	/* time is already in msec, just convert to double for presentation */
-	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+	PG_RETURN_FLOAT8((double) pgstat_fetch_global()->bgwriterstats.checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+	PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->bgwriterstats.reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+	PG_RETURN_INT64(pgstat_fetch_global()->bgwriterstats.buf_alloc);
 }
 
 Datum
@@ -1701,3 +1703,162 @@ pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+
+	/* stuff done only on the first call of the function */
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(7, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lockid",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "local_calls",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "local_waits",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "local_time_ms",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "shared_calls",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "shared_waits",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 7, "shared_time_ms",
+						   INT8OID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+		funcctx->max_calls = NumFixedLWLocks + 1;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	/* stuff done on every call of the function */
+	funcctx = SRF_PERCALL_SETUP();
+
+	if (funcctx->call_cntr < funcctx->max_calls)
+	{
+		Datum		values[7];
+		bool		nulls[7];
+		HeapTuple	tuple;
+		LWLockId	lockid;
+		uint64		local_calls,local_waits,local_time_ms;
+		uint64		shared_calls,shared_waits,shared_time_ms;
+		int i;
+		PgStat_LWLockEntry *lwlock_stat = pgstat_fetch_global()->lwlockstats.lwlock_stat;
+
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		lockid = funcctx->call_cntr;
+
+		local_calls = local_waits = local_time_ms = 0;
+		shared_calls = shared_waits = shared_time_ms = 0;
+
+		/*
+		 * Partitioned locks need to be summed up by the lock group.
+		 */
+		if ( FirstBufMappingLock <= lockid && lockid < FirstLockMgrLock )
+		{
+			for (i=0 ; i<NUM_BUFFER_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   = lwlock_get_stat_calls_local(FirstBufMappingLock+i);
+				local_waits   = lwlock_get_stat_waits_local(FirstBufMappingLock+i);
+				local_time_ms = lwlock_get_stat_time_ms_local(FirstBufMappingLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstBufMappingLock+i].calls;
+				shared_waits   += lwlock_stat[FirstBufMappingLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstBufMappingLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_BUFFER_PARTITIONS;
+		}
+		else if ( FirstLockMgrLock <= lockid && lockid < FirstPredicateLockMgrLock )
+		{
+			for (i=0 ; i<NUM_LOCK_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   = lwlock_get_stat_calls_local(FirstLockMgrLock+i);
+				local_waits   = lwlock_get_stat_waits_local(FirstLockMgrLock+i);
+				local_time_ms = lwlock_get_stat_time_ms_local(FirstLockMgrLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstLockMgrLock+i].calls;
+				shared_waits   += lwlock_stat[FirstLockMgrLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstLockMgrLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_LOCK_PARTITIONS;
+		}
+		else if ( FirstPredicateLockMgrLock <= lockid && lockid < NumFixedLWLocks )
+		{
+			for (i=0 ; i<NUM_PREDICATELOCK_PARTITIONS ; i++)
+			{
+				/* local statistics */
+				local_calls   = lwlock_get_stat_calls_local(FirstPredicateLockMgrLock+i);
+				local_waits   = lwlock_get_stat_waits_local(FirstPredicateLockMgrLock+i);
+				local_time_ms = lwlock_get_stat_time_ms_local(FirstPredicateLockMgrLock+i);
+
+				/* global statistics */
+				shared_calls   += lwlock_stat[FirstPredicateLockMgrLock+i].calls;
+				shared_waits   += lwlock_stat[FirstPredicateLockMgrLock+i].waits;
+				shared_time_ms += lwlock_stat[FirstPredicateLockMgrLock+i].waited_time;
+			}
+
+			funcctx->call_cntr += NUM_PREDICATELOCK_PARTITIONS;
+		}
+		else
+		{
+			/* local statistics */
+			local_calls   = lwlock_get_stat_calls_local(lockid);
+			local_waits   = lwlock_get_stat_waits_local(lockid);
+			local_time_ms = lwlock_get_stat_time_ms_local(lockid);
+
+			/* global statistics */
+			shared_calls   = lwlock_stat[lockid].calls;
+			shared_waits   = lwlock_stat[lockid].waits;
+			shared_time_ms = lwlock_stat[lockid].waited_time;
+		}
+
+		values[0] = Int64GetDatum(lockid);
+		values[1] = Int64GetDatum(local_calls);
+		values[2] = Int64GetDatum(local_waits);
+		values[3] = Int64GetDatum(local_time_ms);
+		values[4] = Int64GetDatum(shared_calls);
+		values[5] = Int64GetDatum(shared_waits);
+		values[6] = Int64GetDatum(shared_time_ms);
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+
+		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+	}
+	else
+	{
+		SRF_RETURN_DONE(funcctx);
+	}
+}
+
+Datum
+pg_stat_reset_lwlocks(PG_FUNCTION_ARGS)
+{
+	LWLockId	lockid;
+
+	for (lockid=0 ; lockid<NumLWLocks() ; lockid++)
+	{
+		lwlock_reset_stat_local(lockid);
+	}
+
+	PG_RETURN_VOID();
+}
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f935eb1..4582b12 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2612,6 +2612,10 @@ DATA(insert OID = 1936 (  pg_stat_get_backend_idset		PGNSP PGUID 12 1 100 0 0 f
 DESCR("statistics: currently active backend IDs");
 DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active backends");
+DATA(insert OID = 3764 (  pg_stat_get_lwlocks  PGNSP PGUID 12 1 100 0 0 f f f f f t s 0 0 2249 "" "{20,20,20,20,20,20,20}" "{o,o,o,o,o,o,o}" "{lwlockid,local_calls,local_waits,local_time_ms,shared_calls,shared_waits,shared_time_ms}" _null_ pg_stat_get_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: light-weight lock statistics");
+DATA(insert OID = 3765 (  pg_stat_reset_lwlocks  PGNSP PGUID 12 1 0 0 0 f f f f f f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_stat_reset_lwlocks _null_ _null_ _null_ ));
+DESCR("statistics: reset light-weight lock statistics");
 DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..0e52534 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -15,6 +15,7 @@
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
 #include "portability/instr_time.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -49,6 +50,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
 	PGSTAT_MTYPE_TEMPFILE,
+	PGSTAT_MTYPE_LWLOCKSTAT,
 	PGSTAT_MTYPE_DEADLOCK
 } StatMsgType;
 
@@ -102,7 +104,8 @@ typedef struct PgStat_TableCounts
 /* Possible targets for resetting cluster-wide shared values */
 typedef enum PgStat_Shared_Reset_Target
 {
-	RESET_BGWRITER
+	RESET_BGWRITER,
+	RESET_LWLOCKSTAT
 } PgStat_Shared_Reset_Target;
 
 /* Possible object types for resetting single counters */
@@ -605,13 +608,31 @@ typedef struct PgStat_StatFuncEntry
 	PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
+#define MAX_LWLOCKSTAT_ENTRIES 20
+
+typedef struct PgStat_LWLockEntry
+{
+	LWLockId       lockid;
+	PgStat_Counter calls;
+	PgStat_Counter waits;
+	PgStat_Counter waited_time;     /* time in milliseconds */
+} PgStat_LWLockEntry;
+
+typedef struct PgStat_MsgLWLockstat
+{
+	PgStat_MsgHdr m_hdr;
+	int			m_nentries;
+
+	/* Need to keep a msg smaller than PGSTAT_MSG_PAYLOAD */
+	PgStat_LWLockEntry m_entry[MAX_LWLOCKSTAT_ENTRIES];
+} PgStat_MsgLWLockstat;
+
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics for BgWriter
  */
-typedef struct PgStat_GlobalStats
+typedef struct PgStat_BgWriterGlobalStats
 {
-	TimestampTz stats_timestamp;	/* time of stats file update */
 	PgStat_Counter timed_checkpoints;
 	PgStat_Counter requested_checkpoints;
 	PgStat_Counter checkpoint_write_time;		/* times in milliseconds */
@@ -622,6 +643,28 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	TimestampTz reset_timestamp;
+} PgStat_BgWriterGlobalStats;
+
+/*
+ * Global statistics for LWLocks
+ */
+typedef struct PgStat_LWLockGlobalStats
+{
+	PgStat_LWLockEntry lwlock_stat[NumFixedLWLocks+1];
+	TimestampTz reset_timestamp;
+} PgStat_LWLockGlobalStats;
+
+/*
+ * Global statistics kept in the stats collector
+ */
+typedef struct PgStat_GlobalStats
+{
+	TimestampTz stats_timestamp;	/* time of stats file update */
+
+	PgStat_BgWriterGlobalStats bgwriterstats;
+	PgStat_LWLockGlobalStats lwlockstats;
+
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 82d8ec4..a101db5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -119,4 +119,14 @@ extern void CreateLWLocks(void);
 
 extern void RequestAddinLWLocks(int n);
 
+extern uint64 lwlock_get_stat_calls_local(LWLockId);
+extern uint64 lwlock_get_stat_waits_local(LWLockId);
+extern uint64 lwlock_get_stat_time_ms_local(LWLockId);
+extern void lwlock_reset_stat_local(LWLockId);
+
+extern uint64 lwlock_get_stat_calls_global(LWLockId);
+extern uint64 lwlock_get_stat_waits_global(LWLockId);
+extern uint64 lwlock_get_stat_time_ms_global(LWLockId);
+extern void lwlock_reset_stat_global(LWLockId);
+
 #endif   /* LWLOCK_H */
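The accessor pairs declared above suggest per-backend ("local") counters alongside collector-aggregated ("global") ones. A minimal standalone sketch of how the local side might work; N_LWLOCKS, the struct layout, and lwlock_stat_record() are illustrative inventions, not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-backend counters backing the *_local() accessors. */
#define N_LWLOCKS 64

typedef struct LWLockLocalStat
{
    uint64_t calls;      /* LWLockAcquire() invocations */
    uint64_t waits;      /* acquisitions that had to block */
    uint64_t waited_ms;  /* total time spent blocked, in milliseconds */
} LWLockLocalStat;

static LWLockLocalStat local_stats[N_LWLOCKS];

/* Called from the acquire path: always count the call; count the wait
 * (and its duration) only when the lock was contended. */
void lwlock_stat_record(int lockid, int waited, uint64_t wait_ms)
{
    local_stats[lockid].calls++;
    if (waited)
    {
        local_stats[lockid].waits++;
        local_stats[lockid].waited_ms += wait_ms;
    }
}

uint64_t lwlock_get_stat_calls_local(int lockid)   { return local_stats[lockid].calls; }
uint64_t lwlock_get_stat_waits_local(int lockid)   { return local_stats[lockid].waits; }
uint64_t lwlock_get_stat_time_ms_local(int lockid) { return local_stats[lockid].waited_ms; }

void lwlock_reset_stat_local(int lockid)
{
    local_stats[lockid].calls = 0;
    local_stats[lockid].waits = 0;
    local_stats[lockid].waited_ms = 0;
}
```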
#9Fujii Masao
masao.fujii@gmail.com
In reply to: Satoshi Nagayasu (#8)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

Hi,

2012/10/13 23:05, Satoshi Nagayasu wrote:

Hi all,

I have fixed my previous patch for pg_stat_lwlocks view, and
as Josh commented, it now supports local and global (shared)
statistics in the same system view.

Sorry, I found my mistakes. New fixed one is attached to this mail.

Thanks for revising the patch. Here are the comments:

The document needs to be updated.

The patch caused the following compile warnings in my machine.

pgstat.c:1357: warning: no previous prototype for 'pgstat_report_lwlockstat'
postgres.c:3922: warning: implicit declaration of function
'pgstat_report_lwlockstat'
pgstatfuncs.c:1854: warning: no previous prototype for 'pg_stat_reset_lwlocks'

In my test, this patch caused a measurable performance overhead.
I created the test database with pgbench -s10 and ran pgbench -c8 -j8 -T60 -S.
Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

So I think that tracking lwlock usage should be enabled only when
trace_lwlocks is enabled, so that a user who is not interested in
lwlock usage can avoid such performance overhead.

As far as I can tell from the patch, only lwlock usage by backends is collected.
Why isn't lwlock usage by autovacuum workers and auxiliary
processes collected?

Regards,

--
Fujii Masao

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#9)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sun, Oct 14, 2012 at 3:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

Hi,

2012/10/13 23:05, Satoshi Nagayasu wrote:

Hi all,

I have fixed my previous patch for pg_stat_lwlocks view, and
as Josh commented, it now supports local and global (shared)
statistics in the same system view.

Sorry, I found my mistakes. New fixed one is attached to this mail.

Thanks for revising the patch. Here are the comments:

The document needs to be updated.

The patch caused the following compile warnings in my machine.

pgstat.c:1357: warning: no previous prototype for 'pgstat_report_lwlockstat'
postgres.c:3922: warning: implicit declaration of function
'pgstat_report_lwlockstat'
pgstatfuncs.c:1854: warning: no previous prototype for 'pg_stat_reset_lwlocks'

In my test, this patch caused the measurable performance overhead.
I created the test database by pgbench -s10 and ran pgbench -c8 -j8 -T60 -S.
Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

Oops! Obviously I copied and pasted the test result wrongly...
Here is the right result.

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1092400
tps = 18179.498013 (including connections establishing)
tps = 18182.450824 (excluding connections establishing)

Another comment: are local_calls/waits/time_ms really required?
I'm not sure how that info would help with performance debugging.

Regards,

--
Fujii Masao

#11Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#10)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sun, Oct 14, 2012 at 6:00 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Oct 14, 2012 at 3:34 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu <snaga@uptime.jp>

wrote:

Hi,

2012/10/13 23:05, Satoshi Nagayasu wrote:

Hi all,

I have fixed my previous patch for pg_stat_lwlocks view, and
as Josh commented, it now supports local and global (shared)
statistics in the same system view.

Sorry, I found my mistakes. New fixed one is attached to this mail.

Thanks for revising the patch. Here are the comments:

The document needs to be updated.

The patch caused the following compile warnings in my machine.

pgstat.c:1357: warning: no previous prototype for

'pgstat_report_lwlockstat'

postgres.c:3922: warning: implicit declaration of function
'pgstat_report_lwlockstat'
pgstatfuncs.c:1854: warning: no previous prototype for

'pg_stat_reset_lwlocks'

In my test, this patch caused the measurable performance overhead.
I created the test database by pgbench -s10 and ran pgbench -c8 -j8 -T60

-S.

Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

Oops! Obviously I copied and pasted the test result wrongly...
Here is the right result.

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1092400
tps = 18179.498013 (including connections establishing)
tps = 18182.450824 (excluding connections establishing)

Is the performance difference due only to the mutex lock being taken?

Another comment is; local_calls/waits/time_ms are really required?
I'm not sure how those info would help the performance debugging.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Michael Paquier
http://michael.otacoo.com

#12Satoshi Nagayasu
snaga@uptime.jp
In reply to: Michael Paquier (#11)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Thanks for the review.

2012/10/14 8:55, Michael Paquier wrote:

On Sun, Oct 14, 2012 at 6:00 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Oct 14, 2012 at 3:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

Hi,

2012/10/13 23:05, Satoshi Nagayasu wrote:

Hi all,

I have fixed my previous patch for pg_stat_lwlocks view, and
as Josh commented, it now supports local and global (shared)
statistics in the same system view.

Sorry, I found my mistakes. New fixed one is attached to this mail.

Thanks for revising the patch. Here are the comments:

The document needs to be updated.

The patch caused the following compile warnings in my machine.

pgstat.c:1357: warning: no previous prototype for

'pgstat_report_lwlockstat'

postgres.c:3922: warning: implicit declaration of function
'pgstat_report_lwlockstat'
pgstatfuncs.c:1854: warning: no previous prototype for

'pg_stat_reset_lwlocks'

Oops. I just fixed them. Thanks.

In my test, this patch caused the measurable performance overhead.
I created the test database by pgbench -s10 and ran pgbench -c8

-j8 -T60 -S.

Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

Oops! Obviously I copied and pasted the test result wrongly...
Here is the right result.

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1092400
tps = 18179.498013 (including connections establishing)
tps = 18182.450824 (excluding connections establishing)

Performance difference is due to only the mutex lock taken?

I think it comes from high-frequency reporting through the
pgstat collector process, i.e., calling
pgstat_report_lwlockstat() in PostgresMain().

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 585db1a..5ca2c6f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3919,6 +3919,8 @@ PostgresMain(int argc, char *argv[], const char *username)
 				pgstat_report_activity(STATE_IDLE, NULL);
 			}
 
+			pgstat_report_lwlockstat();
+
 			ReadyForQuery(whereToSendOutput);
 			send_ready_for_query = false;
 		}

When I reduced (or simply disabled) reporting, the results
show that performance is not affected by this patch.

Here are some additional results of the performance test
which is the same one Fujii-san did.

HEAD
====
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

pg_stat_lwlocks patch (reporting enabled)
=========================================
number of transactions actually processed: 2665550
tps = 44425.038125 (including connections establishing)
tps = 44430.565651 (excluding connections establishing)

pg_stat_lwlocks patch (reporting disabled)
==========================================
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

pg_stat_lwlocks patch (reporting reduced 1/100)
===============================================
number of transactions actually processed: 3421879
tps = 57030.660814 (including connections establishing)
tps = 57038.950498 (excluding connections establishing)

So, I think some additional hack is needed to reduce reporting.
Would that be acceptable in terms of performance?
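The "reporting reduced 1/100" variant above can be sketched as a simple call-count throttle; a minimum time interval (as pgstat_report_stat() already applies via PGSTAT_STAT_INTERVAL) would work similarly. A hypothetical standalone sketch, with invented names:

```c
#include <assert.h>

/* Hypothetical throttle: actually send the accumulated lwlock stats to the
 * collector only once every report_interval invocations. */
static int pending_calls = 0;

/* Returns 1 when a report should be sent now, 0 when it should be skipped. */
int lwlockstat_should_report(int report_interval)
{
    if (++pending_calls < report_interval)
        return 0;
    pending_calls = 0;
    return 1;
}
```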

Another comment is; local_calls/waits/time_ms are really required?
I'm not sure how those info would help the performance debugging.

I think there is a need to observe how your test query is
affected by the workload from other sessions, so splitting
local and shared statistics would be nice to have.
Just my thought, though.

Regards,

Regards,

--
Fujii Masao


--
Michael Paquier
http://michael.otacoo.com

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#13Fujii Masao
masao.fujii@gmail.com
In reply to: Satoshi Nagayasu (#12)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sun, Oct 14, 2012 at 10:46 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

HEAD
====
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

<snip>

pg_stat_lwlocks patch (reporting disabled)
==========================================
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed.
Would it be acceptable in terms of the performance?

Tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to the user to decide whether to trace lwlock usage,
e.g., via a trace_lwlock parameter, I think.
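Gating the counting itself behind an on/off parameter, as suggested here, could keep the disabled-case cost to a single branch. A hypothetical standalone sketch (trace_lwlocks as a plain int standing in for a GUC; all names invented):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: when the (invented) trace_lwlocks flag is off,
 * the acquire path pays only one predictable branch. */
int trace_lwlocks = 0;  /* would be a boolean GUC in a real patch */

static uint64_t lwlock_calls[64];

void lwlock_count_call(int lockid)
{
    if (!trace_lwlocks)
        return;              /* tracing disabled: near-zero cost */
    lwlock_calls[lockid]++;
}

uint64_t lwlock_get_calls(int lockid)
{
    return lwlock_calls[lockid];
}
```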

Another comment is; local_calls/waits/time_ms are really required?
I'm not sure how those info would help the performance debugging.

I think there are some needs to observe/determine how your test
query is affected by the other workload from the other sessions.
So, splitting local and shared statistics would be nice to have.
Just my thought though.

What I don't like is that a session can see only the local stats of its
own session. That makes local stats hard to monitor. Imagine the case
where you'd like to monitor the local stats of each pgbench session. To
monitor such stats, you would need to modify pgbench so that each of its
sessions monitors its own local stats. Even if you run monitoring
software, it cannot collect those stats because they don't belong to the
session that it uses.

Regards,

--
Fujii Masao

#14Satoshi Nagayasu
snaga@uptime.jp
In reply to: Fujii Masao (#13)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

(2012/10/14 13:26), Fujii Masao wrote:

On Sun, Oct 14, 2012 at 10:46 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

HEAD
====
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

<snip>

pg_stat_lwlocks patch (reporting disabled)
==========================================
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed.
Would it be acceptable in terms of the performance?

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling a performance
instrument to improve performance. A DBA must *always* monitor
performance metrics when running such a heavy workload.

But it's OK to add a parameter to enable/disable it.
Any other comments?

Another comment is; local_calls/waits/time_ms are really required?
I'm not sure how those info would help the performance debugging.

I think there are some needs to observe/determine how your test
query is affected by the other workload from the other sessions.
So, splitting local and shared statistics would be nice to have.
Just my thought though.

What I don't like is that a session can see only local stats of its own
session. It's hard to monitor local stats. Imagine the case where you'd
like to monitor local stats of each pgbench session. To monitor such
stats, you need to modify pgbench so that its each session monitors
its own local stats. Even if you run a monitoring software, it cannot
collect those stats because they don't belong to the session that it uses.

Ok. I'm waiting more comments from others.
Dropping it is easy for me, but any other comments? Josh?

Regards,

Regards,

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Satoshi Nagayasu (#14)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sun, Oct 14, 2012 at 7:52 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

(2012/10/14 13:26), Fujii Masao wrote:

On Sun, Oct 14, 2012 at 10:46 AM, Satoshi Nagayasu <snaga@uptime.jp>
wrote:

HEAD
====
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

<snip>

pg_stat_lwlocks patch (reporting disabled)
==========================================
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed.
Would it be acceptable in terms of the performance?

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling performance
instrument to improve performance. DBA must *always* monitor
the performance metrix when having such heavy workload.

But it's ok to add a parameter to switch enable/disable it.
Any other comments?

I thought that pg_stat_lwlocks would mostly be used for debugging
purposes. IOW, most users would disable it on a production system. But,
if you think it should be enabled even on production systems and that
most users would usually improve performance by using the view, then I
think you need to reconsider the lwlockid column in pg_stat_lwlocks.
It's hard to understand what kind of lwlock each lwlockid indicates,
which means it's hard for most users to investigate anything from the
current pg_stat_lwlocks. You might need to expose the lwlock name and
a short description....
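Exposing a readable name per lwlockid could be as simple as a lookup table joined into the view. A hypothetical sketch; the IDs below are made up for illustration, not the actual LWLockId values:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name table so pg_stat_lwlocks could display a readable
 * name instead of a bare lwlockid. IDs here are illustrative only. */
typedef struct LWLockName
{
    int         lockid;
    const char *name;
} LWLockName;

static const LWLockName lwlock_names[] = {
    {0, "BufFreelistLock"},
    {4, "ProcArrayLock"},
    {7, "WALInsertLock"},
    {8, "WALWriteLock"},
};

const char *lwlock_name(int lockid)
{
    for (size_t i = 0; i < sizeof(lwlock_names) / sizeof(lwlock_names[0]); i++)
        if (lwlock_names[i].lockid == lockid)
            return lwlock_names[i].name;
    return "unknown";
}
```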

Another comment is; local_calls/waits/time_ms are really required?
I'm not sure how those info would help the performance debugging.

I think there are some needs to observe/determine how your test
query is affected by the other workload from the other sessions.
So, splitting local and shared statistics would be nice to have.
Just my thought though.

What I don't like is that a session can see only local stats of its own
session. It's hard to monitor local stats. Imagine the case where you'd
like to monitor local stats of each pgbench session. To monitor such
stats, you need to modify pgbench so that its each session monitors
its own local stats. Even if you run a monitoring software, it cannot
collect those stats because they don't belong to the session that it uses.

Ok. I'm waiting more comments from others.
Dropping it is easy for me, but any other comments? Josh?

What I think would be useful is to make the stats collector also collect
local stats and report them per backend ID or distinct query (e.g., adding
such stats to pg_stat_statements, though that may not be a good idea),
which would enable every session to see any local stats.

Regards,

--
Fujii Masao

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Satoshi Nagayasu (#14)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling performance
instrument to improve performance. DBA must *always* monitor
the performance metrix when having such heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

regards, tom lane

#17Satoshi Nagayasu
snaga@uptime.jp
In reply to: Tom Lane (#16)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

2012/10/15 1:43, Tom Lane wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling performance
instrument to improve performance. DBA must *always* monitor
the performance metrix when having such heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Actually, I'm not a developer. I'm just a DBA, and I needed such
an instrument when I was asked to investigate strange WAL behavior
that produced unexpected/random commit delays under a heavy workload.

And another patch (the WAL dirty-flush statistics patch) I have
submitted comes from the same need.

https://commitfest.postgresql.org/action/patch_view?id=893

Unfortunately, since I didn't have such an instrument at that time,
I used SystemTap to investigate WAL behavior, including calls and
waited time, but using SystemTap was really awful, and I concluded
that PostgreSQL needs some "built-in" instrument for DBAs.

I needed to determine the bottleneck around WAL, such as lock contention
and/or the write performance of the device, but I couldn't find anything
without an instrument.

I accept that I'm focusing only on WAL-related lwlocks, and that this is
not enough for ordinary DBAs, but I still need it to understand PostgreSQL
behavior without deep knowledge of and experience with the WAL design and
implementation.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

There is another option for eliminating the performance overhead here.

As I tried in my first patch, instead of reporting through the pgstat
collector process, each backend could directly increment lwlock
counters allocated in shared memory.

http://archives.postgresql.org/message-id/4FE9A6F5.2080405@uptime.jp
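In outline, that direct-update approach might look like the following, using C11 atomics as a stand-in for whatever synchronization the first patch actually used (all names here are invented):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch: counters live in shared memory and each backend
 * increments them directly, so no message ever goes to the collector. */
typedef struct SharedLWLockStat
{
    atomic_uint_least64_t calls;
    atomic_uint_least64_t waits;
} SharedLWLockStat;

static SharedLWLockStat shared_stats[64];  /* would live in shmem */

void lwlock_shared_count(int lockid, int waited)
{
    atomic_fetch_add(&shared_stats[lockid].calls, 1);
    if (waited)
        atomic_fetch_add(&shared_stats[lockid].waits, 1);
}
```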

Here are some more benchmark results, including my first patch.

[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

[My first patch]
number of transactions actually processed: 3453745
tps = 57562.196971 (including connections establishing)
tps = 57569.197838 (excluding connections establishing)

Actually, I'm not sure why my patch makes PostgreSQL faster, :D
but the performance seems better than with my second patch.

I think it still needs some trick to keep the counters in "pgstat.stat"
across restarts, but it would be more acceptable in terms of
performance overhead.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#18Jeff Janes
jeff.janes@gmail.com
In reply to: Tom Lane (#16)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling performance
instrument to improve performance. DBA must *always* monitor
the performance metrix when having such heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of bulk loading, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic work-load driver and
run the non-standard compile there; either of those is going to
dramatically cut down on the participation.

Cheers,

Jeff

#19Satoshi Nagayasu
snaga@uptime.jp
In reply to: Jeff Janes (#18)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

2012/10/16 2:40, Jeff Janes wrote:

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling performance
instrument to improve performance. DBA must *always* monitor
the performance metrix when having such heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of the bulk, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic work-load driver and
run the non-standard compile there; either of those is going to
dramatically cut down on the participation.

Agreed.

The hardest part of investigating a performance issue is
reproducing the situation in an environment different
from production.

I often see people struggling to reproduce a situation
with different hardware and a (similar but) different
workload. It is very time consuming, and it often fails.

So, we need to collect every piece of information that
would help us understand what's going on within
production PostgreSQL, without any changes to
binaries or configuration in the production environment.

That's the reason why I stick to a "built-in" instrument,
and why I disagree with disabling such an instrument even
if it has a minor performance overhead.

A flight recorder must not be disabled. Collecting
performance data must be a top priority for a DBA.

Regards,

Cheers,

Jeff

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#20Robert Haas
robertmhaas@gmail.com
In reply to: Satoshi Nagayasu (#19)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Tue, Oct 16, 2012 at 11:31 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

A flight-recorder must not be disabled. Collecting
performance data must be top priority for DBA.

This analogy is inapposite, though, because a flight recorder rarely
crashes the aircraft. If it did, people might have second thoughts
about the "never disable the flight recorder" rule. I have had a
couple of different excuses to look into the overhead of timing
lately, and it does indeed seem that on many modern Linux boxes even
extremely frequent gettimeofday calls produce only very modest amounts
of overhead. Sadly, the situation on Windows doesn't look so good. I
don't remember the exact numbers but I think it was something like 40
or 60 or 80 times slower on the Windows box one of my colleagues
tested than it is on Linux. And it turns out that that overhead
really is measurable and does matter if you do it in a code path that
gets run frequently. Of course I am enough of a Linux geek that I
don't use Windows myself and curse my fate when I do have to use it,
but the reality is that we have a huge base of users who only use
PostgreSQL at all because it runs on Windows, and we can't just throw
those people under the bus. I think that older platforms like HP/UX
likely have problems in this area as well although I confess to not
having tested.

That having been said, if we're going to do this, this is probably the
right approach, because it only calls gettimeofday() in the case where
the lock acquisition is contended, and that is a lot cheaper than
calling it in all cases. Maybe it's worth finding a platform where
pg_test_timing reports that timing is very slow and then measuring how
much impact this has on something like a pgbench or pgbench -S
workload. We might find that it is in fact negligible. I'm pretty
certain that it will be almost if not entirely negligible on Linux but
that's not really the case we need to worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Ants Aasma
ants@cybertec.at
In reply to: Robert Haas (#20)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Thu, Oct 18, 2012 at 10:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Sadly, the situation on Windows doesn't look so good. I
don't remember the exact numbers but I think it was something like 40
or 60 or 80 times slower on the Windows box one of my colleagues
tested than it is on Linux.

Do you happen to know the hardware and Windows version? Windows'
QueryPerformanceCounter, which instr_time.h uses, should use RDTSC-based
timing when the hardware supports it, just like Linux. I don't know
whether Windows can avoid the syscall overhead, though.

Maybe it's worth finding a platform where
pg_test_timing reports that timing is very slow and then measuring how
much impact this has on something like a pgbench or pgbench -S
workload.

This can easily be tested on Linux by changing to the hpet or acpi_pm
clocksource. There probably still are platforms that can do worse than
this, but probably not by orders of magnitude.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

#22Fujii Masao
masao.fujii@gmail.com
In reply to: Satoshi Nagayasu (#19)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Wed, Oct 17, 2012 at 12:31 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

2012/10/16 2:40, Jeff Janes wrote:

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage,
e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling a performance
instrument to improve performance. A DBA must *always* monitor
performance metrics under such a heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of the bulk, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic work-load driver and
run the non-standard compile there; either of those is going to
dramatically cut down on the participation.

Agreed.

The hardest part of investigating a performance issue is
reproducing the situation in an environment different
from the production environment.

I often see people struggling to reproduce a situation
with different hardware and a (similar but) different
workload. It is very time-consuming, and it often
fails.

So we need to collect every piece of information that
would help us understand what's going on inside
the production PostgreSQL, without any changes to
binaries or configuration in the production environment.

That's the reason I stick to a "built-in" instrument,
and why I disagree with disabling such an instrument
even if it has a minor performance overhead.

A flight-recorder must not be disabled. Collecting
performance data must be a top priority for the DBA.

pg_stat_lwlocks does not seem to be an adequate 'flight-recorder'. It
collects only narrow performance data concerning lwlocks. What we
should have as a 'flight-recorder' is something like Oracle's wait
events, I think. Not only lwlocks but all wait events should be
collected so that the DBA can investigate performance bottlenecks.
This idea was proposed by Itagaki-san before. Though he implemented
the sampling-profiler patch, it failed to be committed; I'm not sure
why. Anyway, I think this would be the more appropriate approach to
providing a 'flight-recorder' to DBAs.

Regards,

--
Fujii Masao

#23Satoshi Nagayasu
snaga@uptime.jp
In reply to: Robert Haas (#20)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

2012/10/19 4:36, Robert Haas wrote:

On Tue, Oct 16, 2012 at 11:31 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

A flight-recorder must not be disabled. Collecting
performance data must be a top priority for the DBA.

This analogy is inapposite, though, because a flight recorder rarely
crashes the aircraft. If it did, people might have second thoughts
about the "never disable the flight recorder" rule. I have had a
couple of different excuses to look into the overhead of timing
lately, and it does indeed seem that on many modern Linux boxes even
extremely frequent gettimeofday calls produce only very modest amounts
of overhead.

I agree that such a performance instrument needs to be
improved if it has a critical performance impact on
production environments. So I'm still looking for a better
implementation to decrease the performance impact.

However, the most important question here is: how can
we understand PostgreSQL behavior without digging into
tons of source code or having deep hacking skills?

Recently, I've been having lots of conversations with
database specialists (sales guys and DBAs) trying to use
PostgreSQL instead of a commercial database, and they are
always struggling to understand PostgreSQL behavior,
because no one can observe and/or explain it.

Sadly, the situation on Windows doesn't look so good. I
don't remember the exact numbers but I think it was something like 40
or 60 or 80 times slower on the Windows box one of my colleagues
tested than it is on Linux. And it turns out that that overhead
really is measurable and does matter if you do it in a code path that
gets run frequently. Of course I am enough of a Linux geek that I
don't use Windows myself and curse my fate when I do have to use it,
but the reality is that we have a huge base of users who only use
PostgreSQL at all because it runs on Windows, and we can't just throw
those people under the bus. I think that older platforms like HP/UX
likely have problems in this area as well although I confess to not
having tested.

Do you mean my stats patch should get more performance testing
on other platforms? Yes, I agree with that.

That having been said, if we're going to do this, this is probably the
right approach, because it only calls gettimeofday() in the case where
the lock acquisition is contended, and that is a lot cheaper than
calling it in all cases. Maybe it's worth finding a platform where
pg_test_timing reports that timing is very slow and then measuring how
much impact this has on something like a pgbench or pgbench -S
workload. We might find that it is in fact negligible. I'm pretty
certain that it will be almost if not entirely negligible on Linux but
that's not really the case we need to worry about.

Thanks for suggesting a better implementation.
As I mentioned above, I'm always looking for a better idea
and solution to meet our purpose.

Here, I want to share my concern with you again.
PostgreSQL is getting more complicated in order to improve
performance and stability, which may be good news.
But it is also getting more difficult to understand without
deep knowledge these days, and that is actually bad news.

From my point of view, that's a huge hurdle to educating DBAs
and expanding the PostgreSQL user base.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#24Satoshi Nagayasu
snaga@uptime.jp
In reply to: Fujii Masao (#22)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

2012/10/19 23:48, Fujii Masao wrote:

On Wed, Oct 17, 2012 at 12:31 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

2012/10/16 2:40, Jeff Janes wrote:

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage,
e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling a performance
instrument to improve performance. A DBA must *always* monitor
performance metrics under such a heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of the bulk, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic work-load driver and
run the non-standard compile there; either of those is going to
dramatically cut down on the participation.

Agreed.

The hardest part of investigating a performance issue is
reproducing the situation in an environment different
from the production environment.

I often see people struggling to reproduce a situation
with different hardware and a (similar but) different
workload. It is very time-consuming, and it often
fails.

So we need to collect every piece of information that
would help us understand what's going on inside
the production PostgreSQL, without any changes to
binaries or configuration in the production environment.

That's the reason I stick to a "built-in" instrument,
and why I disagree with disabling such an instrument
even if it has a minor performance overhead.

A flight-recorder must not be disabled. Collecting
performance data must be a top priority for the DBA.

pg_stat_lwlocks does not seem to be an adequate 'flight-recorder'. It
collects only narrow performance data concerning lwlocks. What we
should have as a 'flight-recorder' is something like Oracle's wait
events, I think. Not only lwlocks but all wait events should be
collected so that the DBA can investigate performance bottlenecks.

That's why I said "I accept that it's not enough
for DBAs", and I'm going to work on other lock stats.

This idea was proposed by Itagaki-san before. Though he implemented
the sampling-profiler patch, it failed to be committed; I'm not sure
why.

Yeah, I know the previous patch posted by Itagaki-san.
So I'm questioning why (again) this time.
I think this is a very important question, because it would
be critical for bringing new DBAs to PostgreSQL.

Anyway, I think this would be the more appropriate approach to
providing a 'flight-recorder' to DBAs.

Ok, I guess we have reached the consensus to have
"some flight-recorder". Right?

Actually, it seems a great progress from my point of view. :)

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Satoshi Nagayasu (#24)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Satoshi Nagayasu <snaga@uptime.jp> writes:

Ok, I guess we have reached the consensus to have
"some flight-recorder". Right?

I remain unconvinced. I think the argument that this will promote
novice understanding is complete hogwash: making any sense of
lwlock-level stats will require deep PG understanding, not create it.
And despite Jeff's optimistic views upthread, I don't think very many
people have or will acquire such understanding. So I'm really resistant
to accepting even minimal overhead in pursuit of this goal --- and I
do not believe the overhead will be minimal, either.

regards, tom lane

#26Robert Haas
robertmhaas@gmail.com
In reply to: Satoshi Nagayasu (#23)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Fri, Oct 19, 2012 at 11:52 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

I agree that such a performance instrument needs to be
improved if it has a critical performance impact on
production environments. So I'm still looking for a better
implementation to decrease the performance impact.

Frankly, I think the approach of simply providing an "off" switch is
probably a good one. I admit that it would be nice if we could run
with instrumentation like this all the time - but until very fast
time-lookups become ubiquitous, we can't.

Another architectural problem here is that I believe this will
increase the size of the stats file, which I think is going to cause
pain for some people. I suspect that's going to be an issue even if
we have an "off" switch. I think somebody's going to have to figure
out a way to split that file up somehow.

However, the most important question here is: how can
we understand PostgreSQL behavior without digging into
tons of source code or having deep hacking skills?

Recently, I've been having lots of conversations with
database specialists (sales guys and DBAs) trying to use
PostgreSQL instead of a commercial database, and they are
always struggling to understand PostgreSQL behavior,
because no one can observe and/or explain it.

Agreed.

Sadly, the situation on Windows doesn't look so good. I
don't remember the exact numbers but I think it was something like 40
or 60 or 80 times slower on the Windows box one of my colleagues
tested than it is on Linux. And it turns out that that overhead
really is measurable and does matter if you do it in a code path that
gets run frequently. Of course I am enough of a Linux geek that I
don't use Windows myself and curse my fate when I do have to use it,
but the reality is that we have a huge base of users who only use
PostgreSQL at all because it runs on Windows, and we can't just throw
those people under the bus. I think that older platforms like HP/UX
likely have problems in this area as well although I confess to not
having tested.

Do you mean my stats patch should get more performance testing
on other platforms? Yes, I agree with that.

Yes.

That having been said, if we're going to do this, this is probably the
right approach, because it only calls gettimeofday() in the case where
the lock acquisition is contended, and that is a lot cheaper than
calling it in all cases. Maybe it's worth finding a platform where
pg_test_timing reports that timing is very slow and then measuring how
much impact this has on something like a pgbench or pgbench -S
workload. We might find that it is in fact negligible. I'm pretty
certain that it will be almost if not entirely negligible on Linux but
that's not really the case we need to worry about.

Thanks for suggesting a better implementation.
As I mentioned above, I'm always looking for a better idea
and solution to meet our purpose.

Actually, I meant that your existing implementation seemed to be
making some good decisions.

Here, I want to share my concern with you again.
PostgreSQL is getting more complicated in order to improve
performance and stability, which may be good news.
But it is also getting more difficult to understand without
deep knowledge these days, and that is actually bad news.

From my point of view, that's a huge hurdle to educating DBAs
and expanding the PostgreSQL user base.

Yes, there is definitely more work to be done, here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Satoshi Nagayasu
snaga@uptime.jp
In reply to: Tom Lane (#25)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

2012/10/20 2:45, Tom Lane wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

Ok, I guess we have reached the consensus to have
"some flight-recorder". Right?

I remain unconvinced. I think the argument that this will promote
novice understanding is complete hogwash: making any sense of
lwlock-level stats will require deep PG understanding, not create it.
And despite Jeff's optimistic views upthread, I don't think very many
people have or will acquire such understanding. So I'm really resistant
to accepting even minimal overhead in pursuit of this goal --- and I
do not believe the overhead will be minimal, either.

As I wrote "some", I'm not actually pushing the "lwlock stats"
itself in this context, though I still believe it would be useful.
(I don't think it is hogwash, because I needed it.)

I'm just thinking about how to help DBAs understand PostgreSQL
behavior without deep dives into the code when they face some kind
of performance issue, and also how to make it easier for PG
experts, including tech support people, to help newbies.

As I posted before, I investigated WAL performance when I faced
random commit delays, and I found a problem with the storage
device by observing WALInsertLock and WALWriteLock statistics
using SystemTap.

How could I have done that without SystemTap on the production
database? SystemTap would kill performance (more than my patch),
and sometimes the process, too.

Do you really think that we do not need any kind of lock
statistics anymore?

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#28Fujii Masao
masao.fujii@gmail.com
In reply to: Satoshi Nagayasu (#24)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

On Sat, Oct 20, 2012 at 1:03 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

2012/10/19 23:48, Fujii Masao wrote:

On Wed, Oct 17, 2012 at 12:31 AM, Satoshi Nagayasu <snaga@uptime.jp>
wrote:

2012/10/16 2:40, Jeff Janes wrote:

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Satoshi Nagayasu <snaga@uptime.jp> writes:

(2012/10/14 13:26), Fujii Masao wrote:

The tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not
available.
It should be up to a user to decide whether to trace lwlock usage,
e.g.,
by using trace_lwlock parameter, I think.

Frankly speaking, I do not agree with disabling a performance
instrument to improve performance. A DBA must *always* monitor
performance metrics under such a heavy workload.

This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus). It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.

Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of the bulk, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.

So I feel it isn't something that should be turned on in production
builds. I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.

I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic work-load driver and
run the non-standard compile there; either of those is going to
dramatically cut down on the participation.

Agreed.

The hardest part of investigating a performance issue is
reproducing the situation in an environment different
from the production environment.

I often see people struggling to reproduce a situation
with different hardware and a (similar but) different
workload. It is very time-consuming, and it often
fails.

So we need to collect every piece of information that
would help us understand what's going on inside
the production PostgreSQL, without any changes to
binaries or configuration in the production environment.

That's the reason I stick to a "built-in" instrument,
and why I disagree with disabling such an instrument
even if it has a minor performance overhead.

A flight-recorder must not be disabled. Collecting
performance data must be a top priority for the DBA.

pg_stat_lwlocks does not seem to be an adequate 'flight-recorder'. It
collects only narrow performance data concerning lwlocks. What we
should have as a 'flight-recorder' is something like Oracle's wait
events, I think. Not only lwlocks but all wait events should be
collected so that the DBA can investigate performance bottlenecks.

That's why I said "I accept that it's not enough
for DBAs", and I'm going to work on other lock stats.

This idea was proposed by Itagaki-san before. Though he implemented
the sampling-profiler patch, it failed to be committed; I'm not sure
why.

Yeah, I know the previous patch posted by Itagaki-san.
So I'm questioning why (again) this time.
I think this is a very important question, because it would
be critical for bringing new DBAs to PostgreSQL.

Anyway, I think this would be the more appropriate approach to
providing a 'flight-recorder' to DBAs.

Ok, I guess we have reached the consensus to have
"some flight-recorder". Right?

Yes, at least I agree to adding that. But whatever is added, if it may
have any impact on performance, an on-off switch must be provided.
Also, the default value of the switch should be off until we have
implemented a low-overhead instrument and agreed to enable it
by default.
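To make the suggestion concrete, such a switch would presumably be a GUC along the lines of the trace_lwlock parameter Fujii proposed upthread; the exact name and placement below are hypothetical, not a committed parameter:

```
# postgresql.conf (hypothetical sketch, not an actual parameter)
trace_lwlocks = off   # collect per-lwlock wait statistics when on
```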

Regards,

--
Fujii Masao

#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#26)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Robert Haas escribió:

On Fri, Oct 19, 2012 at 11:52 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

I agree that such a performance instrument needs to be
improved if it has a critical performance impact on
production environments. So I'm still looking for a better
implementation to decrease the performance impact.

Frankly, I think the approach of simply providing an "off" switch is
probably a good one. I admit that it would be nice if we could run
with instrumentation like this all the time - but until very fast
time-lookups become ubiquitous, we can't.

Another architectural problem here is that I believe this will
increase the size of the stats file, which I think is going to cause
pain for some people. I suspect that's going to be an issue even if
we have an "off" switch. I think somebody's going to have to figure
out a way to split that file up somehow.

I am closing this patch as Returned with Feedback for now. Hopefully a
version can be submitted for the next commitfest with the "off" switch,
but I am not sure about burdening the submitter with the responsibility
of splitting the stats file.

Regarding Tom's objection to the fundamental issue of providing lwlock
data, I agree that maybe it's the wrong layer to be measuring to provide
data to DBAs, but not providing any data is worse, because then even PG
developers cannot know what the real bottlenecks are; and it's hard to
see what other layer we should be measuring. Maybe this can serve as a
foundation for discovering useful things to provide in the future.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#29)
Re: pg_stat_lwlocks view - lwlocks statistics, round 2

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Regarding Tom's objection to the fundamental issue of providing lwlock
data, I agree that maybe it's the wrong layer to be measuring to provide
data to DBAs, but not providing any data is worse, because then even PG
developers cannot know what the real bottlenecks are; and it's hard to
see what other layer we should be measuring. Maybe this can serve as a
foundation for discovering useful things to provide in the future.

FWIW, I am not objecting to having the *ability* to collect such data.
I am questioning the usefulness/wisdom of having it turned on by
default, and I am also concerned about whether there is residual
overhead even when it's not turned on.

regards, tom lane