Re: How much expensive are row level statistics?

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Merlin Moncure (#1)

"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:

The cost depends on your usage patterns. I did tests with one of
my applications and saw no significant performance difference for
simple selects, but a series of insert/update/delete operations ran
about 30% slower when block- and row-level statistics were enabled
versus when the statistics collector was disabled.

That approximately confirms my results, except that the penalty may even
be a little bit higher in the worst-case scenario. Row level stats hit
the hardest if you are doing 1 row at a time operations over a
persistent connection.

IIRC, the only significant cost from enabling stats is the cost of
transmitting the counts to the stats collector, which is a cost
basically paid once at each transaction commit. So short transactions
will definitely have more overhead than longer ones. Even for a really
simple transaction, though, 30% seems high --- the stats code is
designed deliberately to minimize the penalty.

regards, tom lane

Michael Fuhr

mike@fuhr.org

about 20 years ago

In reply to: Tom Lane (#3)

On Mon, Dec 12, 2005 at 06:01:01PM -0500, Tom Lane wrote:

IIRC, the only significant cost from enabling stats is the cost of
transmitting the counts to the stats collector, which is a cost
basically paid once at each transaction commit. So short transactions
will definitely have more overhead than longer ones. Even for a really
simple transaction, though, 30% seems high --- the stats code is
designed deliberately to minimize the penalty.

Now there goes Tom with his skeptical eye again, and here comes me
saying "oops" again. Further tests show that for this application
the killer is stats_command_string, not stats_block_level or
stats_row_level. Here are timings for the same set of operations
(thousands of insert, update, and delete statements in one transaction)
run under various settings:

stats_command_string = off
stats_block_level = off
stats_row_level = off
time: 2:09.46

stats_command_string = off
stats_block_level = on
stats_row_level = off
time: 2:12.28

stats_command_string = off
stats_block_level = on
stats_row_level = on
time: 2:14.38

stats_command_string = on
stats_block_level = off
stats_row_level = off
time: 2:50.58

stats_command_string = on
stats_block_level = on
stats_row_level = on
time: 2:53.76

[Wanders off, swearing that he ran these tests before and saw higher
penalties for block- and row-level statistics.]

--
Michael Fuhr

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Michael Fuhr (#4)

Michael Fuhr <mike@fuhr.org> writes:

Further tests show that for this application
the killer is stats_command_string, not stats_block_level or
stats_row_level.

I tried it with pgbench -c 10, and got these results:
41% reduction in TPS rate for stats_command_string
9% reduction in TPS rate for stats_block/row_level (any combination)

strace'ing a backend confirms my belief that stats_block/row_level send
just one stats message per transaction (at least for the relatively
small number of tables touched per transaction by pgbench). However
stats_command_string sends 14(!) --- there are seven commands per
pgbench transaction and each results in sending a <command> message and
later an <IDLE> message.

Given the rather lackadaisical way in which the stats collector makes
the data available, it seems like the backends are being much too
enthusiastic about posting their stats_command_string status
immediately. Might be worth thinking about how to cut back the
overhead by suppressing some of these messages.

regards, tom lane

Michael Fuhr

mike@fuhr.org

about 20 years ago

In reply to: Tom Lane (#5)

On Mon, Dec 12, 2005 at 10:20:45PM -0500, Tom Lane wrote:

Given the rather lackadaisical way in which the stats collector makes
the data available, it seems like the backends are being much too
enthusiastic about posting their stats_command_string status
immediately. Might be worth thinking about how to cut back the
overhead by suppressing some of these messages.

Would a GUC setting akin to log_min_duration_statement be feasible?
Does the backend support, or could it be easily modified to support,
a mechanism that would post the command string after a configurable
amount of time had expired, and then continue processing the query?
That way admins could avoid the overhead of posting messages for
short-lived queries that nobody's likely to see in pg_stat_activity
anyway.

--
Michael Fuhr

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Michael Fuhr (#6)

Michael Fuhr <mike@fuhr.org> writes:

Does the backend support, or could it be easily modified to support,
a mechanism that would post the command string after a configurable
amount of time had expired, and then continue processing the query?

Not really, unless you want to add the overhead of setting a timer
interrupt for every query. Which is sort of counterproductive when
the motivation is to reduce overhead ...

(It might be more or less free if you have statement_timeout set, since
there would be a setitimer call anyway. But I don't think that's the
norm.)

regards, tom lane

Kevin Brown

kevin@sysexperts.com

about 20 years ago

In reply to: Tom Lane (#7)

Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Does the backend support, or could it be easily modified to support,
a mechanism that would post the command string after a configurable
amount of time had expired, and then continue processing the query?

Not really, unless you want to add the overhead of setting a timer
interrupt for every query. Which is sort of counterproductive when
the motivation is to reduce overhead ...

(It might be more or less free if you have statement_timeout set, since
there would be a setitimer call anyway. But I don't think that's the
norm.)

Actually, it's probably not necessary to set the timer at the
beginning of every query. It's probably sufficient to just have it go
off periodically, e.g. once every second, and thus set it when the
timer goes off. And the running command wouldn't need to be re-posted
if it's the same as last time around. Turn off the timer if the
connection is idle now and was idle last time around (or not, if
there's no harm in having the timer running all the time), turn it on
again at the start of the next transaction.

In essence, the backend would be "polling" itself every second or so
and recording its state at that time, rather than on every
transaction.

Assuming that doing all that wouldn't screw something else up...

--
Kevin Brown kevin@sysexperts.com

Simon Riggs

simon@2ndquadrant.com

about 20 years ago

In reply to: Tom Lane (#7)

On Thu, 2005-12-15 at 19:06 -0500, Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Does the backend support, or could it be easily modified to support,
a mechanism that would post the command string after a configurable
amount of time had expired, and then continue processing the query?

Not really, unless you want to add the overhead of setting a timer
interrupt for every query. Which is sort of counterproductive when
the motivation is to reduce overhead ...

(It might be more or less free if you have statement_timeout set, since
there would be a setitimer call anyway. But I don't think that's the
norm.)

We could do the deferred send fairly easily. You need only set a timer
when stats_command_string = on, so we'd only do that when requested by
the admin. Overall, that would be a cheaper way of doing it than now.

However, I'm more inclined to the idea of a set of functions that allow
an administrator to retrieve the full SQL text executing in a backend,
with an option to return an EXPLAIN of the currently executing plan.
Right now, stats only gives you the first 1000 chars, so you're always
stuck if its a big query. Plus we don't yet have a way of getting the
exact EXPLAIN of a running query (you can get close, but it could
differ).

Pull is better than push. Asking specific backends what they're doing
when you need to know will be efficient; asking them to send their
command strings, all of the time, deferred or not will always be more
wasteful. Plus if you forgot to turn on stats_command_string before
execution, then you've no way of knowing anyhow.

Best Regards, Simon Riggs

#10

pgman@candle.pha.pa.us

about 20 years ago

In reply to: Tom Lane (#5)

Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Further tests show that for this application
the killer is stats_command_string, not stats_block_level or
stats_row_level.

I tried it with pgbench -c 10, and got these results:
41% reduction in TPS rate for stats_command_string

Woh, 41%. That's just off the charts! What are we doing internally
that would cause that?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#11

pgman@candle.pha.pa.us

about 20 years ago

In reply to: Tom Lane (#5)

2 attachment(s)

Stats collector performance improvement

Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Further tests show that for this application
the killer is stats_command_string, not stats_block_level or
stats_row_level.

I tried it with pgbench -c 10, and got these results:
41% reduction in TPS rate for stats_command_string
9% reduction in TPS rate for stats_block/row_level (any combination)

strace'ing a backend confirms my belief that stats_block/row_level send
just one stats message per transaction (at least for the relatively
small number of tables touched per transaction by pgbench). However
stats_command_string sends 14(!) --- there are seven commands per
pgbench transaction and each results in sending a <command> message and
later an <IDLE> message.

Given the rather lackadaisical way in which the stats collector makes
the data available, it seems like the backends are being much too
enthusiastic about posting their stats_command_string status
immediately. Might be worth thinking about how to cut back the
overhead by suppressing some of these messages.

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.

I made a small test script:

if [ ! -f /tmp/pgstat.sql ]
then i=0
while [ $i -lt 10000 ]
do
i=`expr $i + 1`
echo "SELECT 1;"
done > /tmp/pgstat.sql
fi

time sql test </tmp/pgstat.sql >/dev/null

This sends 10,000 "SELECT 1" queries to the backend, and reports the
execution time. I found that without stats_command_string defined, it
ran in 3.5 seconds. With stats_command_string defined, it took 5.5
seconds, meaning the command string is causing a 57% slowdown. That is
way too much considering that the SELECT 1 has to be send from psql to
the backend, parsed, optimized, and executed, and the result returned to
the psql, while stats_command_string only has to send a string to a
backend collector. There is _no_ way that collector should take 57% of
the time it takes to run the actual query.

With the test program, I tried various options. The basic code we have
sends a UDP packet to a statistics buffer process, which recv()'s the
packet, puts it into a memory queue buffer, and writes it to a pipe()
that is read by the statistics collector process which processes the
packet.

I tried various ways of speeding up the buffer and collector processes.
I found if I put a pg_usleep(100) in the buffer process the backend
speed was good, but packets were lost. What I found worked well was to
do multiple recv() calls in a loop. The previous code did a select(),
then perhaps a recv() and pipe write() based on the results of the
select(). This caused many small packets to be written to the pipe and
the pipe write overhead seems fairly large. The best fix I found was to
loop over the recv() call at most 25 times, collecting a group of
packets that can then be sent to the collector in one pipe write. The
recv() socket is non-blocking, so a zero return indicates there are no
more packets available. Patch attached.

This change reduced the stats_command_string time from 5.5 to 3.9, which
is closer to the 3.5 seconds with stats_command_string off.

A second improvement I discovered is that the statistics collector is
calling gettimeofday() for every packet received, so it can determine
the timeout for the select() call to write the flat file. I removed
that behavior and instead used setitimer() to issue a SIGINT every
500ms, which was the original behavior. This eliminates the
gettimeofday() call and makes the code cleaner. Second patch attached.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/stattext/plainDownload

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.116
diff -c -c -r1.116 pgstat.c
*** src/backend/postmaster/pgstat.c	2 Jan 2006 00:58:00 -0000	1.116
--- src/backend/postmaster/pgstat.c	2 Jan 2006 18:36:43 -0000
***************
*** 1911,1916 ****
--- 1911,1918 ----
  	 */
  	for (;;)
  	{
+ loop_again:
+ 
  		FD_ZERO(&rfds);
  		FD_ZERO(&wfds);
  		maxfd = -1;
***************
*** 1970,2014 ****
  		 */
  		if (FD_ISSET(pgStatSock, &rfds))
  		{
! 			len = recv(pgStatSock, (char *) &input_buffer,
! 					   sizeof(PgStat_Msg), 0);
! 			if (len < 0)
! 				ereport(ERROR,
! 						(errcode_for_socket_access(),
! 						 errmsg("could not read statistics message: %m")));
! 
! 			/*
! 			 * We ignore messages that are smaller than our common header
! 			 */
! 			if (len < sizeof(PgStat_MsgHdr))
! 				continue;
! 
! 			/*
! 			 * The received length must match the length in the header
! 			 */
! 			if (input_buffer.msg_hdr.m_size != len)
! 				continue;
! 
  			/*
! 			 * O.K. - we accept this message.  Copy it to the circular
! 			 * msgbuffer.
  			 */
! 			frm = 0;
! 			while (len > 0)
  			{
! 				xfr = PGSTAT_RECVBUFFERSZ - msg_recv;
! 				if (xfr > len)
! 					xfr = len;
! 				Assert(xfr > 0);
! 				memcpy(msgbuffer + msg_recv,
! 					   ((char *) &input_buffer) + frm,
! 					   xfr);
! 				msg_recv += xfr;
! 				if (msg_recv == PGSTAT_RECVBUFFERSZ)
! 					msg_recv = 0;
! 				msg_have += xfr;
! 				frm += xfr;
! 				len -= xfr;
  			}
  		}
  
--- 1972,2033 ----
  		 */
  		if (FD_ISSET(pgStatSock, &rfds))
  		{
! 			int loops = 0;
! 			
  			/*
! 			 *	While pipewrite() can send multiple data packets, recv() pulls
! 			 *	only a single packet per call.  For busy systems, doing
! 			 *	multiple recv() calls and then one pipewrite() can improve
! 			 *	query speed by 40%.  25 was chosen because 25 packets should
! 			 *	easily fit in a single pipewrite() call.  recv()'s socket is
! 			 *	non-blocking.
  			 */
! 			while (++loops < 25 &&
! 				   (len = recv(pgStatSock, (char *) &input_buffer,
! 							   sizeof(PgStat_Msg), 0)) != 0)
  			{
! 				if (len < 0)
! 				{
! 					if (errno == EAGAIN)
! 						continue;
! 					ereport(ERROR,
! 							(errcode_for_socket_access(),
! 							 errmsg("could not read statistics message: %m")));
! 				}
! 	
! 				/*
! 				 * We ignore messages that are smaller than our common header
! 				 */
! 				if (len < sizeof(PgStat_MsgHdr))
! 					goto loop_again;
! 	
! 				/*
! 				 * The received length must match the length in the header
! 				 */
! 				if (input_buffer.msg_hdr.m_size != len)
! 					goto loop_again;
! 	
! 				/*
! 				 * O.K. - we accept this message.  Copy it to the circular
! 				 * msgbuffer.
! 				 */
! 				frm = 0;
! 				while (len > 0)
! 				{
! 					xfr = PGSTAT_RECVBUFFERSZ - msg_recv;
! 					if (xfr > len)
! 						xfr = len;
! 					Assert(xfr > 0);
! 					memcpy(msgbuffer + msg_recv,
! 						   ((char *) &input_buffer) + frm,
! 						   xfr);
! 					msg_recv += xfr;
! 					if (msg_recv == PGSTAT_RECVBUFFERSZ)
! 						msg_recv = 0;
! 					msg_have += xfr;
! 					frm += xfr;
! 					len -= xfr;
! 				}
  			}
  		}
  
***************
*** 2023,2029 ****
  		 * caught up, or because more data arrives so that we have more than
  		 * PIPE_BUF bytes buffered).  This is not good, but is there any way
  		 * around it?  We have no way to tell when the collector has caught
! 		 * up...
  		 */
  		if (FD_ISSET(writePipe, &wfds))
  		{
--- 2042,2048 ----
  		 * caught up, or because more data arrives so that we have more than
  		 * PIPE_BUF bytes buffered).  This is not good, but is there any way
  		 * around it?  We have no way to tell when the collector has caught
! 		 * up.  Followup, the pipe rarely fills up.
  		 */
  		if (FD_ISSET(writePipe, &wfds))
  		{

/pgpatches/stat2text/plainDownload

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.116
diff -c -c -r1.116 pgstat.c
*** src/backend/postmaster/pgstat.c	2 Jan 2006 00:58:00 -0000	1.116
--- src/backend/postmaster/pgstat.c	2 Jan 2006 18:21:28 -0000
***************
*** 145,150 ****
--- 145,151 ----
  static PgStat_StatBeEntry *pgStatBeTable = NULL;
  static int	pgStatNumBackends = 0;
  
+ static volatile bool	need_statwrite;
  
  /* ----------
   * Local function forward declarations
***************
*** 164,169 ****
--- 165,171 ----
  
  NON_EXEC_STATIC void PgstatBufferMain(int argc, char *argv[]);
  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
+ static void force_statwrite(SIGNAL_ARGS);
  static void pgstat_recvbuffer(void);
  static void pgstat_exit(SIGNAL_ARGS);
  static void pgstat_die(SIGNAL_ARGS);
***************
*** 1548,1560 ****
  	PgStat_Msg	msg;
  	fd_set		rfds;
  	int			readPipe;
- 	int			nready;
  	int			len = 0;
! 	struct timeval timeout;
! 	struct timeval next_statwrite;
! 	bool		need_statwrite;
  	HASHCTL		hash_ctl;
! 
  	MyProcPid = getpid();		/* reset MyProcPid */
  
  	/*
--- 1550,1560 ----
  	PgStat_Msg	msg;
  	fd_set		rfds;
  	int			readPipe;
  	int			len = 0;
! 	struct itimerval timeval;
  	HASHCTL		hash_ctl;
! 	bool		need_timer = false;
! 	
  	MyProcPid = getpid();		/* reset MyProcPid */
  
  	/*
***************
*** 1572,1578 ****
  	/* kluge to allow buffer process to kill collector; FIXME */
  	pqsignal(SIGQUIT, pgstat_exit);
  #endif
! 	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
--- 1572,1578 ----
  	/* kluge to allow buffer process to kill collector; FIXME */
  	pqsignal(SIGQUIT, pgstat_exit);
  #endif
! 	pqsignal(SIGALRM, force_statwrite);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
***************
*** 1597,1608 ****
  	init_ps_display("stats collector process", "", "");
  	set_ps_display("");
  
- 	/*
- 	 * Arrange to write the initial status file right away
- 	 */
- 	gettimeofday(&next_statwrite, NULL);
  	need_statwrite = TRUE;
  
  	/*
  	 * Read in an existing statistics stats file or initialize the stats to
  	 * zero.
--- 1597,1608 ----
  	init_ps_display("stats collector process", "", "");
  	set_ps_display("");
  
  	need_statwrite = TRUE;
  
+ 	MemSet(&timeval, 0, sizeof(struct itimerval));
+ 	timeval.it_value.tv_sec = PGSTAT_STAT_INTERVAL / 1000;
+ 	timeval.it_value.tv_usec = PGSTAT_STAT_INTERVAL % 1000;
+ 
  	/*
  	 * Read in an existing statistics stats file or initialize the stats to
  	 * zero.
***************
*** 1634,1667 ****
  	 */
  	for (;;)
  	{
- 		/*
- 		 * If we need to write the status file again (there have been changes
- 		 * in the statistics since we wrote it last) calculate the timeout
- 		 * until we have to do so.
- 		 */
  		if (need_statwrite)
  		{
! 			struct timeval now;
! 
! 			gettimeofday(&now, NULL);
! 			/* avoid assuming that tv_sec is signed */
! 			if (now.tv_sec > next_statwrite.tv_sec ||
! 				(now.tv_sec == next_statwrite.tv_sec &&
! 				 now.tv_usec >= next_statwrite.tv_usec))
! 			{
! 				timeout.tv_sec = 0;
! 				timeout.tv_usec = 0;
! 			}
! 			else
! 			{
! 				timeout.tv_sec = next_statwrite.tv_sec - now.tv_sec;
! 				timeout.tv_usec = next_statwrite.tv_usec - now.tv_usec;
! 				if (timeout.tv_usec < 0)
! 				{
! 					timeout.tv_sec--;
! 					timeout.tv_usec += 1000000;
! 				}
! 			}
  		}
  
  		/*
--- 1634,1644 ----
  	 */
  	for (;;)
  	{
  		if (need_statwrite)
  		{
! 			pgstat_write_statsfile();
! 			need_statwrite = false;
! 			need_timer = true;
  		}
  
  		/*
***************
*** 1673,1681 ****
  		/*
  		 * Now wait for something to do.
  		 */
! 		nready = select(readPipe + 1, &rfds, NULL, NULL,
! 						(need_statwrite) ? &timeout : NULL);
! 		if (nready < 0)
  		{
  			if (errno == EINTR)
  				continue;
--- 1650,1656 ----
  		/*
  		 * Now wait for something to do.
  		 */
! 		if (select(readPipe + 1, &rfds, NULL, NULL, NULL) < 0)
  		{
  			if (errno == EINTR)
  				continue;
***************
*** 1685,1702 ****
  		}
  
  		/*
- 		 * If there are no descriptors ready, our timeout for writing the
- 		 * stats file happened.
- 		 */
- 		if (nready == 0)
- 		{
- 			pgstat_write_statsfile();
- 			need_statwrite = FALSE;
- 
- 			continue;
- 		}
- 
- 		/*
  		 * Check if there is a new statistics message to collect.
  		 */
  		if (FD_ISSET(readPipe, &rfds))
--- 1660,1665 ----
***************
*** 1813,1829 ****
  			 */
  			pgStatNumMessages++;
  
! 			/*
! 			 * If this is the first message after we wrote the stats file the
! 			 * last time, setup the timeout that it'd be written.
! 			 */
! 			if (!need_statwrite)
  			{
! 				gettimeofday(&next_statwrite, NULL);
! 				next_statwrite.tv_usec += ((PGSTAT_STAT_INTERVAL) * 1000);
! 				next_statwrite.tv_sec += (next_statwrite.tv_usec / 1000000);
! 				next_statwrite.tv_usec %= 1000000;
! 				need_statwrite = TRUE;
  			}
  		}
  
--- 1776,1787 ----
  			 */
  			pgStatNumMessages++;
  
! 			if (need_timer)
  			{
! 				if (setitimer(ITIMER_REAL, &timeval, NULL))
! 					ereport(ERROR,
! 						  (errmsg("unable to set statistics collector timer: %m")));
! 				need_timer = false;
  			}
  		}
  
***************
*** 1848,1853 ****
--- 1806,1818 ----
  }
  
  
+ static void
+ force_statwrite(SIGNAL_ARGS)
+ {
+ 	need_statwrite = true;
+ }
+ 
+ 
  /* ----------
   * pgstat_recvbuffer() -
   *

#12

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Bruce Momjian (#11)

Re: Stats collector performance improvement

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I found if I put a pg_usleep(100) in the buffer process the backend
speed was good, but packets were lost. What I found worked well was to
do multiple recv() calls in a loop. The previous code did a select(),
then perhaps a recv() and pipe write() based on the results of the
select(). This caused many small packets to be written to the pipe and
the pipe write overhead seems fairly large. The best fix I found was to
loop over the recv() call at most 25 times, collecting a group of
packets that can then be sent to the collector in one pipe write. The
recv() socket is non-blocking, so a zero return indicates there are no
more packets available. Patch attached.

This seems incredibly OS-specific. How many platforms did you test it
on?

A more serious objection is that it will cause the stats machinery to
work very poorly if there isn't a steady stream of incoming messages.
You can't just sit on 24 messages until the 25th one arrives next week.

regards, tom lane

#13

pgman@candle.pha.pa.us

about 20 years ago

In reply to: Tom Lane (#12)

Re: Stats collector performance improvement

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I found if I put a pg_usleep(100) in the buffer process the backend
speed was good, but packets were lost. What I found worked well was to
do multiple recv() calls in a loop. The previous code did a select(),
then perhaps a recv() and pipe write() based on the results of the
select(). This caused many small packets to be written to the pipe and
the pipe write overhead seems fairly large. The best fix I found was to
loop over the recv() call at most 25 times, collecting a group of
packets that can then be sent to the collector in one pipe write. The
recv() socket is non-blocking, so a zero return indicates there are no
more packets available. Patch attached.

This seems incredibly OS-specific. How many platforms did you test it
on?

Only mine. I am posting the patch so others can test it, of course.

A more serious objection is that it will cause the stats machinery to
work very poorly if there isn't a steady stream of incoming messages.
You can't just sit on 24 messages until the 25th one arrives next week.

You wouldn't. It exits out of the loop on a not found, checks the pipe
write descriptor, and writes on it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#14

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Bruce Momjian (#11)

Re: Stats collector performance improvement

[ moving to -hackers ]

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.
[ ... proposed patch that seems pretty klugy to me ... ]

I wonder whether we shouldn't consider something more drastic, like
getting rid of the intermediate stats buffer process entirely.

The original design for the stats communication code was based on the
premise that it's better to drop data than to make backends wait on
the stats collector. However, as things have turned out I think this
notion is a flop: the people who are using stats at all want the stats
to be reliable. We've certainly seen plenty of gripes from people who
are unhappy that backend-exit messages got dropped, and anyone who's
using autovacuum would really like the tuple update counts to be pretty
solid too.

If we abandoned the unreliable-communication approach, could we build
something with less overhead?

regards, tom lane

#15

zhouqq@cs.toronto.edu

about 20 years ago

In reply to: Bruce Momjian (#11)

Re: Stats collector performance improvement

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

I wonder whether we shouldn't consider something more drastic, like
getting rid of the intermediate stats buffer process entirely.

The original design for the stats communication code was based on the
premise that it's better to drop data than to make backends wait on
the stats collector. However, as things have turned out I think this
notion is a flop: the people who are using stats at all want the stats
to be reliable. We've certainly seen plenty of gripes from people who
are unhappy that backend-exit messages got dropped, and anyone who's
using autovacuum would really like the tuple update counts to be pretty
solid too.

AFAICS if we can maintain the stats counts solid, then it may hurt
performance dramatically. Think if we maintain
pgstat_count_heap_insert()/pgstat_count_heap_delete() pretty well, then we
get a replacement of count(*). To do so, I believe that will add another
lock contention on the target table stats.

Regards,
Qingqing

#16

Hannu Krosing

hannu@skype.net

about 20 years ago

In reply to: Tom Lane (#14)

Re: Stats collector performance improvement

Ühel kenal päeval, E, 2006-01-02 kell 15:20, kirjutas Tom Lane:

[ moving to -hackers ]

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.
[ ... proposed patch that seems pretty klugy to me ... ]

I wonder whether we shouldn't consider something more drastic, like
getting rid of the intermediate stats buffer process entirely.

The original design for the stats communication code was based on the
premise that it's better to drop data than to make backends wait on
the stats collector. However, as things have turned out I think this
notion is a flop: the people who are using stats at all want the stats
to be reliable. We've certainly seen plenty of gripes from people who
are unhappy that backend-exit messages got dropped, and anyone who's
using autovacuum would really like the tuple update counts to be pretty
solid too.

If we abandoned the unreliable-communication approach, could we build
something with less overhead?

Weell, at least it should be non-WAL, and probably non-fsync, at least
optionally . Maybe also inserts inserts + offline aggregator (instead of
updates) to avoid lock contention. Something that collects data in
blocks of local or per-backend shared memory in each backend and then
gives complete blocks to aggregator process. Maybe use 2 alternating
blocks per backend - 1 for ongoing stats collection and another given to
aggregator. this has a little time shift, but will deliver accurate
starts in the end. Things that need up-to-date stats (like
pg_stat_activity), should look (and lock) also the ongoing satas
collection blocks if needed (how do we know know the *if*) and delay
each backend process momentaryly by looking.

-----------------
Hannu

#17

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Qingqing Zhou (#15)

Re: Stats collector performance improvement

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

AFAICS if we can maintain the stats counts solid, then it may hurt
performance dramatically. Think if we maintain
pgstat_count_heap_insert()/pgstat_count_heap_delete() pretty well, then we
get a replacement of count(*).

Not at all. For one thing, the stats don't attempt to maintain
per-transaction state, so they don't have the MVCC issues of count(*).
I'm not suggesting any fundamental changes in what is counted or when.

The two compromises that were made in the original stats design to make
it fast were (1) stats updates lag behind reality, and (2) some updates
may be missed entirely. Now that we have a couple of years' field
experience with the code, it seems that (1) is acceptable for real usage
but (2) not so much. And it's not even clear that we are buying any
performance gain from (2), considering that it's adding the overhead of
passing the data through an extra process.

regards, tom lane

#18

Jan Wieck

JanWieck@Yahoo.com

about 20 years ago

In reply to: Tom Lane (#14)

Re: Stats collector performance improvement

On 1/2/2006 3:20 PM, Tom Lane wrote:

[ moving to -hackers ]

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.
[ ... proposed patch that seems pretty klugy to me ... ]

I wonder whether we shouldn't consider something more drastic, like
getting rid of the intermediate stats buffer process entirely.

The original design for the stats communication code was based on the
premise that it's better to drop data than to make backends wait on

The original design was geared towards searching for useless/missing
indexes and tuning activity like that. This never happened, but instead
people tried to use it as a reliable debugging or access statistics aid
... which is fine but not what it originally was intended for.

So yes, I think looking at what it usually is used for, a message
passing system like SysV message queues (puke) or similar would do a
better job.

Jan

the stats collector. However, as things have turned out I think this
notion is a flop: the people who are using stats at all want the stats
to be reliable. We've certainly seen plenty of gripes from people who
are unhappy that backend-exit messages got dropped, and anyone who's
using autovacuum would really like the tuple update counts to be pretty
solid too.

If we abandoned the unreliable-communication approach, could we build
something with less overhead?

regards, tom lane

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#19

Simon Riggs

simon@2ndquadrant.com

about 20 years ago

In reply to: Tom Lane (#17)

Re: Stats collector performance improvement

On Mon, 2006-01-02 at 16:48 -0500, Tom Lane wrote:

The two compromises that were made in the original stats design to make
it fast were (1) stats updates lag behind reality, and (2) some updates
may be missed entirely. Now that we have a couple of years' field
experience with the code, it seems that (1) is acceptable for real usage
but (2) not so much.

We decided that the stats update had to occur during execution, in case
the statement aborted and row versions were not notified. That means we
must notify things as they happen, yet could use a reliable queuing
system that could suffer a delay in the stats becoming available.

But how often do we lose a backend? Could we simply buffer that a little
better? i.e. don't send message to stats unless we have altered at least
10 rows? So we would buffer based upon the importance of the message,
not the actual size of the message. That way singleton-statements won't
generate the same stats traffic, but we risk losing a buffers worth of
row changes should we crash - everything would still work if we lost a
few small row change notifications.

We can also save lots of cycles on the current statement overhead, which
is currently the worst part of the stats, performance-wise. That
definitely needs redesign. AFAICS we only ever need to know the SQL
statement via the stats system if the statement has been running for
more than a few minutes - the main use case is for an admin to be able
to diagnose a rogue or hung statement. Pushing the statement to stats
every time is just a big overhead. That suggests we should either have a
pull or a deferred push (longer-than-X-secs) approach.

Best Regards, Simon Riggs

#20

Simon Riggs

simon@2ndquadrant.com

about 20 years ago

In reply to: Bruce Momjian (#11)

Re: Stats collector performance improvement

On Mon, 2006-01-02 at 13:40 -0500, Bruce Momjian wrote:

This change reduced the stats_command_string time from 5.5 to 3.9, which
is closer to the 3.5 seconds with stats_command_string off.

Excellent work, port specific or not.

Best Regards, Simon Riggs

#21

Jim C. Nasby

jnasby@pervasive.com

about 20 years ago

In reply to: Simon Riggs (#19)

Re: Stats collector performance improvement

On Tue, Jan 03, 2006 at 09:40:53AM +0000, Simon Riggs wrote:

On Mon, 2006-01-02 at 16:48 -0500, Tom Lane wrote:
We can also save lots of cycles on the current statement overhead, which
is currently the worst part of the stats, performance-wise. That
definitely needs redesign. AFAICS we only ever need to know the SQL
statement via the stats system if the statement has been running for
more than a few minutes - the main use case is for an admin to be able
to diagnose a rogue or hung statement. Pushing the statement to stats
every time is just a big overhead. That suggests we should either have a
pull or a deferred push (longer-than-X-secs) approach.

I would argue that minutes is too long, but of course this could be
user-adjustable. I suspect that even waiting just a second could be a
huge win, since this only matters if you're executing a lot of
statements and you won't be doing that if those statements are taking
more than a second or two to execute.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#22

pgman@candle.pha.pa.us

about 20 years ago

In reply to: Bruce Momjian (#11)

1 attachment(s)

Re: Stats collector performance improvement

Bruce Momjian wrote:

Tom Lane wrote:
A second improvement I discovered is that the statistics collector is
calling gettimeofday() for every packet received, so it can determine
the timeout for the select() call to write the flat file. I removed
that behavior and instead used setitimer() to issue a SIGINT every
500ms, which was the original behavior. This eliminates the
gettimeofday() call and makes the code cleaner. Second patch attached.

I have applied this second patch, with a few small stylistic
improvements.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/rtmp/difftext/plainDownload

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.116
diff -c -c -r1.116 pgstat.c
*** src/backend/postmaster/pgstat.c	2 Jan 2006 00:58:00 -0000	1.116
--- src/backend/postmaster/pgstat.c	3 Jan 2006 16:26:04 -0000
***************
*** 117,123 ****
  
  static long pgStatNumMessages = 0;
  
! static bool pgStatRunningInCollector = FALSE;
  
  /*
   * Place where backends store per-table info to be sent to the collector.
--- 117,123 ----
  
  static long pgStatNumMessages = 0;
  
! static bool pgStatRunningInCollector = false;
  
  /*
   * Place where backends store per-table info to be sent to the collector.
***************
*** 145,150 ****
--- 145,151 ----
  static PgStat_StatBeEntry *pgStatBeTable = NULL;
  static int	pgStatNumBackends = 0;
  
+ static volatile bool	need_statwrite;
  
  /* ----------
   * Local function forward declarations
***************
*** 164,169 ****
--- 165,171 ----
  
  NON_EXEC_STATIC void PgstatBufferMain(int argc, char *argv[]);
  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
+ static void force_statwrite(SIGNAL_ARGS);
  static void pgstat_recvbuffer(void);
  static void pgstat_exit(SIGNAL_ARGS);
  static void pgstat_die(SIGNAL_ARGS);
***************
*** 1548,1560 ****
  	PgStat_Msg	msg;
  	fd_set		rfds;
  	int			readPipe;
- 	int			nready;
  	int			len = 0;
! 	struct timeval timeout;
! 	struct timeval next_statwrite;
! 	bool		need_statwrite;
  	HASHCTL		hash_ctl;
! 
  	MyProcPid = getpid();		/* reset MyProcPid */
  
  	/*
--- 1550,1560 ----
  	PgStat_Msg	msg;
  	fd_set		rfds;
  	int			readPipe;
  	int			len = 0;
! 	struct itimerval timeval;
  	HASHCTL		hash_ctl;
! 	bool		need_timer = false;
! 	
  	MyProcPid = getpid();		/* reset MyProcPid */
  
  	/*
***************
*** 1572,1578 ****
  	/* kluge to allow buffer process to kill collector; FIXME */
  	pqsignal(SIGQUIT, pgstat_exit);
  #endif
! 	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
--- 1572,1578 ----
  	/* kluge to allow buffer process to kill collector; FIXME */
  	pqsignal(SIGQUIT, pgstat_exit);
  #endif
! 	pqsignal(SIGALRM, force_statwrite);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
***************
*** 1597,1613 ****
  	init_ps_display("stats collector process", "", "");
  	set_ps_display("");
  
! 	/*
! 	 * Arrange to write the initial status file right away
! 	 */
! 	gettimeofday(&next_statwrite, NULL);
! 	need_statwrite = TRUE;
  
  	/*
  	 * Read in an existing statistics stats file or initialize the stats to
  	 * zero.
  	 */
! 	pgStatRunningInCollector = TRUE;
  	pgstat_read_statsfile(&pgStatDBHash, InvalidOid, NULL, NULL);
  
  	/*
--- 1597,1613 ----
  	init_ps_display("stats collector process", "", "");
  	set_ps_display("");
  
! 	need_statwrite = true;
! 
! 	MemSet(&timeval, 0, sizeof(struct itimerval));
! 	timeval.it_value.tv_sec = PGSTAT_STAT_INTERVAL / 1000;
! 	timeval.it_value.tv_usec = PGSTAT_STAT_INTERVAL % 1000;
  
  	/*
  	 * Read in an existing statistics stats file or initialize the stats to
  	 * zero.
  	 */
! 	pgStatRunningInCollector = true;
  	pgstat_read_statsfile(&pgStatDBHash, InvalidOid, NULL, NULL);
  
  	/*
***************
*** 1634,1667 ****
  	 */
  	for (;;)
  	{
- 		/*
- 		 * If we need to write the status file again (there have been changes
- 		 * in the statistics since we wrote it last) calculate the timeout
- 		 * until we have to do so.
- 		 */
  		if (need_statwrite)
  		{
! 			struct timeval now;
! 
! 			gettimeofday(&now, NULL);
! 			/* avoid assuming that tv_sec is signed */
! 			if (now.tv_sec > next_statwrite.tv_sec ||
! 				(now.tv_sec == next_statwrite.tv_sec &&
! 				 now.tv_usec >= next_statwrite.tv_usec))
! 			{
! 				timeout.tv_sec = 0;
! 				timeout.tv_usec = 0;
! 			}
! 			else
! 			{
! 				timeout.tv_sec = next_statwrite.tv_sec - now.tv_sec;
! 				timeout.tv_usec = next_statwrite.tv_usec - now.tv_usec;
! 				if (timeout.tv_usec < 0)
! 				{
! 					timeout.tv_sec--;
! 					timeout.tv_usec += 1000000;
! 				}
! 			}
  		}
  
  		/*
--- 1634,1644 ----
  	 */
  	for (;;)
  	{
  		if (need_statwrite)
  		{
! 			pgstat_write_statsfile();
! 			need_statwrite = false;
! 			need_timer = true;
  		}
  
  		/*
***************
*** 1673,1681 ****
  		/*
  		 * Now wait for something to do.
  		 */
! 		nready = select(readPipe + 1, &rfds, NULL, NULL,
! 						(need_statwrite) ? &timeout : NULL);
! 		if (nready < 0)
  		{
  			if (errno == EINTR)
  				continue;
--- 1650,1656 ----
  		/*
  		 * Now wait for something to do.
  		 */
! 		if (select(readPipe + 1, &rfds, NULL, NULL, NULL) < 0)
  		{
  			if (errno == EINTR)
  				continue;
***************
*** 1685,1702 ****
  		}
  
  		/*
- 		 * If there are no descriptors ready, our timeout for writing the
- 		 * stats file happened.
- 		 */
- 		if (nready == 0)
- 		{
- 			pgstat_write_statsfile();
- 			need_statwrite = FALSE;
- 
- 			continue;
- 		}
- 
- 		/*
  		 * Check if there is a new statistics message to collect.
  		 */
  		if (FD_ISSET(readPipe, &rfds))
--- 1660,1665 ----
***************
*** 1813,1829 ****
  			 */
  			pgStatNumMessages++;
  
! 			/*
! 			 * If this is the first message after we wrote the stats file the
! 			 * last time, setup the timeout that it'd be written.
! 			 */
! 			if (!need_statwrite)
  			{
! 				gettimeofday(&next_statwrite, NULL);
! 				next_statwrite.tv_usec += ((PGSTAT_STAT_INTERVAL) * 1000);
! 				next_statwrite.tv_sec += (next_statwrite.tv_usec / 1000000);
! 				next_statwrite.tv_usec %= 1000000;
! 				need_statwrite = TRUE;
  			}
  		}
  
--- 1776,1787 ----
  			 */
  			pgStatNumMessages++;
  
! 			if (need_timer)
  			{
! 				if (setitimer(ITIMER_REAL, &timeval, NULL))
! 					ereport(ERROR,
! 						  (errmsg("unable to set statistics collector timer: %m")));
! 				need_timer = false;
  			}
  		}
  
***************
*** 1848,1853 ****
--- 1806,1818 ----
  }
  
  
+ static void
+ force_statwrite(SIGNAL_ARGS)
+ {
+ 	need_statwrite = true;
+ }
+ 
+ 
  /* ----------
   * pgstat_recvbuffer() -
   *
***************
*** 1865,1871 ****
  	struct timeval timeout;
  	int			writePipe = pgStatPipe[1];
  	int			maxfd;
- 	int			nready;
  	int			len;
  	int			xfr;
  	int			frm;
--- 1830,1835 ----
***************
*** 1907,1912 ****
--- 1871,1884 ----
  	msgbuffer = (char *) palloc(PGSTAT_RECVBUFFERSZ);
  
  	/*
+ 	 * Wait for some work to do; but not for more than 10 seconds. (This
+ 	 * determines how quickly we will shut down after an ungraceful
+ 	 * postmaster termination; so it needn't be very fast.)
+ 	 */
+ 	timeout.tv_sec = 10;
+ 	timeout.tv_usec = 0;
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 1946,1961 ****
  				maxfd = writePipe;
  		}
  
! 		/*
! 		 * Wait for some work to do; but not for more than 10 seconds. (This
! 		 * determines how quickly we will shut down after an ungraceful
! 		 * postmaster termination; so it needn't be very fast.)
! 		 */
! 		timeout.tv_sec = 10;
! 		timeout.tv_usec = 0;
! 
! 		nready = select(maxfd + 1, &rfds, &wfds, NULL, &timeout);
! 		if (nready < 0)
  		{
  			if (errno == EINTR)
  				continue;
--- 1918,1924 ----
  				maxfd = writePipe;
  		}
  
! 		if (select(maxfd + 1, &rfds, &wfds, NULL, &timeout) < 0)
  		{
  			if (errno == EINTR)
  				continue;

#23

Hannu Krosing

hannu@skype.net

about 20 years ago

In reply to: Simon Riggs (#19)

Re: Stats collector performance improvement

Ühel kenal päeval, T, 2006-01-03 kell 09:40, kirjutas Simon Riggs:

We can also save lots of cycles on the current statement overhead, which
is currently the worst part of the stats, performance-wise. That
definitely needs redesign. AFAICS we only ever need to know the SQL
statement via the stats system if the statement has been running for
more than a few minutes - the main use case is for an admin to be able
to diagnose a rogue or hung statement.

Interestingly I use pg_stat_activity view to watch for stuck backends,
"stuck" in the sense that they have not noticed when client want away
and are now waitin the TCP timeout to happen. I query for backends which
have been in "<IDLE>" state for longer than XX seconds. I guess that at
least some kind of indication for this should be available.

Of course this would be much less of a problem if there was a
possibility for sime kind of keepalive system to detect when
client/frontend goes away.

Pushing the statement to stats
every time is just a big overhead. That suggests we should either have a
pull

I could live with "push", where pg_stat_activity would actually ask each
live backend for its "current query". This surely happens less often
than queries are performed (up to few thousand per sec)

-------------
Hannu

#24

Greg Stark

gsstark@mit.edu

about 20 years ago

In reply to: Jim C. Nasby (#21)

Re: Stats collector performance improvement

"Jim C. Nasby" <jnasby@pervasive.com> writes:

I would argue that minutes is too long, but of course this could be
user-adjustable. I suspect that even waiting just a second could be a
huge win, since this only matters if you're executing a lot of
statements and you won't be doing that if those statements are taking
more than a second or two to execute.

That's not necessarily true at all. You could just as easily have a
performance problem caused by a quick statement that is being executed many
times as a slow statement that is being executed few times.

That is, you could be executing dozens of queries that take seconds or minutes
once a second but none of those might be the problem. The problem might be the
query that's taking only 300ms that you're executing hundreds of of times a
minute.

Moreover, if you're not gathering stats for queries that are fast then how
will you know whether they're performing properly when you look at them when
they do show up?

--
greg

#25

http://www.postgresql.org/docs/8.1/interactive/runtime-config-connection.html#GUC-TCP-KEEPALIVES-IDLE

pgman@candle.pha.pa.us

about 20 years ago

In reply to: Bruce Momjian (#11)

2 attachment(s)

Re: Stats collector performance improvement

Bruce Momjian wrote:

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.

...

This sends 10,000 "SELECT 1" queries to the backend, and reports the
execution time. I found that without stats_command_string defined, it
ran in 3.5 seconds. With stats_command_string defined, it took 5.5
seconds, meaning the command string is causing a 57% slowdown. That is
way too much considering that the SELECT 1 has to be send from psql to
the backend, parsed, optimized, and executed, and the result returned to
the psql, while stats_command_string only has to send a string to a
backend collector. There is _no_ way that collector should take 57% of
the time it takes to run the actual query.

I have updated information on this performance issue. It seems it is
the blocking activity of recv() that is slowing down the buffer process
and hence the backends. Basically, I found if I use select() or recv()
to block until data arrives, I see the huge performance loss reported
above. If I loop over the recv() call in non-blocking mode, I see
almost no performance hit from stats_command_string (no backend
slowdown), but of course that consumes all the CPU (bad). What I found
worked perfectly was to do a non-blocking recv(), and if no data was
returned, change the socket to blocking mode and loop back over the
recv(). This allowed for no performance loss, and prevented infinite
looping over the recv() call.

My theory is that the kernel blocking logic of select() or recv() is
somehow locking up the socket for a small amount of time, therefore
slowing down the backend. With the on/off blocking, the packets arrive
in groups, we get a few packets then block when nothing is available.

The test program:

TMPFILE=/tmp/pgstat.sql
export TMPFILE

if [ ! -f $TMPFILE ]
then i=0
while [ $i -lt 10000 ]
do
i=`expr $i + 1`
echo "SELECT 1;"
done > $TMPFILE
fi

time psql test < $TMPFILE >/dev/null

is basically sending 30k packets of roughly 26 bytes each, or roughly
800k in 3.5 seconds, meaning there is a packet every 0.0001 seconds. I
wouldn't have thought that was too much volume for a dual Xeon BSD
machine, but it seems it might be. Tom seeing 44% slowdown from pgbench
means Linux might have an issue too.

Two patches are attached. The first patch shows the use of the on/off
blocking method to have almost zero overhead for reading from the
socket. (The packets are discarded.) The second patch removes the
buffer process entirely and uses the on/off buffering to process the
incoming packets. I tried running two test scripts simultaneously and
saw almost no packet loss. Also keep in mind we are writing the stat
file twice a second, which might need to be pushed into a separate
process.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/stat.testtext/plainDownload

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.118
diff -c -c -r1.118 pgstat.c
*** src/backend/postmaster/pgstat.c	3 Jan 2006 19:54:08 -0000	1.118
--- src/backend/postmaster/pgstat.c	4 Jan 2006 23:22:44 -0000
***************
*** 1839,1845 ****
  	int			msg_recv = 0;	/* next receive index */
  	int			msg_have = 0;	/* number of bytes stored */
  	bool		overflow = false;
! 
  	/*
  	 * Identify myself via ps
  	 */
--- 1839,1847 ----
  	int			msg_recv = 0;	/* next receive index */
  	int			msg_have = 0;	/* number of bytes stored */
  	bool		overflow = false;
! 	bool		is_block_mode = false;
! 	int			cnt = 0, bloops = 0, nbloops = 0;
! 	
  	/*
  	 * Identify myself via ps
  	 */
***************
*** 1870,1875 ****
--- 1872,1921 ----
  	 */
  	msgbuffer = (char *) palloc(PGSTAT_RECVBUFFERSZ);
  
+ 
+ 	while (1)
+ 	{
+ #if 0
+ 		  FD_ZERO(&rfds);
+ 		  FD_ZERO(&wfds);
+ 		  maxfd = -1;
+ 			  FD_SET(pgStatSock, &rfds);
+ 			  maxfd = pgStatSock;
+ 	
+ 		  timeout.tv_sec = 0;
+ 		  timeout.tv_usec = 0;
+ 	
+ 		  select(maxfd + 1, &rfds, &wfds, NULL, &timeout);
+ #endif
+ 	
+ 			  if (is_block_mode)
+ 				  bloops++;
+ 			  else
+ 				  nbloops++;
+ 	
+ 			  len = recv(pgStatSock, (char *) &input_buffer,
+ 						 sizeof(PgStat_Msg), 0);
+ 			  if (len > 0)
+ 				  cnt += len;
+ 	
+ //fprintf(stderr, "len = %d, errno = %d\n", len, errno);
+ 	
+ 			  if (len > 0 && is_block_mode)
+ 			  {
+ 				  pg_set_noblock(pgStatSock);
+ 				  is_block_mode = false;
+ 			  }
+ 			  else if (len < 0 && errno == EAGAIN && !is_block_mode)
+ 			  {
+ 				  pg_set_block(pgStatSock);
+ 				  is_block_mode = true;
+ 			  }
+ //			  if ((bloops + nbloops)  % 1000 == 0)
+ //				  fprintf(stderr, "cnt = %d, len = %d, bloops = %d, nbloops = %d\n", cnt, len, bloops, nbloops);
+ 	}
+ 	
+ 	exit(1);
+ 
  	/*
  	 * Loop forever
  	 */

/pgpatches/stat.nobuffertext/plainDownload

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.118
diff -c -c -r1.118 pgstat.c
*** src/backend/postmaster/pgstat.c	3 Jan 2006 19:54:08 -0000	1.118
--- src/backend/postmaster/pgstat.c	4 Jan 2006 23:06:26 -0000
***************
*** 109,117 ****
   * ----------
   */
  NON_EXEC_STATIC int pgStatSock = -1;
- NON_EXEC_STATIC int pgStatPipe[2] = {-1, -1};
  static struct sockaddr_storage pgStatAddr;
- static pid_t pgStatCollectorPid = 0;
  
  static time_t last_pgstat_start_time;
  
--- 109,115 ----
***************
*** 166,172 ****
  NON_EXEC_STATIC void PgstatBufferMain(int argc, char *argv[]);
  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
  static void force_statwrite(SIGNAL_ARGS);
- static void pgstat_recvbuffer(void);
  static void pgstat_exit(SIGNAL_ARGS);
  static void pgstat_die(SIGNAL_ARGS);
  static void pgstat_beshutdown_hook(int code, Datum arg);
--- 164,169 ----
***************
*** 1491,1536 ****
  	pgstat_parseArgs(argc, argv);
  #endif
  
- 	/*
- 	 * Start a buffering process to read from the socket, so we have a little
- 	 * more time to process incoming messages.
- 	 *
- 	 * NOTE: the process structure is: postmaster is parent of buffer process
- 	 * is parent of collector process.	This way, the buffer can detect
- 	 * collector failure via SIGCHLD, whereas otherwise it wouldn't notice
- 	 * collector failure until it tried to write on the pipe.  That would mean
- 	 * that after the postmaster started a new collector, we'd have two buffer
- 	 * processes competing to read from the UDP socket --- not good.
- 	 */
- 	if (pgpipe(pgStatPipe) < 0)
- 		ereport(ERROR,
- 				(errcode_for_socket_access(),
- 				 errmsg("could not create pipe for statistics buffer: %m")));
- 
  	/* child becomes collector process */
! #ifdef EXEC_BACKEND
! 	pgStatCollectorPid = pgstat_forkexec(STAT_PROC_COLLECTOR);
! #else
! 	pgStatCollectorPid = fork();
! #endif
! 	switch (pgStatCollectorPid)
! 	{
! 		case -1:
! 			ereport(ERROR,
! 					(errmsg("could not fork statistics collector: %m")));
! 
! #ifndef EXEC_BACKEND
! 		case 0:
! 			/* child becomes collector process */
! 			PgstatCollectorMain(0, NULL);
! 			break;
! #endif
! 
! 		default:
! 			/* parent becomes buffer process */
! 			closesocket(pgStatPipe[0]);
! 			pgstat_recvbuffer();
! 	}
  	exit(0);
  }
  
--- 1488,1495 ----
  	pgstat_parseArgs(argc, argv);
  #endif
  
  	/* child becomes collector process */
! 	PgstatCollectorMain(0, NULL);
  	exit(0);
  }
  
***************
*** 1548,1559 ****
  PgstatCollectorMain(int argc, char *argv[])
  {
  	PgStat_Msg	msg;
- 	fd_set		rfds;
- 	int			readPipe;
- 	int			len = 0;
- 	struct itimerval timeval;
  	HASHCTL		hash_ctl;
  	bool		need_timer = false;
  	
  	MyProcPid = getpid();		/* reset MyProcPid */
  
--- 1507,1517 ----
  PgstatCollectorMain(int argc, char *argv[])
  {
  	PgStat_Msg	msg;
  	HASHCTL		hash_ctl;
  	bool		need_timer = false;
+ 	struct itimerval timeval;
+ 	bool		is_block_mode = false;
+ 	int			loops = 0;
  	
  	MyProcPid = getpid();		/* reset MyProcPid */
  
***************
*** 1587,1596 ****
  	pgstat_parseArgs(argc, argv);
  #endif
  
- 	/* Close unwanted files */
- 	closesocket(pgStatPipe[1]);
- 	closesocket(pgStatSock);
- 
  	/*
  	 * Identify myself via ps
  	 */
--- 1545,1550 ----
***************
*** 1626,1791 ****
  	pgStatBeTable = (PgStat_StatBeEntry *)
  		palloc0(sizeof(PgStat_StatBeEntry) * MaxBackends);
  
- 	readPipe = pgStatPipe[0];
- 
  	/*
  	 * Process incoming messages and handle all the reporting stuff until
  	 * there are no more messages.
  	 */
  	for (;;)
  	{
  		if (need_statwrite)
  		{
! 			pgstat_write_statsfile();
  			need_statwrite = false;
  			need_timer = true;
  		}
  
! 		/*
! 		 * Setup the descriptor set for select(2)
! 		 */
! 		FD_ZERO(&rfds);
! 		FD_SET(readPipe, &rfds);
! 
! 		/*
! 		 * Now wait for something to do.
! 		 */
! 		if (select(readPipe + 1, &rfds, NULL, NULL, NULL) < 0)
  		{
! 			if (errno == EINTR)
! 				continue;
! 			ereport(ERROR,
! 					(errcode_for_socket_access(),
! 					 errmsg("select() failed in statistics collector: %m")));
  		}
  
! 		/*
! 		 * Check if there is a new statistics message to collect.
! 		 */
! 		if (FD_ISSET(readPipe, &rfds))
! 		{
! 			/*
! 			 * We may need to issue multiple read calls in case the buffer
! 			 * process didn't write the message in a single write, which is
! 			 * possible since it dumps its buffer bytewise. In any case, we'd
! 			 * need two reads since we don't know the message length
! 			 * initially.
! 			 */
! 			int			nread = 0;
! 			int			targetlen = sizeof(PgStat_MsgHdr);		/* initial */
! 			bool		pipeEOF = false;
  
! 			while (nread < targetlen)
  			{
! 				len = piperead(readPipe, ((char *) &msg) + nread,
! 							   targetlen - nread);
! 				if (len < 0)
! 				{
! 					if (errno == EINTR)
! 						continue;
! 					ereport(ERROR,
! 							(errcode_for_socket_access(),
! 							 errmsg("could not read from statistics collector pipe: %m")));
! 				}
! 				if (len == 0)	/* EOF on the pipe! */
  				{
! 					pipeEOF = true;
! 					break;
! 				}
! 				nread += len;
! 				if (nread == sizeof(PgStat_MsgHdr))
! 				{
! 					/* we have the header, compute actual msg length */
! 					targetlen = msg.msg_hdr.m_size;
! 					if (targetlen < (int) sizeof(PgStat_MsgHdr) ||
! 						targetlen > (int) sizeof(msg))
! 					{
! 						/*
! 						 * Bogus message length implies that we got out of
! 						 * sync with the buffer process somehow. Abort so that
! 						 * we can restart both processes.
! 						 */
! 						ereport(ERROR,
! 							  (errmsg("invalid statistics message length")));
! 					}
  				}
  			}
! 
! 			/*
! 			 * EOF on the pipe implies that the buffer process exited. Fall
! 			 * out of outer loop.
! 			 */
! 			if (pipeEOF)
! 				break;
! 
! 			/*
! 			 * Distribute the message to the specific function handling it.
! 			 */
! 			switch (msg.msg_hdr.m_type)
  			{
! 				case PGSTAT_MTYPE_DUMMY:
! 					break;
  
! 				case PGSTAT_MTYPE_BESTART:
! 					pgstat_recv_bestart((PgStat_MsgBestart *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_BETERM:
! 					pgstat_recv_beterm((PgStat_MsgBeterm *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_TABSTAT:
! 					pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_TABPURGE:
! 					pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_ACTIVITY:
! 					pgstat_recv_activity((PgStat_MsgActivity *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_DROPDB:
! 					pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_RESETCOUNTER:
! 					pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
! 											 nread);
! 					break;
  
! 				case PGSTAT_MTYPE_AUTOVAC_START:
! 					pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_VACUUM:
! 					pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, nread);
! 					break;
  
! 				case PGSTAT_MTYPE_ANALYZE:
! 					pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, nread);
! 					break;
  
! 				default:
! 					break;
! 			}
  
! 			/*
! 			 * Globally count messages.
! 			 */
! 			pgStatNumMessages++;
  
! 			if (need_timer)
! 			{
! 				if (setitimer(ITIMER_REAL, &timeval, NULL))
! 					ereport(ERROR,
! 						  (errmsg("unable to set statistics collector timer: %m")));
! 				need_timer = false;
! 			}
  		}
  
  		/*
  		 * Note that we do NOT check for postmaster exit inside the loop; only
  		 * EOF on the buffer pipe causes us to fall out.  This ensures we
  		 * don't exit prematurely if there are still a few messages in the
--- 1580,1704 ----
  	pgStatBeTable = (PgStat_StatBeEntry *)
  		palloc0(sizeof(PgStat_StatBeEntry) * MaxBackends);
  
  	/*
  	 * Process incoming messages and handle all the reporting stuff until
  	 * there are no more messages.
  	 */
  	for (;;)
  	{
+ 		int nread;
+ 		
  		if (need_statwrite)
  		{
! 			//pgstat_write_statsfile();
  			need_statwrite = false;
  			need_timer = true;
  		}
  
! 		if (need_timer)
  		{
! 			if (setitimer(ITIMER_REAL, &timeval, NULL))
! 				ereport(ERROR,
! 					  (errmsg("unable to set statistics collector timer: %m")));
! 			need_timer = false;
  		}
  
! 		nread = recv(pgStatSock, (char *) &msg,
! 				   sizeof(PgStat_Msg), 0);
  
! 		if (nread > 0 && is_block_mode)	/* got data */
! 		{
! 			pg_set_noblock(pgStatSock);
! 			is_block_mode = false;
! 		}
! 		else if (nread < 0)
! 		{
! 			if (errno == EAGAIN)
  			{
! 				if (!is_block_mode)
  				{
! 					/* no data, block mode */
! 					pg_set_block(pgStatSock);
! 					is_block_mode = true;
  				}
+ 				continue;
  			}
! 			else if (errno == EINTR)
  			{
! 				if (!PostmasterIsAlive(true))
! 					ereport(ERROR,
! 							(errmsg("stats collector exited: %m")));
! 				continue;
! 			}
! 			else
! 				ereport(ERROR,
! 						(errmsg("stats collector exited: %m")));
! 		}
  
! //fprintf(stderr, "nread = %d, type = %d\n", nread, msg.msg_hdr.m_type);
! if (++loops % 1000 == 0)
! 	fprintf(stderr, "loops = %d\n", loops);
  
! 		/*
! 		 * Distribute the message to the specific function handling it.
! 		 */
! 		switch (msg.msg_hdr.m_type)
! 		{
! 			case PGSTAT_MTYPE_DUMMY:
! 				break;
  
! 			case PGSTAT_MTYPE_BESTART:
! 				pgstat_recv_bestart((PgStat_MsgBestart *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_BETERM:
! 				pgstat_recv_beterm((PgStat_MsgBeterm *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_TABSTAT:
! 				pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_TABPURGE:
! 				pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_ACTIVITY:
! 				pgstat_recv_activity((PgStat_MsgActivity *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_DROPDB:
! 				pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_RESETCOUNTER:
! 				pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
! 										 nread);
! 				break;
  
! 			case PGSTAT_MTYPE_AUTOVAC_START:
! 				pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_VACUUM:
! 				pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, nread);
! 				break;
  
! 			case PGSTAT_MTYPE_ANALYZE:
! 				pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, nread);
! 				break;
  
! 			default:
! 				break;
  		}
  
  		/*
+ 		 * Globally count messages.
+ 		 */
+ 		pgStatNumMessages++;
+ 
+ 
+ 		/*
  		 * Note that we do NOT check for postmaster exit inside the loop; only
  		 * EOF on the buffer pipe causes us to fall out.  This ensures we
  		 * don't exit prematurely if there are still a few messages in the
***************
*** 1813,2032 ****
  }
  
  
- /* ----------
-  * pgstat_recvbuffer() -
-  *
-  *	This is the body of the separate buffering process. Its only
-  *	purpose is to receive messages from the UDP socket as fast as
-  *	possible and forward them over a pipe into the collector itself.
-  *	If the collector is slow to absorb messages, they are buffered here.
-  * ----------
-  */
- static void
- pgstat_recvbuffer(void)
- {
- 	fd_set		rfds;
- 	fd_set		wfds;
- 	struct timeval timeout;
- 	int			writePipe = pgStatPipe[1];
- 	int			maxfd;
- 	int			len;
- 	int			xfr;
- 	int			frm;
- 	PgStat_Msg	input_buffer;
- 	char	   *msgbuffer;
- 	int			msg_send = 0;	/* next send index in buffer */
- 	int			msg_recv = 0;	/* next receive index */
- 	int			msg_have = 0;	/* number of bytes stored */
- 	bool		overflow = false;
- 
- 	/*
- 	 * Identify myself via ps
- 	 */
- 	init_ps_display("stats buffer process", "", "");
- 	set_ps_display("");
- 
- 	/*
- 	 * We want to die if our child collector process does.	There are two ways
- 	 * we might notice that it has died: receive SIGCHLD, or get a write
- 	 * failure on the pipe leading to the child.  We can set SIGPIPE to kill
- 	 * us here.  Our SIGCHLD handler was already set up before we forked (must
- 	 * do it that way, else it's a race condition).
- 	 */
- 	pqsignal(SIGPIPE, SIG_DFL);
- 	PG_SETMASK(&UnBlockSig);
- 
- 	/*
- 	 * Set the write pipe to nonblock mode, so that we cannot block when the
- 	 * collector falls behind.
- 	 */
- 	if (!pg_set_noblock(writePipe))
- 		ereport(ERROR,
- 				(errcode_for_socket_access(),
- 				 errmsg("could not set statistics collector pipe to nonblocking mode: %m")));
- 
- 	/*
- 	 * Allocate the message buffer
- 	 */
- 	msgbuffer = (char *) palloc(PGSTAT_RECVBUFFERSZ);
- 
- 	/*
- 	 * Loop forever
- 	 */
- 	for (;;)
- 	{
- 		FD_ZERO(&rfds);
- 		FD_ZERO(&wfds);
- 		maxfd = -1;
- 
- 		/*
- 		 * As long as we have buffer space we add the socket to the read
- 		 * descriptor set.
- 		 */
- 		if (msg_have <= (int) (PGSTAT_RECVBUFFERSZ - sizeof(PgStat_Msg)))
- 		{
- 			FD_SET(pgStatSock, &rfds);
- 			maxfd = pgStatSock;
- 			overflow = false;
- 		}
- 		else
- 		{
- 			if (!overflow)
- 			{
- 				ereport(LOG,
- 						(errmsg("statistics buffer is full")));
- 				overflow = true;
- 			}
- 		}
- 
- 		/*
- 		 * If we have messages to write out, we add the pipe to the write
- 		 * descriptor set.
- 		 */
- 		if (msg_have > 0)
- 		{
- 			FD_SET(writePipe, &wfds);
- 			if (writePipe > maxfd)
- 				maxfd = writePipe;
- 		}
- 
- 		/*
- 		 * Wait for some work to do; but not for more than 10 seconds. (This
- 		 * determines how quickly we will shut down after an ungraceful
- 		 * postmaster termination; so it needn't be very fast.)  struct timeout
- 		 * is modified by some operating systems.
- 		 */
- 		timeout.tv_sec = 10;
- 		timeout.tv_usec = 0;
- 
- 		if (select(maxfd + 1, &rfds, &wfds, NULL, &timeout) < 0)
- 		{
- 			if (errno == EINTR)
- 				continue;
- 			ereport(ERROR,
- 					(errcode_for_socket_access(),
- 					 errmsg("select() failed in statistics buffer: %m")));
- 		}
- 
- 		/*
- 		 * If there is a message on the socket, read it and check for
- 		 * validity.
- 		 */
- 		if (FD_ISSET(pgStatSock, &rfds))
- 		{
- 			len = recv(pgStatSock, (char *) &input_buffer,
- 					   sizeof(PgStat_Msg), 0);
- 			if (len < 0)
- 				ereport(ERROR,
- 						(errcode_for_socket_access(),
- 						 errmsg("could not read statistics message: %m")));
- 
- 			/*
- 			 * We ignore messages that are smaller than our common header
- 			 */
- 			if (len < sizeof(PgStat_MsgHdr))
- 				continue;
- 
- 			/*
- 			 * The received length must match the length in the header
- 			 */
- 			if (input_buffer.msg_hdr.m_size != len)
- 				continue;
- 
- 			/*
- 			 * O.K. - we accept this message.  Copy it to the circular
- 			 * msgbuffer.
- 			 */
- 			frm = 0;
- 			while (len > 0)
- 			{
- 				xfr = PGSTAT_RECVBUFFERSZ - msg_recv;
- 				if (xfr > len)
- 					xfr = len;
- 				Assert(xfr > 0);
- 				memcpy(msgbuffer + msg_recv,
- 					   ((char *) &input_buffer) + frm,
- 					   xfr);
- 				msg_recv += xfr;
- 				if (msg_recv == PGSTAT_RECVBUFFERSZ)
- 					msg_recv = 0;
- 				msg_have += xfr;
- 				frm += xfr;
- 				len -= xfr;
- 			}
- 		}
- 
- 		/*
- 		 * If the collector is ready to receive, write some data into his
- 		 * pipe.  We may or may not be able to write all that we have.
- 		 *
- 		 * NOTE: if what we have is less than PIPE_BUF bytes but more than the
- 		 * space available in the pipe buffer, most kernels will refuse to
- 		 * write any of it, and will return EAGAIN.  This means we will
- 		 * busy-loop until the situation changes (either because the collector
- 		 * caught up, or because more data arrives so that we have more than
- 		 * PIPE_BUF bytes buffered).  This is not good, but is there any way
- 		 * around it?  We have no way to tell when the collector has caught
- 		 * up...
- 		 */
- 		if (FD_ISSET(writePipe, &wfds))
- 		{
- 			xfr = PGSTAT_RECVBUFFERSZ - msg_send;
- 			if (xfr > msg_have)
- 				xfr = msg_have;
- 			Assert(xfr > 0);
- 			len = pipewrite(writePipe, msgbuffer + msg_send, xfr);
- 			if (len < 0)
- 			{
- 				if (errno == EINTR || errno == EAGAIN)
- 					continue;	/* not enough space in pipe */
- 				ereport(ERROR,
- 						(errcode_for_socket_access(),
- 				errmsg("could not write to statistics collector pipe: %m")));
- 			}
- 			/* NB: len < xfr is okay */
- 			msg_send += len;
- 			if (msg_send == PGSTAT_RECVBUFFERSZ)
- 				msg_send = 0;
- 			msg_have -= len;
- 		}
- 
- 		/*
- 		 * Make sure we forwarded all messages before we check for postmaster
- 		 * termination.
- 		 */
- 		if (msg_have != 0 || FD_ISSET(pgStatSock, &rfds))
- 			continue;
- 
- 		/*
- 		 * If the postmaster has terminated, we die too.  (This is no longer
- 		 * the normal exit path, however.)
- 		 */
- 		if (!PostmasterIsAlive(true))
- 			exit(0);
- 	}
- }
- 
  /* SIGQUIT signal handler for buffer process */
  static void
  pgstat_exit(SIGNAL_ARGS)
--- 1726,1731 ----
***************
*** 2049,2054 ****
--- 1748,1754 ----
  	exit(0);
  }
  
+ 
  /* SIGCHLD signal handler for buffer process */
  static void
  pgstat_die(SIGNAL_ARGS)

#26

Greg Stark

gsstark@mit.edu

about 20 years ago

In reply to: Hannu Krosing (#23)

Re: Stats collector performance improvement

Hannu Krosing <hannu@skype.net> writes:

Interestingly I use pg_stat_activity view to watch for stuck backends,
"stuck" in the sense that they have not noticed when client want away
and are now waitin the TCP timeout to happen. I query for backends which
have been in "<IDLE>" state for longer than XX seconds. I guess that at
least some kind of indication for this should be available.

You mean like the tcp_keepalives_idle option?

--
greg

#27

Hannu Krosing

hannu@skype.net

about 20 years ago

In reply to: Greg Stark (#26)

Re: Stats collector performance improvement

Ühel kenal päeval, P, 2006-01-08 kell 11:49, kirjutas Greg Stark:

Hannu Krosing <hannu@skype.net> writes:

Interestingly I use pg_stat_activity view to watch for stuck backends,
"stuck" in the sense that they have not noticed when client want away
and are now waitin the TCP timeout to happen. I query for backends which
have been in "<IDLE>" state for longer than XX seconds. I guess that at
least some kind of indication for this should be available.

You mean like the tcp_keepalives_idle option?

http://www.postgresql.org/docs/8.1/interactive/runtime-config-connection.html#GUC-TCP-KEEPALIVES-IDLE

Kind of, only I'd like to be able to set timeouts less than 120 minutes.

from:
http://developer.apple.com/documentation/mac/NetworkingOT/NetworkingWOT-390.html#HEADING390-0

kp_timeout
Set the requested timeout value, in minutes. Specify a value of
T_UNSPEC to use the default value. You may specify any positive
value for this field of 120 minutes or greater. The timeout
value is not an absolute requirement; if you specify a value
less than 120 minutes, TCP will renegotiate a timeout of 120
minutes.

-----------
Hannu

#28

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Bruce Momjian (#11)

Test request for Stats collector performance improvement

Would some people please run the attached test procedure and report back
the results? I basically need to know the patch is an improvement on
more platforms than just my own. Thanks

---------------------------------------------------------------------------

Run this script and record the time reported:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/stat.script

Modify postgresql.conf:

stats_command_string = on

and reload the server. Do "SELECT * FROM pg_stat_activity;" to verify
the command string is enabled. You should see your query in the
"current query" column.

Rerun the stat.script again and record the time.

Apply this patch to CVS HEAD:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/stat.nobuffer

Run the stat.script again and record the time.

Report via email your three times and your platform.

If the patch worked, the first and third times will be similar, and
the second time will be high.

---------------------------------------------------------------------------

Bruce Momjian wrote:

Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Further tests show that for this application
the killer is stats_command_string, not stats_block_level or
stats_row_level.

I tried it with pgbench -c 10, and got these results:
41% reduction in TPS rate for stats_command_string
9% reduction in TPS rate for stats_block/row_level (any combination)

strace'ing a backend confirms my belief that stats_block/row_level send
just one stats message per transaction (at least for the relatively
small number of tables touched per transaction by pgbench). However
stats_command_string sends 14(!) --- there are seven commands per
pgbench transaction and each results in sending a <command> message and
later an <IDLE> message.

Given the rather lackadaisical way in which the stats collector makes
the data available, it seems like the backends are being much too
enthusiastic about posting their stats_command_string status
immediately. Might be worth thinking about how to cut back the
overhead by suppressing some of these messages.

I did some research on this because the numbers Tom quotes indicate there
is something wrong in the way we process stats_command_string
statistics.

I made a small test script:

if [ ! -f /tmp/pgstat.sql ]
then i=0
while [ $i -lt 10000 ]
do
i=`expr $i + 1`
echo "SELECT 1;"
done > /tmp/pgstat.sql
fi

time psql test </tmp/pgstat.sql >/dev/null

This sends 10,000 "SELECT 1" queries to the backend, and reports the
execution time. I found that without stats_command_string defined, it
ran in 3.5 seconds. With stats_command_string defined, it took 5.5
seconds, meaning the command string is causing a 57% slowdown. That is
way too much considering that the SELECT 1 has to be send from psql to
the backend, parsed, optimized, and executed, and the result returned to
the psql, while stats_command_string only has to send a string to a
backend collector. There is _no_ way that collector should take 57% of
the time it takes to run the actual query.

With the test program, I tried various options. The basic code we have
sends a UDP packet to a statistics buffer process, which recv()'s the
packet, puts it into a memory queue buffer, and writes it to a pipe()
that is read by the statistics collector process which processes the
packet.

I tried various ways of speeding up the buffer and collector processes.
I found if I put a pg_usleep(100) in the buffer process the backend
speed was good, but packets were lost. What I found worked well was to
do multiple recv() calls in a loop. The previous code did a select(),
then perhaps a recv() and pipe write() based on the results of the
select(). This caused many small packets to be written to the pipe and
the pipe write overhead seems fairly large. The best fix I found was to
loop over the recv() call at most 25 times, collecting a group of
packets that can then be sent to the collector in one pipe write. The
recv() socket is non-blocking, so a zero return indicates there are no
more packets available. Patch attached.

This change reduced the stats_command_string time from 5.5 to 3.9, which
is closer to the 3.5 seconds with stats_command_string off.

A second improvement I discovered is that the statistics collector is
calling gettimeofday() for every packet received, so it can determine
the timeout for the select() call to write the flat file. I removed
that behavior and instead used setitimer() to issue a SIGINT every
500ms, which was the original behavior. This eliminates the
gettimeofday() call and makes the code cleaner. Second patch attached.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#29

zhouqq@cs.toronto.edu

over 19 years ago

In reply to: Bruce Momjian (#11)

Re: Test request for Stats collector performance improvement

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote

Would some people please run the attached test procedure and report back
the results? I basically need to know the patch is an improvement on
more platforms than just my own. Thanks

Obviously it matches your expectation.

uname: Linux amd64 2.6.9-5.13smp #1 SMP Wed Aug 10 10:55:44 CST 2005 x86_64
x86_64 x86_64 GNU/Linux
compiler: gcc (GCC) 3.4.3 20041212
configure: '--prefix=/home/qqzhou/pginstall'

--Before patch --
real 0m1.149s
user 0m0.182s
sys 0m0.122s

real 0m1.121s
user 0m0.173s
sys 0m0.103s

real 0m1.128s
user 0m0.116s
sys 0m0.092s

-- After patch --

real 0m1.275s
user 0m0.097s
sys 0m0.160s

real 0m4.063s
user 0m0.663s
sys 0m0.377s

real 0m1.259s
user 0m0.073s
sys 0m0.160s

#30

Larry Rosenman

ler@lerctr.org

over 19 years ago

In reply to: Bruce Momjian (#28)

Re: Test request for Stats collector performance improvement

Bruce Momjian wrote:

Would some people please run the attached test procedure and report
back the results? I basically need to know the patch is an
improvement on more platforms than just my own. Thanks

---------------------------------------------------------------------------

[snip]
FreeBSD thebighonker.lerctr.org 6.1-STABLE FreeBSD 6.1-STABLE #60: Mon Jun
12 16:55:31 CDT 2006
root@thebighonker.lerctr.org:/usr/obj/usr/src/sys/THEBIGHONKER amd64
$
with all stats on, except command string, cvs HEAD, no other patch:

$ sh stat.script
1.92 real 0.35 user 0.42 sys
$
# same as above, with command_string on.

$ sh stat.script
2.51 real 0.34 user 0.45 sys
$
#with patch and command_string ON.
$ sh stat.script
2.37 real 0.35 user 0.34 sys
$
The above uname is for a very current RELENG_6 FreeBSD. This was done
on a dual-xeon in 64-bit mode. HTT *IS* enabled.

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3683 US

#31

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Qingqing Zhou (#29)

Re: Test request for Stats collector performance improvement

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

Obviously it matches your expectation.

Hm? I don't see any improvement there:

--Before patch --
real 0m1.149s
real 0m1.121s
real 0m1.128s

-- After patch --
real 0m1.275s
real 0m4.063s
real 0m1.259s

regards, tom lane

#32

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Tom Lane (#31)

Re: Test request for Stats collector performance improvement

Tom Lane wrote:

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

Obviously it matches your expectation.

Hm? I don't see any improvement there:

--Before patch --
real 0m1.149s
real 0m1.121s
real 0m1.128s

-- After patch --
real 0m1.275s
real 0m4.063s
real 0m1.259s

The report is incomplete. I need three outputs:

stats off
stats on
stats on, patched

He only reported two sets of results.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#33

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Bruce Momjian (#28)

Re: Test request for Stats collector performance improvement

Bruce Momjian wrote:

Would some people please run the attached test procedure and report back
the results? I basically need to know the patch is an improvement on
more platforms than just my own. Thanks

OpenBSD 3.9-current/x86:

without stats:
0m6.79s real 0m1.56s user 0m1.12s system

-HEAD + stats:
0m10.44s real 0m2.26s user 0m1.22s system

-HEAD + stats + patch:
0m10.68s real 0m2.16s user 0m1.36s system

Stefan

#34

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Bruce Momjian (#28)

Re: Test request for Stats collector performance improvement

Bruce Momjian wrote:

Would some people please run the attached test procedure and report back
the results? I basically need to know the patch is an improvement on
more platforms than just my own. Thanks

Debian Sarge/AMD64 Kernel 2.6.16.16 (all tests done multiple times with
variation of less then 10%):

-HEAD:

real 0m0.486s
user 0m0.064s
sys 0m0.048s

-HEAD with 100000 "SELECT 1;" queries:

real 0m4.763s
user 0m0.896s
sys 0m1.232s

-HEAD + stats:

real 0m0.720s
user 0m0.128s
sys 0m0.096s

-HEAD + stats (100k):

real 0m7.204s
user 0m1.504s
sys 0m1.028s

-HEAD + stats + patch:

there is something weird going on here - I get either runtimes like:

real 0m0.729s
user 0m0.092s
sys 0m0.100s

and occasionally:

real 0m3.926s
user 0m0.144s
sys 0m0.140s

(always ~0,7 vs ~4 seconds - same variation as Qingqing Zhou seems to see)

-HEAD + stats + patch(100k):

similiar variation with:

real 0m7.955s
user 0m1.124s
sys 0m1.164s

and

real 0m11.836s
user 0m1.368s
sys 0m1.156s

(ie 7-8 seconds vs 11-12 seconds)

looks like this patch is actually a loss on that box.

Stefan

#35

Josh Berkus

josh@agliodbs.com

over 19 years ago

In reply to: Bruce Momjian (#32)

Re: Test request for Stats collector performance improvement

Bruce,

The report is incomplete. I need three outputs:

stats off
stats on
stats on, patched

He only reported two sets of results.

You need stats off, patched too.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#36

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Josh Berkus (#35)

Re: Test request for Stats collector performance improvement

Josh Berkus <josh@agliodbs.com> writes:

You need stats off, patched too.

Shouldn't really be necessary, as the code being patched won't be
executed if stats aren't being collected...

regards, tom lane

#37

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Josh Berkus (#35)

Re: Test request for Stats collector performance improvement

Josh Berkus wrote:

Bruce,

The report is incomplete. I need three outputs:

stats off
stats on
stats on, patched

He only reported two sets of results.

You need stats off, patched too.

No need --- stats off, patched too, should be the same as stats off, no
patch.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#38

zhouqq@cs.toronto.edu

over 19 years ago

In reply to: Bruce Momjian (#11)

Re: Test request for Stats collector performance improvement

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Hm? I don't see any improvement there:

I was referening this sentence, though I am not sure why that's the
expectation:

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote
If the patch worked, the first and third times will be similar, and
the second time will be high.

-- After patch --

real 0m1.275s
user 0m0.097s
sys 0m0.160s

real 0m4.063s
user 0m0.663s
sys 0m0.377s

real 0m1.259s
user 0m0.073s
sys 0m0.160s

#39

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Qingqing Zhou (#38)

Re: Test request for Stats collector performance improvement

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Hm? I don't see any improvement there:

I was referening this sentence, though I am not sure why that's the
expectation:

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote
If the patch worked, the first and third times will be similar, and
the second time will be high.

You need to label your results more clearly then. I thought you were
showing us three repeats of the same test, and I gather Bruce thought
so too...

regards, tom lane

#40

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Qingqing Zhou (#38)

Re: Test request for Stats collector performance improvement

Qingqing Zhou wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Hm? I don't see any improvement there:

I was referening this sentence, though I am not sure why that's the
expectation:

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote
If the patch worked, the first and third times will be similar, and
the second time will be high.

I meant that the non-stats and the patched stats should be the similar,
and the stats without the patch (the second test) should be high.

-- After patch --

real 0m1.275s
user 0m0.097s
sys 0m0.160s

real 0m4.063s
user 0m0.663s
sys 0m0.377s

real 0m1.259s
user 0m0.073s
sys 0m0.160s

I assume the above is just running the same test three times, right?

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#41

zhouqq@cs.toronto.edu

over 19 years ago

In reply to: Qingqing Zhou (#38)

Re: Test request for Stats collector performance improvement

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote

-- After patch --

real 0m1.275s
user 0m0.097s
sys 0m0.160s

real 0m4.063s
user 0m0.663s
sys 0m0.377s

real 0m1.259s
user 0m0.073s
sys 0m0.160s

I assume the above is just running the same test three times, right?

Right -- it is the result of the patched CVS tip runing three times with
stats_command_string = on. And the tests marked "--Before patch--" is the
result of CVS tip running three times with stats_command_string = on.

Regards,
Qingqing

#42

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Qingqing Zhou (#41)

Re: Test request for Stats collector performance improvement

Qingqing Zhou wrote:

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote

-- After patch --

real 0m1.275s
user 0m0.097s
sys 0m0.160s

real 0m4.063s
user 0m0.663s
sys 0m0.377s

real 0m1.259s
user 0m0.073s
sys 0m0.160s

I assume the above is just running the same test three times, right?

Right -- it is the result of the patched CVS tip runing three times with
stats_command_string = on. And the tests marked "--Before patch--" is the
result of CVS tip running three times with stats_command_string = on.

Any idea why there is such a variance in the result? The second run
looks quite slow.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#43

zhouqq@cs.toronto.edu

over 19 years ago

In reply to: Qingqing Zhou (#41)

Re: Test request for Stats collector performance improvement

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote

Any idea why there is such a variance in the result? The second run
looks quite slow.

No luck so far. It is quite repeatble in my machine -- runing times which
show a long execution time: 2, 11, 14, 21 ... But when I do strace, the
weiredness disappered totally. Have we seen any strange things like this
before?

Regards,
Qingqing

#44

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Qingqing Zhou (#43)

Re: Test request for Stats collector performance improvement

OK, based on reports I have seen, generally stats_query_string adds 50%
to the total runtime of a "SELECT 1" query, and the patch reduces the
overhead to 25%.

However, that 25% is still much too large. Consider that "SELECT 1" has
to travel from psql to the server, go through the
parser/optimizer/executor, and then return, it is clearly wrong that the
stats_query_string performance hit should be measurable.

I am actually surprised that so few people in the community are
concerned about this. While we have lots of people studying large
queries, these small queries should also get attention from a
performance perspective.

I have created a new test that also turns off writing of the stats file.
This will not pass regression tests, but it will show the stats write
overhead.

Updated test to be run:

---------------------------------------------------------------------------

1) Run this script and record the time reported:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/stat.script

It should take only a few seconds.

2) Modify postgresql.conf:

stats_command_string = on

and reload the server. Do "SELECT * FROM pg_stat_activity;" to verify
the command string is enabled. You should see your query in the
"current query" column.

3) Rerun the stat.script again and record the time.

4) Apply this patch to CVS HEAD:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/stat.nobuffer

5) Run the stat.script again and record the time.

6) Revert the patch and apply this patch to CVS HEAD:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/stat.nobuffer_nowrite

7) Run the stat.script again and record the time.

8) Report the four results and your platform via email to
pgman@candle.pha.pa.us. Label times:

stats_command_string = off
stats_command_string = on
stat.nobuffer patch
stat.nobuffer_nowrite patch

---------------------------------------------------------------------------

Qingqing Zhou wrote:

"Bruce Momjian" <pgman@candle.pha.pa.us> wrote

Any idea why there is such a variance in the result? The second run
looks quite slow.

No luck so far. It is quite repeatble in my machine -- runing times which
show a long execution time: 2, 11, 14, 21 ... But when I do strace, the
weiredness disappered totally. Have we seen any strange things like this
before?

Regards,
Qingqing

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#45

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Bruce Momjian (#44)

Re: Test request for Stats collector performance improvement

Bruce Momjian wrote:

OK, based on reports I have seen, generally stats_query_string adds 50%
to the total runtime of a "SELECT 1" query, and the patch reduces the
overhead to 25%.

that is actually not true for both of the platforms(a slow OpenBSD
3.9/x86 and a very fast Linux/x86_64) I tested on. Both of them show
virtually no improvement with the patch and even worst it causes
considerable (negative) variance on at least the Linux box.

However, that 25% is still much too large. Consider that "SELECT 1" has
to travel from psql to the server, go through the
parser/optimizer/executor, and then return, it is clearly wrong that the
stats_query_string performance hit should be measurable.

I am actually surprised that so few people in the community are
concerned about this. While we have lots of people studying large
queries, these small queries should also get attention from a
performance perspective.

I have created a new test that also turns off writing of the stats file.
This will not pass regression tests, but it will show the stats write
overhead.

will try to run those too in a few.

Stefan

#46

pgman@candle.pha.pa.us

over 19 years ago

In reply to: Stefan Kaltenbrunner (#45)

Re: Test request for Stats collector performance improvement

Stefan Kaltenbrunner wrote:

Bruce Momjian wrote:

OK, based on reports I have seen, generally stats_query_string adds 50%
to the total runtime of a "SELECT 1" query, and the patch reduces the
overhead to 25%.

that is actually not true for both of the platforms(a slow OpenBSD
3.9/x86 and a very fast Linux/x86_64) I tested on. Both of them show
virtually no improvement with the patch and even worst it causes
considerable (negative) variance on at least the Linux box.

I see the results I suggested on OpenBSD that you reported.

OpenBSD 3.9-current/x86:

without stats:
0m6.79s real 0m1.56s user 0m1.12s system

-HEAD + stats:
0m10.44s real 0m2.26s user 0m1.22s system

-HEAD + stats + patch:
0m10.68s real 0m2.16s user 0m1.36s system

and I got similar results reported from a Debian:

Linux 2.6.16 on a single processor HT 2.8Ghz Pentium compiled
with gcc 4.0.4.

real 0m3.306s
real 0m4.905s
real 0m4.448s

I am unclear on the cuase for the widely varying results you saw in
Debian.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#47

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Bruce Momjian (#46)

Re: Test request for Stats collector performance improvement

Bruce Momjian wrote:

Stefan Kaltenbrunner wrote:

Bruce Momjian wrote:

OK, based on reports I have seen, generally stats_query_string adds 50%
to the total runtime of a "SELECT 1" query, and the patch reduces the
overhead to 25%.

that is actually not true for both of the platforms(a slow OpenBSD
3.9/x86 and a very fast Linux/x86_64) I tested on. Both of them show
virtually no improvement with the patch and even worst it causes
considerable (negative) variance on at least the Linux box.

I see the results I suggested on OpenBSD that you reported.

OpenBSD 3.9-current/x86:

without stats:
0m6.79s real 0m1.56s user 0m1.12s system

-HEAD + stats:
0m10.44s real 0m2.26s user 0m1.22s system

-HEAD + stats + patch:
0m10.68s real 0m2.16s user 0m1.36s system

yep those are very stable even over a large number of runs

and I got similar results reported from a Debian:

Linux 2.6.16 on a single processor HT 2.8Ghz Pentium compiled
with gcc 4.0.4.

real 0m3.306s
real 0m4.905s
real 0m4.448s

I am unclear on the cuase for the widely varying results you saw in
Debian.

I can reproduce the widely varying results on a number of x86 and x86_64
based Linux boxes here (Debian,Fedora and CentOS) though I cannot
reproduce it on a Fedora core 5/ppc box.
All the x86 boxes are SMP - while the ppc one is not - that might have
some influence on the results.

Stefan

#48