pg_test_timing tool for EXPLAIN ANALYZE overhead

Started by Greg Smith, 7 messages
#1 Greg Smith
greg@2ndQuadrant.com
1 attachment

Attached is a feature extracted from the Ants Aasma "add timing of
buffer I/O requests" submission. That included a tool to measure timing
overhead, from gettimeofday or whatever else INSTR_TIME_SET_CURRENT
happens to call. That's what I've broken out here; it's a broader topic
than just buffer timing.
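
For reference, on Unix that wrapper boils down to a gettimeofday() call.
Here is a simplified sketch of the idea behind
src/include/portability/instr_time.h--the real header has more macros,
plus a separate Windows implementation, so treat this as illustrative
rather than the actual code:

/* Simplified sketch; see src/include/portability/instr_time.h */
#include <sys/time.h>

typedef struct timeval instr_time;

#define INSTR_TIME_SET_CURRENT(t)	gettimeofday(&(t), NULL)

/* uint64 here is PostgreSQL's typedef for a 64-bit unsigned integer */
#define INSTR_TIME_GET_MICROSEC(t) \
	(((uint64) (t).tv_sec * 1000000) + (uint64) (t).tv_usec)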

I fixed some trivial bugs and cleaned up the output of the program, then
wrote a full documentation section for it. After that review, I think
this could be ready to commit...with a big-picture caveat below. Code-wise,
I am mainly concerned about its portability, and that's not an area
where we often get good review help. The tool is based on pg_test_fsync.
Perhaps Bruce might want to take a look at this, to see if any of the
changes he's recently made to pg_test_fsync impact what this utility
should do? He might be able to flesh out the FreeBSD examples too.

As for why this whole topic is important: I've found that the results of
this new pg_test_timing track quite well with systems where EXPLAIN
ANALYZE timing overhead is large. As such, it fills a gap in the existing
docs, where that possibility is raised but no way is given to measure
it--nor to determine how to improve it. I expect concerns about how
large timing overhead is to come up more often with future features,
the first example being the rest of Ants's own submission.

A look back on this now that I'm done with it does raise one large
question though. I added some examples of how to measure timing
overhead using psql. While I like the broken-down timing data this
utility provides, I'm not sure it's worth adding a contrib module just
to get it. An extension that's packaged on something like PGXN and easy
to obtain? Absolutely--but maybe that's a developer-only thing. Maybe
the only code worth distributing is the little SQL example of how to
measure the overhead, along with some reference good/bad numbers. That
plus the intro to timer trivia could turn this into a documentation
section only, no code change. I've dreamed of running something like
this on every system in the build farm. Even if that's a valuable
exercise, it may only be worth doing once, then reverting.

Anyway, the patch now includes several examples and a short primer
on PC clock hardware, to help guide what good results look like and why
they've been impossible to obtain in the past. That's a bit
Linux-centric, but the hardware described covers almost all systems
using Intel or AMD processors. The only difference among most other
operating systems is how aggressively they have adopted newer timer
hardware. At least this gives a way to measure all of them.

Some references used to put together the clock source tutorial:

Microsoft's intro to HPET:
http://msdn.microsoft.com/en-us/windows/hardware/gg463347
Notes on effective clock resolution:
http://elinux.org/System_Tap_Timestamp_Notes
VMware clock history and impact on VMs:
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf
VMware timer suggestions for various Linux versions:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachments:

pg_test_timing-v3.patch (text/x-patch)
diff --git a/contrib/Makefile b/contrib/Makefile
index ac0a80a..d230451 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 		pg_standby	\
 		pg_stat_statements \
 		pg_test_fsync	\
+		pg_test_timing	\
 		pg_trgm		\
 		pg_upgrade	\
 		pg_upgrade_support \
diff --git a/contrib/pg_test_timing/Makefile b/contrib/pg_test_timing/Makefile
new file mode 100644
index 0000000..b8b266a
--- /dev/null
+++ b/contrib/pg_test_timing/Makefile
@@ -0,0 +1,18 @@
+# contrib/pg_test_timing/Makefile
+
+PGFILEDESC = "pg_test_timing - test timing overhead"
+PGAPPICON = win32
+
+PROGRAM  = pg_test_timing
+OBJS = pg_test_timing.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_test_timing
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_test_timing/pg_test_timing.c b/contrib/pg_test_timing/pg_test_timing.c
new file mode 100644
index 0000000..4ff0455
--- /dev/null
+++ b/contrib/pg_test_timing/pg_test_timing.c
@@ -0,0 +1,162 @@
+/*
+ *	pg_test_timing.c
+ *		tests overhead of timing calls and their monotonicity: that
+ *		they always move forward
+ */
+
+#include "postgres_fe.h"
+
+#include "getopt_long.h"
+#include "portability/instr_time.h"
+
+static const char *progname;
+
+static int32 test_duration = 3;
+
+static void handle_args(int argc, char *argv[]);
+static void test_timing(int32);
+
+int
+main(int argc, char *argv[])
+{
+	progname = get_progname(argv[0]);
+
+	handle_args(argc, argv);
+
+	test_timing(test_duration);
+
+	return 0;
+}
+
+static void
+handle_args(int argc, char *argv[])
+{
+	static struct option long_options[] = {
+		{"duration", required_argument, NULL, 'd'},
+		{NULL, 0, NULL, 0}
+	};
+	int option;			/* Command line option */
+	int	optindex = 0;	/* used by getopt_long */
+
+	if (argc > 1)
+	{
+		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-h") == 0 ||
+			strcmp(argv[1], "-?") == 0)
+		{
+			printf("Usage: %s [-d DURATION]\n", progname);
+			exit(0);
+		}
+		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
+		{
+			puts("pg_test_timing (PostgreSQL) " PG_VERSION);
+			exit(0);
+		}
+	}
+
+	while ((option = getopt_long(argc, argv, "d:",
+								 long_options, &optindex)) != -1)
+	{
+		switch (option)
+		{
+			case 'd':
+				test_duration = atoi(optarg);
+				break;
+
+			default:
+				fprintf(stderr, "Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+				break;
+		}
+	}
+
+	if (argc > optind)
+	{
+		fprintf(stderr,
+				"%s: too many command-line arguments (first is \"%s\")\n",
+				progname, argv[optind]);
+		fprintf(stderr, "Try \"%s --help\" for more information.\n",
+				progname);
+		exit(1);
+	}
+
+	if (test_duration > 0)
+	{
+		printf("Testing timing overhead for %d seconds.\n", test_duration);
+	}
+	else
+	{
+		fprintf(stderr,
+				"%s: duration must be a positive integer (duration is \"%d\")\n",
+				progname, test_duration);
+		fprintf(stderr, "Try \"%s --help\" for more information.\n",
+				progname);
+		exit(1);
+	}
+}
+
+static void
+test_timing(int32 duration)
+{
+	uint64 total_time;
+	int64 time_elapsed = 0;
+	uint64 loop_count = 0;
+	uint64 prev, cur;
+	int32 diff, i, bits, found;
+
+	instr_time start_time, end_time, temp;
+
+	static int64 histogram[32];
+
+	total_time = duration > 0 ? duration * 1000000 : 0;
+
+	INSTR_TIME_SET_CURRENT(start_time);
+	cur = INSTR_TIME_GET_MICROSEC(start_time);
+
+	while (time_elapsed < total_time)
+	{
+		prev = cur;
+		INSTR_TIME_SET_CURRENT(temp);
+		cur = INSTR_TIME_GET_MICROSEC(temp);
+		diff = cur - prev;
+
+		if (diff < 0)
+		{
+			printf("Detected clock going backwards in time.\n");
+			printf("Time warp: %d microseconds\n", diff);
+			exit(1);
+		}
+
+		bits = 0;
+		while (diff)
+		{
+			diff >>= 1;
+			bits++;
+		}
+		histogram[bits]++;
+
+		loop_count++;
+		INSTR_TIME_SUBTRACT(temp, start_time);
+		time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
+	}
+
+	INSTR_TIME_SET_CURRENT(end_time);
+
+	INSTR_TIME_SUBTRACT(end_time, start_time);
+
+	printf("Per loop time including overhead: %0.2f nsec\n",
+			INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+	printf("Histogram of timing durations:\n");
+	printf("%9s: %10s %9s\n", "< usec", "count", "percent");
+
+	found = 0;
+	for (i = 31; i >= 0; i--)
+	{
+		if (found || histogram[i])
+		{
+			found = 1;
+			printf("%9ld: %10ld %8.5f%%\n", 1l << i, histogram[i],
+				   (double) histogram[i] * 100 / loop_count);
+		}
+	}
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d4da4ee..97031dd 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -121,6 +121,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &pgstatstatements;
  &pgstattuple;
  &pgtestfsync;
+ &pgtesttiming;
  &pgtrgm;
  &pgupgrade;
  &seg;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index b5d3c6d..428a167 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -129,6 +129,7 @@
 <!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
 <!ENTITY pgstattuple     SYSTEM "pgstattuple.sgml">
 <!ENTITY pgtestfsync     SYSTEM "pgtestfsync.sgml">
+<!ENTITY pgtesttiming    SYSTEM "pgtesttiming.sgml">
 <!ENTITY pgtrgm          SYSTEM "pgtrgm.sgml">
 <!ENTITY pgupgrade       SYSTEM "pgupgrade.sgml">
 <!ENTITY seg             SYSTEM "seg.sgml">
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 8e695fd..5a8c6fc 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -770,7 +770,9 @@ ROLLBACK;
     network transmission costs and I/O conversion costs are not included.
     Second, the measurement overhead added by <command>EXPLAIN
     ANALYZE</command> can be significant, especially on machines with slow
-    <function>gettimeofday()</> operating-system calls.
+    <function>gettimeofday()</> operating-system calls. You can use the
+    <xref linkend="pgtesttiming"> tool to measure the overhead of timing
+    on your system.
    </para>
 
    <para>
diff --git a/doc/src/sgml/pgtesttiming.sgml b/doc/src/sgml/pgtesttiming.sgml
new file mode 100644
index 0000000..c3a8b1a
--- /dev/null
+++ b/doc/src/sgml/pgtesttiming.sgml
@@ -0,0 +1,261 @@
+<!-- doc/src/sgml/pgtesttiming.sgml -->
+
+<sect1 id="pgtesttiming" xreflabel="pg_test_timing">
+ <title>pg_test_timing</title>
+
+ <indexterm zone="pgtesttiming">
+  <primary>pg_test_timing</primary>
+ </indexterm>
+
+ <para>
+  <application>pg_test_timing</> is a tool to measure the timing overhead
+  on your system and confirm that the system time never moves backwards.
+  Systems that are slow to collect timing data can give less accurate 
+  <command>EXPLAIN ANALYZE</command> results.
+ </para>
+
+ <sect2>
+  <title>Usage</title>
+
+<synopsis>
+pg_test_timing [options]
+</synopsis>
+
+   <para>
+    <application>pg_test_timing</application> accepts the following
+    command-line options:
+
+    <variablelist>
+
+     <varlistentry>
+      <term><option>-d</option></term>
+      <term><option>--duration</option></term>
+      <listitem>
+       <para>
+        Specifies the number of seconds to run the test. Longer durations
+        give slightly better accuracy, and are more likely to discover
+        problems with the system clock moving backwards. The default
+        test duration is 3 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </para>
+
+ </sect2>
+
+ <sect2>
+  <title>Interpreting results</title>
+
+  <para>
+        Good results will show most (>90%) individual timing calls take less
+        than one microsecond. Average per loop overhead will be even lower,
+        below 100 nanoseconds. This example from an Intel i7-860 system using
+        a TSC clock source shows excellent performance:
+  </para>
+
+<screen>
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 35.96 nsec
+Histogram of timing durations:
+   < usec:      count   percent
+       16:          2  0.00000%
+        8:         13  0.00002%
+        4:        126  0.00015%
+        2:    2999652  3.59518%
+        1:   80435604 96.40465%
+</screen>
+
+  <para>
+        Note that different units are used for the per loop time than the
+        histogram. The loop can have resolution within a few nanoseconds
+        (nsec), while the individual timing calls can only resolve down to
+        one microsecond (usec).
+  </para>
+
+ </sect2>
+ <sect2>
+  <title>Measuring executor timing overhead</title>
+
+  <para>
+        When the query executor is running a statement using
+        <command>EXPLAIN ANALYZE</command>, individual operations are
+        timed as well as summarized.  The timing overhead on your system
+        can be checked by counting rows with the psql program:
+  </para>
+
+<screen>
+CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
+\timing
+SELECT COUNT(*) FROM t;
+EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+</screen>
+
+  <para>
+        The i7-860 system measured runs the count query in 9.8 ms while
+        the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms,
+        each processing just over 100,000 rows.  That 6.8 ms difference
+        means the timing overhead per row is 68 ns, about twice what
+        pg_test_timing estimated it would be.  Even that relatively
+        small amount of overhead is making the fully timed count statement
+        take almost 70% longer.  On more substantial queries, the
+        timing overhead would be less problematic.
+  </para>
+
+ </sect2>
+ <sect2>
+  <title>Changing time sources</title>
+  <para>
+        On some newer Linux systems, it's possible to change the clock
+        source used to collect timing data at any time.  A second example
+        shows the slowdown possible from switching to the slower acpi_pm
+        time source, on the same system used for the fast results above:
+  </para>
+
+<screen>
+# cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
+tsc hpet acpi_pm
+# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
+# pg_test_timing
+Per loop time including overhead: 722.92 nsec
+Histogram of timing durations:
+   < usec:      count   percent
+       16:          3  0.00007%
+        8:        563  0.01357%
+        4:       3241  0.07810%
+        2:    2990371 72.05956%
+        1:    1155682 27.84870%
+</screen>
+
+  <para>
+       In this configuration, the sample <command>EXPLAIN ANALYZE</command>
+       above takes 115.9 ms.  That's 1061 nsec of timing overhead, again
+       a small multiple of what's measured directly by this utility.
+       That much timing overhead means the actual query itself is only
+       taking a tiny fraction of the accounted-for time; most of it
+       is being consumed in overhead instead.  In this configuration,
+       any <command>EXPLAIN ANALYZE</command> totals involving many
+       timed operations would be inflated significantly by timing overhead.
+  </para>
+
+  <para>
+       FreeBSD also allows changing the time source on the fly, and
+       it logs information about the timer selected during boot:
+  </para>
+
+<screen>
+dmesg | grep "Timecounter"       
+sysctl kern.timecounter.hardware=TSC
+</screen>
+
+  <para>
+       Other systems may only allow setting the time source on boot.
+       On older Linux systems the "clock" kernel setting is the only way
+       to make this sort of change.  And even on some more recent ones,
+       the only option you'll see for a clock source is "jiffies".  Jiffies
+       are the older Linux software clock implementation, which can have
+       good resolution when it's backed by fast enough timing hardware,
+       as in this example:
+  </para>       
+
+<screen>
+$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
+jiffies 
+$ dmesg | grep time.c
+time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
+time.c: Detected 2400.153 MHz processor.
+$ ./pg_test_timing 
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 97.75 nsec
+Histogram of timing durations:
+   < usec:      count   percent
+       32:          1  0.00000%
+       16:          1  0.00000%
+        8:         22  0.00007%
+        4:       3010  0.00981%
+        2:    2993204  9.75277%
+        1:   27694571 90.23734%
+</screen>
+
+ </sect2>
+ <sect2>
+  <title>Clock hardware and timing accuracy</title>
+
+  <para>
+       Collecting accurate timing information is normally done on computers
+       using hardware clocks with various levels of accuracy.  With some
+       hardware the operating systems can pass the system clock time almost
+       directly to programs.  A system clock can also be derived from a chip
+       that simply provides timing interrupts, periodic ticks at some known
+       time interval.  In either case, operating system kernels provide
+       a clock source that hides these details.  But the accuracy of that
+       clock source and how quickly it can return results varies based
+       on the underlying hardware.
+  </para>
+
+  <para>
+        Inaccurate time keeping can result in system instability.  Test
+        any change to the clock source very carefully.  Operating system
+        defaults are sometimes made to favor reliability over best
+        accuracy. And if you are using a virtual machine, look into the
+        recommended time sources compatible with it.  Virtual hardware
+        faces additional difficulties when emulating timers, and there
+        are often per operating system settings suggested by vendors.
+  </para>
+
+  <para>
+        The Time Stamp Counter (TSC) clock source is the most accurate one
+        available on current generation CPUs. It's the preferred way to track
+        the system time when it's supported by the operating system and the
+        TSC clock is reliable. There are several ways that TSC can fail
+        to provide an accurate timing source, making it unreliable. Older
+        systems can have a TSC clock that varies based on the CPU
+        temperature, making it unusable for timing. Trying to use TSC on some
+        older multi-core CPUs can give a reported time that's inconsistent
+        among multiple cores. This can result in the time going backwards, a
+        problem this program checks for.  And even the newest systems can
+        fail to provide accurate TSC timing with very aggressive power saving
+        configurations.
+  </para>
+
+  <para>
+        Newer operating systems may check for the known TSC problems and
+        switch to a slower, more stable clock source when they are seen. 
+        If your system supports TSC time but doesn't default to that, it
+        may be disabled for a good reason.  And some operating systems may
+        not detect all the possible problems correctly, or will allow using
+        TSC even in situations where it's known to be inaccurate.
+  </para>
+
+  <para>
+        The High Precision Event Timer (HPET) is the preferred timer on
+        systems where it's available and TSC is not accurate.  The timer
+        chip itself is programmable to allow up to 100 nanosecond resolution,
+        but you may not see that much accuracy in your system clock.
+  </para>
+
+  <para>
+        Advanced Configuration and Power Interface (ACPI) provides a
+        Power Management (PM) Timer, which Linux refers to as the acpi_pm.
+        The clock derived from acpi_pm will at best provide 300 nanosecond
+        resolution.
+  </para>
+
+  <para>
+        Timers used on older PC hardware include the 8254 Programmable
+        Interval Timer (PIT), the real-time clock (RTC), the Advanced
+        Programmable Interrupt Controller (APIC) timer, and the Cyclone
+        timer.  These timers aim for millisecond resolution.
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Author</title>
+
+  <para>
+   Ants Aasma <email>ants.aasma@eesti.ee</email>
+  </para>
+ </sect2>
+
+</sect1>
#2 Jay Levitt
jay.levitt@gmail.com
In reply to: Greg Smith (#1)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

Greg Smith wrote:

Anyway, the patch now includes several examples and a short primer on
PC clock hardware, to help guide what good results look like and why they've
been impossible to obtain in the past. That's a bit Linux-centric, but the
hardware described covers almost all systems using Intel or AMD processors.
The only difference among most other operating systems is how aggressively
they have adopted newer timer hardware. At least this gives a way to measure
all of them.

N.B.: Windows has at least two clock APIs, timeGetTime and
QueryPerformanceCounter (and probably more, these days). They rely on
different hardware clocks, and can get out of sync with each other;
meanwhile, QueryPerformanceCounter can get out of sync with itself on
(older?) multi-CPU boards.
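
For illustration, a minimal sketch of watching the two APIs diverge--not
from any shipping code, error handling omitted, and timeGetTime() needs
winmm.lib:

#include <windows.h>
#include <stdio.h>

int
main(void)
{
	LARGE_INTEGER freq, qpc;
	LONGLONG qpc_start;
	DWORD mm_start, mm_now;

	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&qpc);
	qpc_start = qpc.QuadPart;
	mm_start = timeGetTime();

	for (;;)
	{
		Sleep(1000);
		QueryPerformanceCounter(&qpc);
		mm_now = timeGetTime();
		/* If the two APIs sit on different hardware clocks, this
		 * difference grows (or jumps) over time. */
		printf("drift: %.3f ms\n",
			   (qpc.QuadPart - qpc_start) * 1000.0 / freq.QuadPart -
			   (double) (mm_now - mm_start));
	}
	return 0;
}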

So if you're doing high-res timing, it's good to make sure you aren't
relying on two different clocks in different places... I ran into this with
MIDI drivers years ago and wrote a doc:

http://www.ultimatemetal.com/forum/equipment/315910-midi-latency-problem-nuendo.html#post6315034

and a clock-testing utility:

https://github.com/jaylevitt/miditime

#3 Greg Smith
greg@2ndQuadrant.com
In reply to: Jay Levitt (#2)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

On 02/22/2012 11:10 AM, Jay Levitt wrote:

N.B.: Windows has at least two clock APIs, timeGetTime and
QueryPerformanceCounter (and probably more, these days). They rely on
different hardware clocks, and can get out of sync with each other;
meanwhile, QueryPerformanceCounter can get out of sync with itself on
(older?) multi-CPU boards.

The PostgreSQL wrapper in src/include/portability/instr_time.h uses
QueryPerformanceCounter and QueryPerformanceFrequency in a way that lets
the result be used much like deltas of UNIX timestamps.
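
Simplified, it amounts to scaling a counter delta by the reported
frequency; a sketch with a made-up helper name, not the header's actual
macros:

#include <windows.h>

/* Hypothetical helper, for illustration only */
static double
elapsed_seconds(LARGE_INTEGER start, LARGE_INTEGER end)
{
	LARGE_INTEGER ticks_per_sec;

	QueryPerformanceFrequency(&ticks_per_sec);
	return (double) (end.QuadPart - start.QuadPart) / ticks_per_sec.QuadPart;
}

So a delta of two captured counter values behaves like a delta of two
UNIX timestamps, just at a higher resolution.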

As far as I've been able to tell, there aren't any issues unique to
Windows there. Multiple cores can have their TSC results get out of
sync on Windows for the same reason they do on Linux systems, and
there's also the same frequency/temperature issues.

Newer versions of Windows can use TSC, older versions only use HPET or
ACPI, and there's some ability to force bad TSC units to use ACPI
instead:

http://blogs.msdn.com/b/psssql/archive/2010/08/18/how-it-works-timer-outputs-in-sql-server-2008-r2-invariant-tsc.aspx

There is a lot of questionable behavior if you try to use the better
timers in Windows XP; check out the obnoxious footnote about XP SP3 at
http://en.wikipedia.org/wiki/High_Precision_Event_Timer for example.

Since there's little you can do about it--it's not even easy to figure
out which timer is being used--I didn't see any way to document what
Windows does here that would help anyone. For the most part, you get the
best timer Windows has a driver for with QueryPerformanceCounter, and
that's what the PostgreSQL code uses. The information I gave about how
newer systems can have more accurate timing still holds. Maybe it would
be useful to add something pointing out that newer Windows versions tend
to support faster timers, too. That's something I assume people would
guess from what I already wrote, but it may be worth stating explicitly.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#4 Marti Raudsepp
marti@juffo.org
In reply to: Greg Smith (#3)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

On Wed, Feb 22, 2012 at 18:44, Greg Smith <greg@2ndquadrant.com> wrote:

As far as I've been able to tell, there aren't any issues unique to Windows
there.  Multiple cores can have their TSC results get out of sync on Windows
for the same reason they do on Linux systems, and there's also the same
frequency/temperature issues.

Not on recent Linux kernel versions. Linux automatically detects when
the TSC is unstable (due to power management or out-of-sync
cores/sockets) and automatically falls back to the more expensive HPET
or ACPI methods.

e.g:
% dmesg |grep -i tsc
[ 0.000000] Fast TSC calibration using PIT
[ 0.164075] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[ 0.197062] Switching to clocksource tsc
[ 0.260960] Marking TSC unstable due to TSC halts in idle

Whether these tests cover 100% of the possible conditions, and whether
the detection has race conditions or not, I don't know.

Regards,
Marti

#5 Greg Smith
greg@2ndQuadrant.com
In reply to: Marti Raudsepp (#4)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

On 02/22/2012 12:25 PM, Marti Raudsepp wrote:

On Wed, Feb 22, 2012 at 18:44, Greg Smith<greg@2ndquadrant.com> wrote:

As far as I've been able to tell, there aren't any issues unique to Windows
there. Multiple cores can have their TSC results get out of sync on Windows
for the same reason they do on Linux systems, and there's also the same
frequency/temperature issues.

Not on recent Linux kernel versions. Linux automatically detects when
the TSC is unstable (due to power management or out-of-sync
cores/sockets) and automatically falls back to the more expensive HPET
or ACPI methods.

From the patch:

Newer operating systems may check for the known TSC problems and
switch to a slower, more stable clock source when they are seen.
If your system supports TSC time but doesn't default to that, it
may be disabled for a good reason.

I ran into a case like you're showing here in my longer exploration of
this at
http://archives.postgresql.org/message-id/4EDF1871.2010209@2ndQuadrant.com
I stopped just short of showing what the TSC error message looks
like. I hoped that with the above and some examples showing dmesg |
grep, that would be enough to lead people toward finding this on
their own.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#6 Marti Raudsepp
marti@juffo.org
In reply to: Greg Smith (#5)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

On Wed, Feb 22, 2012 at 19:36, Greg Smith <greg@2ndquadrant.com> wrote:

From the patch:

Newer operating systems may check for the known TSC problems and
switch to a slower, more stable clock source when they are seen.
If your system supports TSC time but doesn't default to that, it
may be disabled for a good reason.

Sorry, I was under the impression that the stability of Windows's
QueryPerformanceCounter() API is hardware-dependent, but I haven't
coded under Windows for a long time -- maybe it's improved in recent
versions.

Regards,
Marti

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#1)
Re: pg_test_timing tool for EXPLAIN ANALYZE overhead

On Wed, Feb 22, 2012 at 6:53 AM, Greg Smith <greg@2ndquadrant.com> wrote:

A look back on this now that I'm done with it does raise one large question
though.  I added some examples of how to measure timing overhead using psql.
 While I like the broken-down timing data this utility provides, I'm
not sure it's worth adding a contrib module just to get it.  An
extension that's packaged on something like PGXN and easy to
obtain?  Absolutely--but maybe that's a developer-only thing.  Maybe
the only code worth distributing is the little SQL example of how to measure
the overhead, along with some reference good/bad numbers.  That plus the
intro to timer trivia could turn this into a documentation section only, no
code change.  I've dreamed of running something like this on every system in
the build farm.  Even if that's a valuable exercise, it may only
be worth doing once, then reverting.

I had similar concerns, but decided to go ahead and commit this. It
doesn't relate particularly closely to the buffer I/O timings patch,
but I think it's worth having. We clearly need something that is
integrated with the PostgreSQL sources, because there is more than one
way to access timers, and this code will measure the overhead of doing
what PostgreSQL does, rather than the overhead of reading times in
some other way. Now, that doesn't preclude shipping this on PGXN, but
we've certainly got other things in contrib that are clearly a lot
less useful than this, and we've discussed before the folly of
shipping a database product without a full set of diagnostic tools.
Since this tool was developed to answer questions about whether
certain PostgreSQL options are safe to enable or whether they'll have
too much overhead, I think that's a sufficient reason to include it in
contrib, especially because you took the time to write some very good
documentation for it. I wonder whether we should perhaps move some of
this discussion of clock sources into the main documentation
somewhere, since it's not specifically related to this utility, but I
didn't try to do that for the moment and just left it as you had it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company