gettimeofday is at the end of its usefulness?

Started by Tom Lane over 11 years ago, 34 messages
#1 Tom Lane
tgl@sss.pgh.pa.us
1 attachment(s)

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

regression=# create table foo as select x as f1 from generate_series(1,1000000) x;
SELECT 1000000
regression=# vacuum foo;
VACUUM
regression=# explain analyze select * from foo;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.053..111.720 rows=1000000 loops=1)
Planning time: 0.222 ms
Execution time: 166.682 ms
(3 rows)

(and, btw, this is a debug build --- without assert and memory
context checks it'd be faster.)

The problem with this number is that a simple test program shows that
gettimeofday() requires about 40 nsec on this hardware. That means
that two-thirds of the above timing measurement is overhead.

To add insult to injury, gettimeofday's output cannot be more precise than
1 microsecond, making its use for measuring sub-microsecond intervals at
best stochastic.

I looked around a bit and found that recent versions of POSIX have a
function clock_gettime() that's a bit more modern than gettimeofday():
at least, the output struct provides nsec rather than usec available
precision. I benchmarked this, with the CLOCK_REALTIME selector, and
found that it also requires about 40nsec, while the output is actually
good to perhaps 10nsec precision. (I base this on seeing no duplicate
readings in a tight loop, so that the value is certainly getting advanced
more often than once every 40 nsec.)

There's also a CLOCK_REALTIME_COARSE selector, which is noticeably faster
--- about 10nsec for me --- but the output appears to only advance once
every millisecond, so it's probably useless for our purposes.  The other
selectors mentioned in the Linux man page are considerably slower than
CLOCK_REALTIME for me, suggesting that they actually call into the kernel.

I also tried a loop around a bare "rdtsc" assembly instruction, finding
that that instruction takes about 10nsec. That would be a nice
improvement over gettimeofday, except that using that directly would
involve dealing with cross-CPU skew, which seems like no fun at all.
And I don't really want to get into finding equivalents for non-Intel
architectures, either.
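
For reference, a loop of roughly this shape reproduces those numbers (a
minimal sketch, not the attached testclock.c; rdtsc is x86-only, and older
glibc needs -lrt for clock_gettime):

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define NLOOPS 10000000

static inline uint64_t
rdtsc(void)
{
	uint32_t	lo,
				hi;

	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t) hi << 32) | lo;
}

/* nanoseconds between two CLOCK_MONOTONIC readings */
static double
elapsed_ns(const struct timespec *t0, const struct timespec *t1)
{
	return (t1->tv_sec - t0->tv_sec) * 1e9 + (t1->tv_nsec - t0->tv_nsec);
}

int
main(void)
{
	struct timespec t0,
				t1,
				ts;
	struct timeval tv;
	uint64_t	sum = 0;
	int			i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NLOOPS; i++)
		gettimeofday(&tv, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("gettimeofday: %.1f ns/call\n", elapsed_ns(&t0, &t1) / NLOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NLOOPS; i++)
		clock_gettime(CLOCK_REALTIME, &ts);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("clock_gettime(CLOCK_REALTIME): %.1f ns/call\n",
		   elapsed_ns(&t0, &t1) / NLOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NLOOPS; i++)
		sum += rdtsc();			/* keep the loop from being optimized out */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("rdtsc: %.1f ns/call (ignore: %llu)\n",
		   elapsed_ns(&t0, &t1) / NLOOPS, (unsigned long long) sum);

	return 0;
}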

Anyway it looks like clock_gettime() might be worth using on Linux
just for the more precise output. It doesn't seem to exist on OS X
though, and I have no idea about elsewhere.

I'm curious if anybody has ideas about other things we might do for
portable high-precision timing.

regards, tom lane

Attachments:

testclock.c (text/x-c)
#2 Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#1)
Re: gettimeofday is at the end of its usefulness?

On Tue, May 13, 2014 at 3:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There's also a CLOCK_REALTIME_COARSE selector, which is noticeably faster
--- about 10nsec for me --- but the output appears to only advance once
every millisecond, so it's probably useless for our purposes.  The other
selectors mentioned in the Linux man page are considerably slower than
CLOCK_REALTIME for me, suggesting that they actually call into the kernel.

What Linux kernel version is in use here? Apparently, as I think
you've stated another way, more recent versions have VDSO for this,
which can make a big difference. This article seems like a sensible
guide to all of this:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html

CLOCK_REALTIME_COARSE seemingly influences precision in a way that
allows user space applications to decide on their precision/cost
trade-off, rather than being forced to use the system default (that
procfs surfaces) through gettimeofday():
http://lwn.net/Articles/342018/

I can see a benefit in exposing this trade-off to Postgres code
directly. I still think that a correlated reference period will prove
useful, and while there are a number of ways to amortize the cost of
repeatedly (coarsely) getting the wall time in the ordinary course of
choosing victim buffers, it would be nice to do this too.
--
Peter Geoghegan


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#2)
Re: gettimeofday is at the end of its usefulness?

Peter Geoghegan <pg@heroku.com> writes:

On Tue, May 13, 2014 at 3:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There's also a CLOCK_REALTIME_COARSE selector, which is noticeably faster
--- about 10nsec for me --- but the output appears to only advance once
every millisecond, so it's probably useless for our purposes.  The other
selectors mentioned in the Linux man page are considerably slower than
CLOCK_REALTIME for me, suggesting that they actually call into the kernel.

What Linux kernel version is in use here?

Ah, sorry, I should have specified. This is RHEL6.5, current kernel
version 2.6.32-431.17.1.el6.x86_64.

Apparently, as I think
you've stated another way, more recent versions have VDSO for this,
which can make a big difference. This article seems like a sensible
guide to all of this:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html

This appears to be talking about RHEL5, which is quite a bit older
(and, I'd guess, trailing edge for anybody who might deploy PG 9.5).
I did confirm that /proc/sys/kernel/vsyscall64 exists and has a default
setting of 1 on RHEL6. Setting it to 0 causes gettimeofday to take
150ns, which probably represents the time for a trivial kernel call.
The MRG extension described on the linked page doesn't seem to be
implemented in stock RHEL6 (setting vsyscall64 to 2 is allowed but
doesn't change behavior compared to 1). However, if I'm reading it
right, all that does is make gettimeofday behave like
clock_gettime(CLOCK_REALTIME_COARSE).

CLOCK_REALTIME_COARSE seemingly influences precision in a way that
allows user space applications to decide on their precision/cost
trade-off, rather than being forced to use the system default (that
procfs surfaces) through gettimeofday():
http://lwn.net/Articles/342018/

Yeah, I think these are the same implementations exposed to apps in two
different ways, one being a system-wide switch affecting gettimeofday()
while the other allows the app source code to say which one it wants.

regards, tom lane


#4 Greg Stark
stark@mit.edu
In reply to: Tom Lane (#1)
Re: gettimeofday is at the end of its usefulness?

On Tue, May 13, 2014 at 11:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I also tried a loop around a bare "rdtsc" assembly instruction, finding
that that instruction takes about 10nsec. That would be a nice
improvement over gettimeofday, except that using that directly would
involve dealing with cross-CPU skew, which seems like no fun at all.
And I don't really want to get into finding equivalents for non-Intel
architectures, either.

I always assumed the kernel used rdtsc to implement some of the high
performance timers. It can save the current time in a mapped page when
it schedules a process and then in the vdso syscall (ie in user-space)
it can use rdtsc to calculate the offset needed to adjust that
timestamp to the current time. This seems consistent with your
calculations that showed the 40ns overhead with +/- 10ns precision.

I actually think it would be more interesting if we could measure the
overhead and adjust for it. I don't think people are really concerned
with how long EXPLAIN ANALYZE takes to run if they could get accurate
numbers out of it.

Other profiling tools I poked at in the past ran a tight loop around
the profiling code to estimate the time it actually took and then
subtracted that from all the measurements. I think that might work for
the actual clock_gettime overhead. If we did that then we could call
it twice and measure the time spent in the rest of the EXPLAIN ANALYZE
code and subtract that plus the time for the two clock_gettimes from
the run-time...
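
Concretely, the calibrate-then-subtract part might look like this (a sketch
only; a real version would average several calibration runs and guard
against being descheduled mid-loop):

#include <time.h>

#define CALIBRATION_LOOPS 1000000

static double clock_overhead_ns;

/* Estimate the cost of one clock reading by timing a tight loop of them */
static void
calibrate_clock_overhead(void)
{
	struct timespec t0,
				t1,
				tmp;
	long		i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < CALIBRATION_LOOPS; i++)
		clock_gettime(CLOCK_MONOTONIC, &tmp);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	clock_overhead_ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
						 (t1.tv_nsec - t0.tv_nsec)) / CALIBRATION_LOOPS;
}

/* An interval measurement with one reading's worth of overhead removed */
static double
corrected_interval_ns(const struct timespec *start,
					  const struct timespec *stop)
{
	double		raw = (stop->tv_sec - start->tv_sec) * 1e9 +
					  (stop->tv_nsec - start->tv_nsec);

	return (raw > clock_overhead_ns) ? raw - clock_overhead_ns : 0.0;
}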

--
greg


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#4)
Re: gettimeofday is at the end of its usefulness?

Greg Stark <stark@mit.edu> writes:

I actually think it would be more interesting if we could measure the
overhead and adjust for it.

Actually, that's quite a good thought. The overhead should be a pretty
stable number on any given machine, so in theory we could do this to
high precision. And the numbers I just showed say that on current
x86_64 platforms, the *best we could possibly hope for* in terms of
direct overhead reduction is about 4x. Which is something, but it
hardly makes the problem vanish.

I have a vague feeling that we discussed subtract-the-overhead once before
and thought it wasn't necessary yet. Maybe it's time.

But we also need to be using something that gives better than 1usec
resolution. So I'm still thinking we should use clock_gettime() where
available, and look for alternative APIs where not.

regards, tom lane


#6 Jeff Janes
jeff.janes@gmail.com
In reply to: Tom Lane (#1)
Re: gettimeofday is at the end of its usefulness?

On Tuesday, May 13, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

regression=# create table foo as select x as f1 from generate_series(1,1000000) x;
SELECT 1000000
regression=# vacuum foo;
VACUUM
regression=# explain analyze select * from foo;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.053..111.720 rows=1000000 loops=1)
Planning time: 0.222 ms
Execution time: 166.682 ms
(3 rows)

(and, btw, this is a debug build --- without assert and memory
context checks it'd be faster.)

The problem with this number is that a simple test program shows that
gettimeofday() requires about 40 nsec on this hardware. That means
that two-thirds of the above timing measurement is overhead.

I'm all for finding something better if we can, but in the meantime this
is certainly not unexpected, and isn't it exactly what "explain
(analyze,timing off)" was invented for?

Cheers,

Jeff

#7 Ants Aasma
ants@cybertec.at
In reply to: Greg Stark (#4)
Re: gettimeofday is at the end of its usefulness?

On Wed, May 14, 2014 at 6:34 AM, Greg Stark <stark@mit.edu> wrote:

I always assumed the kernel used rdtsc to implement some of the high
performance timers. It can save the current time in a mapped page when
it schedules a process and then in the vdso syscall (ie in user-space)
it can use rdtsc to calculate the offset needed to adjust that
timestamp to the current time. This seems consistent with your
calculations that showed the 40ns overhead with +/- 10ns precision.

Both gettimeofday and clock_gettime do exactly that. [1]
clock_gettime(CLOCK_MONOTONIC) is the mode of operation we would want
to use here.

I actually think it would be more interesting if we could measure the
overhead and adjust for it. I don't think people are really concerned
with how long EXPLAIN ANALYZE takes to run if they could get accurate
numbers out of it.

Measuring would also be a good idea so we can automatically turn on
performance counters like IO timing when we know it's not obscenely
expensive.

However, subtracting the overhead will still skew the numbers somewhat
by giving more breathing time for memory and IO prefetching
mechanisms. Another option to consider would be to add a sampling
based mechanism for low overhead time attribution. It would be even
better if we could distinguish between time spent waiting on locks vs.
waiting on IO vs. waiting to be scheduled vs. actually executing.

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/vdso/vclock_gettime.c#L223

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#4)
Re: gettimeofday is at the end of its usefulness?

On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote:

I always assumed the kernel used rdtsc to implement some of the high
performance timers. It can save the current time in a mapped page when
it schedules a process and then in the vdso syscall (ie in user-space)
it can use rdtsc to calculate the offset needed to adjust that
timestamp to the current time. This seems consistent with your
calculations that showed the 40ns overhead with +/- 10ns precision.

Crazy idea: Instead of trying to time precisely the amount of time we
spend in each node, configure a very-high frequency timer interrupt
(or background thread?) that does:

SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++;
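
A sketch of that with setitimer()/SIGPROF (ProfNode and CurrentProfNode are
hypothetical stand-ins for PlanState and a pointer the executor would keep
up to date as it enters and leaves nodes):

#include <signal.h>
#include <stddef.h>
#include <sys/time.h>

typedef struct ProfNode
{
	volatile sig_atomic_t profiling_counter;
} ProfNode;

static ProfNode *volatile CurrentProfNode = NULL;

/* charge one tick to whatever node is currently executing */
static void
profile_tick(int signo)
{
	ProfNode   *node = CurrentProfNode;

	if (node != NULL)
		node->profiling_counter++;
}

static void
start_profiling_timer(void)
{
	struct sigaction sa;
	struct itimerval it;

	sa.sa_handler = profile_tick;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	sigaction(SIGPROF, &sa, NULL);

	/* 10kHz of process CPU time; ITIMER_PROF doesn't tick while blocked */
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = 100;
	it.it_value = it.it_interval;
	setitimer(ITIMER_PROF, &it, NULL);
}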

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Robert Haas (#8)
Re: gettimeofday is at the end of its usefulness?

I posted this on this mailing list before at Jane Street we have developed
very fast code to get timing information based on TSC if available. It's
all ocaml but well documented and mostly just calls to c functions so
should be easy to port to C and we release it under a very liberal license
so it should be no problem to take the ideas:

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli
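
The gist of the approach, as a rough C sketch (not a port of the OCaml
code): read the TSC on the fast path, and occasionally re-estimate the TSC
frequency against the wall clock. Call tsc_recalibrate() twice, some time
apart, before trusting tsc_now_ns(); cross-CPU skew and migration are
deliberately ignored here:

#include <stdint.h>
#include <time.h>

typedef struct tsc_calibration
{
	uint64_t	base_tsc;		/* TSC reading at last calibration */
	uint64_t	base_ns;		/* wall clock (ns) at last calibration */
	double		ns_per_tick;	/* estimated nanoseconds per TSC tick */
} tsc_calibration;

static inline uint64_t
read_tsc(void)
{
	uint32_t	lo,
				hi;

	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t) hi << 32) | lo;
}

static uint64_t
wall_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
}

/* Snapshot (tsc, wall clock) and re-estimate the tick length */
static void
tsc_recalibrate(tsc_calibration *cal)
{
	uint64_t	tsc = read_tsc();
	uint64_t	ns = wall_ns();

	if (cal->base_tsc != 0 && tsc > cal->base_tsc)
		cal->ns_per_tick = (double) (ns - cal->base_ns) /
			(double) (tsc - cal->base_tsc);
	cal->base_tsc = tsc;
	cal->base_ns = ns;
}

/* Cheap current-time estimate: one rdtsc, one multiply, one add */
static uint64_t
tsc_now_ns(const tsc_calibration *cal)
{
	return cal->base_ns +
		(uint64_t) ((double) (read_tsc() - cal->base_tsc) * cal->ns_per_tick);
}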

Hope this is useful.

Bene

On Wed, May 14, 2014 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:


On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote:

I always assumed the kernel used rdtsc to implement some of the high
performance timers. It can save the current time in a mapped page when
it schedules a process and then in the vdso syscall (ie in user-space)
it can use rdtsc to calculate the offset needed to adjust that
timestamp to the current time. This seems consistent with your
calculations that showed the 40ns overhead with +/- 10ns precision.

Crazy idea: Instead of trying to time precisely the amount of time we
spend in each node, configure a very-high frequency timer interrupt
(or background thread?) that does:

SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Benedikt Grundmann (#9)
Re: gettimeofday is at the end of its usefulness?

On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann <bgrundmann@janestreet.com> wrote:

I posted this on this mailing list before at Jane Street we have developed
very fast code to get timing information based on TSC if available. It's
all ocaml but well documented and mostly just calls to c functions so
should be easy to port to C and we release it under a very liberal license
so it should be no problem to take the ideas:

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli

Hope this is useful.

Bene

Also I'm sorry for top posting. hackers is the only mailing list I'm on
that requires this and some others require top posting, so this runs counter
to my habits and I only realized after sending...


On Wed, May 14, 2014 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote:

I always assumed the kernel used rdtsc to implement some of the high
performance timers. It can save the current time in a mapped page when
it schedules a process and then in the vdso syscall (ie in user-space)
it can use rdtsc to calculate the offset needed to adjust that
timestamp to the current time. This seems consistent with your
calculations that showed the 40ns overhead with +/- 10ns precision.

Crazy idea: Instead of trying to time precisely the amount of time we
spend in each node, configure a very-high frequency timer interrupt
(or background thread?) that does:

SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Greg Stark
stark@mit.edu
In reply to: Benedikt Grundmann (#9)
Re: gettimeofday is at the end of its usefulness?

On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:

I posted this on this mailing list before at Jane Street we have developed
very fast code to get timing information based on TSC if available. It's
all ocaml but well documented and mostly just calls to c functions so should
be easy to port to C and we release it under a very liberal license so it
should be no problem to take the ideas:

What OS do you run it on though? How fast is your implementation
compared to the kernel implementation of clock_gettime()?

Are you sure your implementation is actually faster? And are you sure
you're protected against clocks going backwards? I think you should
put some i/o in the loop in the test and start several threads running
it to make it more likely the thread is rescheduled to a different
processor during the test. I suspect you'll find the rdtsc goes
backwards sometimes or produces crazy results when switching
processors.

--
greg


#12 Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Greg Stark (#11)
Re: gettimeofday is at the end of its usefulness?

On Thu, May 15, 2014 at 11:31 AM, Greg Stark <stark@mit.edu> wrote:

On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:

I posted this on this mailing list before at Jane Street we have developed
very fast code to get timing information based on TSC if available. It's
all ocaml but well documented and mostly just calls to c functions so should
be easy to port to C and we release it under a very liberal license so it
should be no problem to take the ideas:

What OS do you run it on though? How fast is your implementation
compared to the kernel implementation of clock_gettime()?

Are you sure your implementation is actually faster? And are you sure
you're protected against clocks going backwards? I think you should
put some i/o in the loop in the test and start several threads running
it to make it more likely the thread is rescheduled to a different
processor during the test. I suspect you'll find the rdtsc goes
backwards sometimes or produces crazy results when switching
processors.

There are benchmarks in the link I posted (obtained by a micro benchmarking
library we developed / use internally which takes great care to obtain
reliable numbers). We use posix threads extensively. We internally spend
a lot of time setting up ntp and monitoring systems so that clock backwards
never happens (in other words, I wouldn't be surprised if the library does
NOT work correctly when it does -- our protection is outside). I do not
believe we have seen the rdtsc going backwards on the thread context switch
you mention (and as said we use lots of threads). OS? Centos 6.5
primarily.


#13 Andres Freund
andres@2ndquadrant.com
In reply to: Benedikt Grundmann (#12)
Re: gettimeofday is at the end of its usefulness?

On 2014-05-15 12:04:25 +0100, Benedikt Grundmann wrote:

On Thu, May 15, 2014 at 11:31 AM, Greg Stark <stark@mit.edu> wrote:

On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:

I posted this on this mailing list before at Jane Street we have developed
very fast code to get timing information based on TSC if available. It's
all ocaml but well documented and mostly just calls to c functions so should
be easy to port to C and we release it under a very liberal license so it
should be no problem to take the ideas:

What OS do you run it on though? How fast is your implementation
compared to the kernel implementation of clock_gettime()?

Are you sure your implementation is actually faster? And are you sure
you're protected against clocks going backwards? I think you should
put some i/o in the loop in the test and start several threads running
it to make it more likely the thread is rescheduled to a different
processor during the test. I suspect you'll find the rdtsc goes
backwards sometimes or produces crazy results when switching
processors.

There are benchmarks in the link I posted (obtained by a micro benchmarking
library we developed / use internally which takes great care to obtain
reliable numbers). We use posix threads extensively. We internally spend
a lot of time setting up ntp and monitoring systems so that clock backwards
never happens (in other words, I wouldn't be surprised if the library does
NOT work correctly when it does -- our protection is outside). I do not
believe we have seen the rdtsc going backwards on the thread context switch
you mention (and as said we use lots of threads). OS? Centos 6.5
primarily.

Did you test it on a server with more than one socket (i.e. not just
multiple cores, but distinct CPU packages)? That's where you'd expect
differences in TSC to have funny effects.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: gettimeofday is at the end of its usefulness?

Hi,

On 2014-05-13 18:58:11 -0400, Tom Lane wrote:

Anyway it looks like clock_gettime() might be worth using on Linux
just for the more precise output. It doesn't seem to exist on OS X
though, and I have no idea about elsewhere.

Agreed that using clock_gettime() would be a good idea. I'd say we
should have a wrapper around it that is able to provide nanosecond
precision. If only gettimeofday() (and whatever Windows is using) is
available, we can dynamically fall back to that.
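
Something along these lines, perhaps (a sketch only: pg_hires_time and
pg_get_hires_time are made-up names, and this uses a compile-time
HAVE_CLOCK_GETTIME check where the text above suggests a dynamic fallback;
callers always see nanoseconds, even if the clock only has microseconds):

#include <sys/time.h>
#include <time.h>

typedef struct pg_hires_time
{
	long long	sec;
	long		nsec;
} pg_hires_time;

static void
pg_get_hires_time(pg_hires_time *t)
{
#ifdef HAVE_CLOCK_GETTIME
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	t->sec = ts.tv_sec;
	t->nsec = ts.tv_nsec;
#else
	struct timeval tv;

	gettimeofday(&tv, NULL);
	t->sec = tv.tv_sec;
	t->nsec = tv.tv_usec * 1000L;	/* only usec resolution here */
#endif
}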

I'm curious if anybody has ideas about other things we might do for
portable high-precision timing.

It's far from a solve-it-all, but can we perhaps try to coalesce
repeated time measurements? We'll very frequently do a
InstrStopNode();
/* minimal amount of work */
InstrStartNode();
which will measure the time twice. I think there's a fair number of
scenarios where once would be enough. I'll freely admit that I haven't
looked enough to determine how we could do that API-wise.
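
One conceivable shape for it (Instrumentation and the INSTR_TIME_* macros
are the existing ones from instrument.h/instr_time.h; InstrStopStartNode
itself is hypothetical):

#include "postgres.h"
#include "executor/instrument.h"

/*
 * Combined call: one clock reading both closes out the node we are
 * leaving and starts the node we are entering, instead of the two
 * readings that InstrStopNode() + InstrStartNode() cost today.
 */
static void
InstrStopStartNode(Instrumentation *stopping, Instrumentation *starting,
				   double nTuples)
{
	instr_time	now;

	INSTR_TIME_SET_CURRENT(now);

	/* as in InstrStopNode(), but with a caller-supplied timestamp */
	INSTR_TIME_ACCUM_DIFF(stopping->counter, now, stopping->starttime);
	INSTR_TIME_SET_ZERO(stopping->starttime);
	stopping->tuplecount += nTuples;

	/* as in InstrStartNode(), reusing the same reading */
	starting->starttime = now;
}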

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#1)
Re: gettimeofday is at the end of its usefulness?

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

I assume you ran pg_test_timing too:

Testing timing overhead for 3 seconds.
Per loop time including overhead: 41.70 nsec
Histogram of timing durations:
  < usec   % of total      count
       1     95.83035   68935459
       2      4.16923    2999133
       4      0.00037        268
       8      0.00004         31
      16      0.00000          1
      32      0.00000          1

My overhead of 41.70 nsec matches yours.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


#16 Thom Brown
thom@linux.com
In reply to: Bruce Momjian (#15)
Re: gettimeofday is at the end of its usefulness?

On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

I assume you ran pg_test_timing too:

Testing timing overhead for 3 seconds.
Per loop time including overhead: 41.70 nsec
Histogram of timing durations:
  < usec   % of total      count
       1     95.83035   68935459
       2      4.16923    2999133
       4      0.00037        268
       8      0.00004         31
      16      0.00000          1
      32      0.00000          1

My overhead of 41.70 nsec matches yours.

Did this idea die, or is it still worth considering?

Thom

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thom Brown (#16)
Re: gettimeofday is at the end of its usefulness?

Thom Brown <thom@linux.com> writes:

On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

Did this idea die, or is it still worth considering?

We still have a problem, for sure. I'm not sure that there was any
consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
if available would be a straightforward change that should ameliorate
gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
anything to fix the excessive-overhead problem. The ideas about the
latter were all over the map, and none of them looked easy.

If you're feeling motivated to work on this area, feel free.

regards, tom lane


#18 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#17)
Re: gettimeofday is at the end of its usefulness?

On 6/8/16 9:56 AM, Tom Lane wrote:

Thom Brown <thom@linux.com> writes:

On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

Did this idea die, or is it still worth considering?

We still have a problem, for sure. I'm not sure that there was any
consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
if available would be a straightforward change that should ameliorate
gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
anything to fix the excessive-overhead problem. The ideas about the
latter were all over the map, and none of them looked easy.

If you're feeling motivated to work on this area, feel free.

Semi-related: someone (Robert I think) recently mentioned investigating
"vectorized" executor nodes, where multiple tuples would be processed in
one shot. If we had that, presumably the explain penalty would be a moot
point.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461


#19 Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#18)
Re: gettimeofday is at the end of its usefulness?

On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Semi-related: someone (Robert I think) recently mentioned investigating
"vectorized" executor nodes, where multiple tuples would be processed in one
shot. If we had that presumably the explain penalty would be a moot point.

Yeah, both Andres and I are interested in that, and I think he's
actively working on it. It would be quite neat if this had the effect
of reducing EXPLAIN ANALYZE's overhead to something trivial.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#19)
Re: gettimeofday is at the end of its usefulness?

On 2016-07-07 14:43:31 -0400, Robert Haas wrote:

On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Semi-related: someone (Robert I think) recently mentioned investigating
"vectorized" executor nodes, where multiple tuples would be processed in one
shot. If we had that presumably the explain penalty would be a moot point.

Yeah, both Andres and I are interested in that, and I think he's
actively working on it. It would be quite neat if this had the effect
of reducing EXPLAIN ANALYZE's overhead to something trivial.

I am, and it does reduce the overhead. Depends on the type of plan
though. Index nestloops e.g. don't benefit on the inner side.

Andres


#21 Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Tom Lane (#17)
Re: gettimeofday is at the end of its usefulness?

On Thu, Jun 9, 2016 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thom Brown <thom@linux.com> writes:

On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

Did this idea die, or is it still worth considering?

We still have a problem, for sure. I'm not sure that there was any
consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
if available would be a straightforward change that should ameliorate
gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
anything to fix the excessive-overhead problem. The ideas about the
latter were all over the map, and none of them looked easy.

If you're feeling motivated to work on this area, feel free.

How about using both CLOCK_REALTIME and CLOCK_REALTIME_COARSE
as the clock IDs in clock_gettime() wherever applicable? The COARSE option
would be used wherever no precise timing calculation is required, because
on my laptop there is a significant performance difference (about 8x)
compared to CLOCK_REALTIME.

If it is fine, I will try to update the code and send a patch.
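
In outline, that could look like this (pg_clock_gettime is a made-up name;
CLOCK_REALTIME_COARSE is Linux-specific, so a real patch would need a
configure test and a fallback):

#include <stdbool.h>
#include <time.h>

/* One entry point, two costs: callers say whether they need precision */
static void
pg_clock_gettime(bool precise, struct timespec *ts)
{
#ifdef CLOCK_REALTIME_COARSE
	clock_gettime(precise ? CLOCK_REALTIME : CLOCK_REALTIME_COARSE, ts);
#else
	(void) precise;				/* no cheap clock available; ignore hint */
	clock_gettime(CLOCK_REALTIME, ts);
#endif
}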

Regards,
Hari Babu
Fujitsu Australia


#22 Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Haribabu Kommi (#21)
1 attachment(s)
Re: gettimeofday is at the end of its usefulness?

On Thu, Aug 25, 2016 at 3:12 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Thu, Jun 9, 2016 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thom Brown <thom@linux.com> writes:

On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:

A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

Did this idea die, or is it still worth considering?

We still have a problem, for sure. I'm not sure that there was any
consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
if available would be a straightforward change that should ameliorate
gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
anything to fix the excessive-overhead problem. The ideas about the
latter were all over the map, and none of them looked easy.

If you're feeling motivated to work on this area, feel free.

How about using both CLOCK_REALTIME and CLOCK_REALTIME_COARSE
as the clock IDs in clock_gettime() wherever applicable? The COARSE option
would be used wherever no precise timing calculation is required, because
on my laptop there is a significant performance difference (about 8x)
compared to CLOCK_REALTIME.

If it is fine, I will try to update the code and send a patch.

Attached a patch that replaces most of the gettimeofday() function calls,
except timeofday() (user callable) and the GetCurrentTimestamp functions.

I didn't yet add any configure checks, or the fallback logic to
gettimeofday() in case the clock_gettime function is not available.

Any comments on proceeding further?

Regards,
Hari Babu
Fujitsu Australia

Attachments:

clock_gettime_1.patch (application/octet-stream)
diff --git a/contrib/pgcrypto/fortuna.c b/contrib/pgcrypto/fortuna.c
index 5028203..1654c41 100644
--- a/contrib/pgcrypto/fortuna.c
+++ b/contrib/pgcrypto/fortuna.c
@@ -31,7 +31,6 @@
 
 #include "postgres.h"
 
-#include <sys/time.h>
 #include <time.h>
 
 #include "px.h"
@@ -89,8 +88,8 @@
  */
 #define NUM_POOLS		23
 
-/* in microseconds */
-#define RESEED_INTERVAL 100000	/* 0.1 sec */
+/* in nanoseconds */
+#define RESEED_INTERVAL 100000000	/* 0.1 sec */
 
 /* for one big request, reseed after this many bytes */
 #define RESEED_BYTES	(1024*1024)
@@ -123,7 +122,7 @@ struct fortuna_state
 	MD_CTX		pool[NUM_POOLS];
 	CIPH_CTX	ciph;
 	unsigned	reseed_count;
-	struct timeval last_reseed_time;
+	struct timespec last_reseed_time;
 	unsigned	pool0_bytes;
 	unsigned	rnd_pos;
 	int			tricks_done;
@@ -223,10 +222,10 @@ static int
 enough_time_passed(FState *st)
 {
 	int			ok;
-	struct timeval tv;
-	struct timeval *last = &st->last_reseed_time;
+	struct timespec tv;
+	struct timespec *last = &st->last_reseed_time;
 
-	gettimeofday(&tv, NULL);
+	clock_gettime(CLOCK_REALTIME, &tv);
 
 	/* check how much time has passed */
 	ok = 0;
@@ -234,10 +233,10 @@ enough_time_passed(FState *st)
 		ok = 1;
 	else if (tv.tv_sec == last->tv_sec + 1)
 	{
-		if (1000000 + tv.tv_usec - last->tv_usec >= RESEED_INTERVAL)
+		if (1000000000 + tv.tv_nsec - last->tv_nsec >= RESEED_INTERVAL)
 			ok = 1;
 	}
-	else if (tv.tv_usec - last->tv_usec >= RESEED_INTERVAL)
+	else if (tv.tv_nsec - last->tv_nsec >= RESEED_INTERVAL)
 		ok = 1;
 
 	/* reseed will happen, update last_reseed_time */
diff --git a/contrib/pgcrypto/random.c b/contrib/pgcrypto/random.c
index d72679e..34bf40e 100644
--- a/contrib/pgcrypto/random.c
+++ b/contrib/pgcrypto/random.c
@@ -162,7 +162,6 @@ try_win32_perfc(uint8 *dst)
 #define TRY_UNIXSTD
 
 #include <sys/types.h>
-#include <sys/time.h>
 #include <time.h>
 #include <unistd.h>
 
@@ -178,7 +177,7 @@ try_unix_std(uint8 *dst)
 	pid_t		pid;
 	int			x;
 	PX_MD	   *md;
-	struct timeval tv;
+	struct timespec tv;
 	int			res;
 
 	/* process id */
@@ -187,7 +186,7 @@ try_unix_std(uint8 *dst)
 	dst += sizeof(pid);
 
 	/* time */
-	gettimeofday(&tv, NULL);
+	clock_gettime(CLOCK_REALTIME, &tv);	/* want the low-order bits as entropy */
 	memcpy(dst, (uint8 *) &tv, sizeof(tv));
 	dst += sizeof(tv);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..1e57a27 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4765,24 +4765,24 @@ BootStrapXLOG(void)
 	char	   *recptr;
 	bool		use_existent;
 	uint64		sysidentifier;
-	struct timeval tv;
+	struct timespec tv;
 	pg_crc32c	crc;
 
 	/*
 	 * Select a hopefully-unique system identifier code for this installation.
-	 * We use the result of gettimeofday(), including the fractional seconds
+	 * We use the result of clock_gettime(), including the fractional seconds
 	 * field, as being about as unique as we can easily get.  (Think not to
 	 * use random(), since it hasn't been seeded and there's no portable way
 	 * to seed it other than the system clock value...)  The upper half of the
 	 * uint64 value is just the tv_sec part, while the lower half contains the
-	 * tv_usec part (which must fit in 20 bits), plus 12 bits from our current
+	 * tv_nsec part scaled to usec (to fit in 20 bits), plus 12 bits from our current
 	 * PID for a little extra uniqueness.  A person knowing this encoding can
 	 * determine the initialization time of the installation, which could
 	 * perhaps be useful sometimes.
 	 */
-	gettimeofday(&tv, NULL);
+	clock_gettime(CLOCK_REALTIME, &tv);
 	sysidentifier = ((uint64) tv.tv_sec) << 32;
-	sysidentifier |= ((uint64) tv.tv_usec) << 12;
+	sysidentifier |= ((uint64) (tv.tv_nsec / 1000)) << 12;
 	sysidentifier |= getpid() & 0xFFF;
 
 	/* First timeline ID is always 1 */
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index d907e6b..d0652bb 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -19,6 +19,7 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>
+#include <time.h>
 #include <unistd.h>
 
 #include "common/ip.h"
@@ -2434,7 +2435,7 @@ CheckRADIUSAuth(Port *port)
 	char		portstr[128];
 	ACCEPT_TYPE_ARG3 addrsize;
 	fd_set		fdset;
-	struct timeval endtime;
+	struct timespec endtime;
 	int			i,
 				j,
 				r;
@@ -2622,17 +2623,17 @@ CheckRADIUSAuth(Port *port)
 	 * the latch was set would improve the responsiveness to
 	 * timeouts/cancellations.
 	 */
-	gettimeofday(&endtime, NULL);
+	clock_gettime(CLOCK_REALTIME_COARSE, &endtime);
 	endtime.tv_sec += RADIUS_TIMEOUT;
 
 	while (true)
 	{
 		struct timeval timeout;
-		struct timeval now;
+		struct timespec now;
 		int64		timeoutval;
 
-		gettimeofday(&now, NULL);
-		timeoutval = (endtime.tv_sec * 1000000 + endtime.tv_usec) - (now.tv_sec * 1000000 + now.tv_usec);
+		clock_gettime(CLOCK_REALTIME, &now);
+		timeoutval = (endtime.tv_sec * 1000000000 + endtime.tv_nsec) - (now.tv_sec * 1000000000 + now.tv_nsec);
 		if (timeoutval <= 0)
 		{
 			ereport(LOG,
@@ -2640,7 +2641,7 @@ CheckRADIUSAuth(Port *port)
 			closesocket(sock);
 			return STATUS_ERROR;
 		}
-		timeout.tv_sec = timeoutval / 1000000;
-		timeout.tv_usec = timeoutval % 1000000;
+		timeout.tv_sec = timeoutval / 1000000000;
+		timeout.tv_usec = (timeoutval % 1000000000) / 1000;
 
 		FD_ZERO(&fdset);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1a92ca1..5b8cee0 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -290,7 +290,7 @@ NON_EXEC_STATIC void AutoVacLauncherMain(int argc, char *argv[]) pg_attribute_no
 
 static Oid	do_start_worker(void);
 static void launcher_determine_sleep(bool canlaunch, bool recursing,
-						 struct timeval * nap);
+						 struct timespec * nap);
 static void launch_worker(TimestampTz now);
 static List *get_database_list(void);
 static void rebuild_database_list(Oid newdb);
@@ -577,7 +577,7 @@ AutoVacLauncherMain(int argc, char *argv[])
 	/* loop until shutdown request */
 	while (!got_SIGTERM)
 	{
-		struct timeval nap;
+		struct timespec nap;
 		TimestampTz current_time = 0;
 		bool		can_launch;
 		int			rc;
@@ -598,7 +598,7 @@ AutoVacLauncherMain(int argc, char *argv[])
 		 */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L));
+					   (nap.tv_sec * 1000L) + (nap.tv_nsec / 1000000L));
 
 		ResetLatch(MyLatch);
 
@@ -790,7 +790,7 @@ shutdown:
  * cause a long sleep, which will be interrupted when a worker exits.
  */
 static void
-launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval * nap)
+launcher_determine_sleep(bool canlaunch, bool recursing, struct timespec * nap)
 {
 	/*
 	 * We sleep until the next scheduled vacuum.  We trust that when the
@@ -801,7 +801,7 @@ launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval * nap)
 	if (!canlaunch)
 	{
 		nap->tv_sec = autovacuum_naptime;
-		nap->tv_usec = 0;
+		nap->tv_nsec = 0;
 	}
 	else if (!dlist_is_empty(&DatabaseList))
 	{
@@ -809,21 +809,21 @@ launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval * nap)
 		TimestampTz next_wakeup;
 		avl_dbase  *avdb;
 		long		secs;
 		int			usecs;
 
 		avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList);
 
 		next_wakeup = avdb->adl_next_worker;
 		TimestampDifference(current_time, next_wakeup, &secs, &usecs);
 
 		nap->tv_sec = secs;
-		nap->tv_usec = usecs;
+		nap->tv_nsec = usecs * 1000;	/* TimestampDifference returns usec */
 	}
 	else
 	{
 		/* list is empty, sleep for whole autovacuum_naptime seconds  */
 		nap->tv_sec = autovacuum_naptime;
-		nap->tv_usec = 0;
+		nap->tv_nsec = 0;
 	}
 
 	/*
@@ -836,7 +836,7 @@ launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval * nap)
 	 * We only recurse once.  rebuild_database_list should always return times
 	 * in the future, but it seems best not to trust too much on that.
 	 */
-	if (nap->tv_sec == 0 && nap->tv_usec == 0 && !recursing)
+	if (nap->tv_sec == 0 && nap->tv_nsec == 0 && !recursing)
 	{
 		rebuild_database_list(InvalidOid);
 		launcher_determine_sleep(canlaunch, true, nap);
@@ -844,10 +844,10 @@ launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval * nap)
 	}
 
 	/* The smallest time we'll allow the launcher to sleep. */
-	if (nap->tv_sec <= 0 && nap->tv_usec <= MIN_AUTOVAC_SLEEPTIME * 1000)
+	if (nap->tv_sec <= 0 && nap->tv_nsec <= MIN_AUTOVAC_SLEEPTIME * 1000000)
 	{
 		nap->tv_sec = 0;
-		nap->tv_usec = MIN_AUTOVAC_SLEEPTIME * 1000;
+		nap->tv_nsec = MIN_AUTOVAC_SLEEPTIME * 1000000;
 	}
 
 	/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d702a48..eb0668c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -725,7 +725,7 @@ static bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
-	struct timeval now;
+	struct timespec now;
 	double		elapsed_xlogs,
 				elapsed_time;
 
@@ -778,9 +778,9 @@ IsCheckpointOnSchedule(double progress)
 	/*
 	 * Check progress against time elapsed and checkpoint_timeout.
 	 */
-	gettimeofday(&now, NULL);
+	clock_gettime(CLOCK_REALTIME, &now);
 	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
-					now.tv_usec / 1000000.0) / CheckPointTimeout;
+					now.tv_nsec / 1000000000.0) / CheckPointTimeout;
 
 	if (progress < elapsed_time)
 	{
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eaf3f61..5277956 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -364,7 +364,7 @@ static volatile bool HaveCrashedWorker = false;
  * backend from the postmaster to that backend (via fork).
  */
 static unsigned int random_seed = 0;
-static struct timeval random_start_time;
+static struct timespec random_start_time;
 
 #ifdef USE_BONJOUR
 static DNSServiceRef bonjour_sdref = NULL;
@@ -1283,7 +1283,7 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	PgStartTime = GetCurrentTimestamp();
 	/* PostmasterRandom wants its own copy */
-	gettimeofday(&random_start_time, NULL);
+	clock_gettime(CLOCK_REALTIME, &random_start_time);
 
 	/*
 	 * We're ready to rock and roll...
@@ -4208,7 +4208,7 @@ BackendRun(Port *port)
 	 * a new random sequence in the random() library function.
 	 */
 	random_seed = 0;
-	random_start_time.tv_usec = 0;
+	random_start_time.tv_nsec = 0;
 	/* slightly hacky way to convert timestamptz into integers */
 	TimestampDifference(0, port->SessionStartTime, &secs, &usecs);
 	srandom((unsigned int) (MyProcPid ^ (usecs << 12) ^ secs));
@@ -5111,19 +5111,19 @@ PostmasterRandom(void)
 	{
 		do
 		{
-			struct timeval random_stop_time;
+			struct timespec random_stop_time;
 
-			gettimeofday(&random_stop_time, NULL);
+			clock_gettime(CLOCK_REALTIME, &random_stop_time);
 
 			/*
-			 * We are not sure how much precision is in tv_usec, so we swap
+			 * We are not sure how much precision is in tv_nsec, so we swap
 			 * the high and low 16 bits of 'random_stop_time' and XOR them
 			 * with 'random_start_time'. On the off chance that the result is
 			 * 0, we loop until it isn't.
 			 */
-			random_seed = random_start_time.tv_usec ^
-				((random_stop_time.tv_usec << 16) |
-				 ((random_stop_time.tv_usec >> 16) & 0xffff));
+			random_seed = random_start_time.tv_nsec ^
+				((random_stop_time.tv_nsec << 16) |
+				 ((random_stop_time.tv_nsec >> 16) & 0xffff));
 		}
 		while (random_seed == 0);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 98ccbbb..6d6accb 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4374,13 +4374,13 @@ get_stack_depth_rlimit(void)
 
 
 static struct rusage Save_r;
-static struct timeval Save_t;
+static struct timespec Save_t;
 
 void
 ResetUsage(void)
 {
 	getrusage(RUSAGE_SELF, &Save_r);
-	gettimeofday(&Save_t, NULL);
+	clock_gettime(CLOCK_REALTIME, &Save_t);
 }
 
 void
@@ -4389,17 +4389,17 @@ ShowUsage(const char *title)
 	StringInfoData str;
 	struct timeval user,
 				sys;
-	struct timeval elapse_t;
+	struct timespec elapse_t;
 	struct rusage r;
 
 	getrusage(RUSAGE_SELF, &r);
-	gettimeofday(&elapse_t, NULL);
+	clock_gettime(CLOCK_REALTIME, &elapse_t);
 	memcpy((char *) &user, (char *) &r.ru_utime, sizeof(user));
 	memcpy((char *) &sys, (char *) &r.ru_stime, sizeof(sys));
-	if (elapse_t.tv_usec < Save_t.tv_usec)
+	if (elapse_t.tv_nsec < Save_t.tv_nsec)
 	{
 		elapse_t.tv_sec--;
-		elapse_t.tv_usec += 1000000;
+		elapse_t.tv_nsec += 1000000000;
 	}
 	if (r.ru_utime.tv_usec < Save_r.ru_utime.tv_usec)
 	{
@@ -4425,7 +4425,7 @@ ShowUsage(const char *title)
 	appendStringInfo(&str,
 				"!\t%ld.%06ld elapsed %ld.%06ld user %ld.%06ld system sec\n",
 					 (long) (elapse_t.tv_sec - Save_t.tv_sec),
-					 (long) (elapse_t.tv_usec - Save_t.tv_usec),
+					 (long) ((elapse_t.tv_nsec - Save_t.tv_nsec) / 1000),
 					 (long) (r.ru_utime.tv_sec - Save_r.ru_utime.tv_sec),
 					 (long) (r.ru_utime.tv_usec - Save_r.ru_utime.tv_usec),
 					 (long) (r.ru_stime.tv_sec - Save_r.ru_stime.tv_sec),
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 224ee78..531aa69 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -147,7 +147,7 @@ static int	recursion_depth = 0;	/* to detect actual recursion */
  * Saved timeval and buffers for formatted timestamps that might be used by
  * both log_line_prefix and csv logs.
  */
-static struct timeval saved_timeval;
+static struct timespec saved_timeval;
 static bool saved_timeval_set = false;
 
 #define FORMATTED_TS_LEN 128
@@ -2232,7 +2232,7 @@ setup_formatted_log_time(void)
 
 	if (!saved_timeval_set)
 	{
-		gettimeofday(&saved_timeval, NULL);
+		clock_gettime(CLOCK_REALTIME_COARSE, &saved_timeval);
 		saved_timeval_set = true;
 	}
 
@@ -2249,7 +2249,7 @@ setup_formatted_log_time(void)
 				pg_localtime(&stamp_time, log_timezone));
 
 	/* 'paste' milliseconds into place... */
-	sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
+	sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_nsec / 1000000));
 	memcpy(formatted_log_time + 19, msbuf, 4);
 }
 
@@ -2480,12 +2480,12 @@ log_line_prefix(StringInfo buf, ErrorData *edata)
 
 					if (!saved_timeval_set)
 					{
-						gettimeofday(&saved_timeval, NULL);
+						clock_gettime(CLOCK_REALTIME_COARSE, &saved_timeval);
 						saved_timeval_set = true;
 					}
 
 					sprintf(strfbuf, "%ld.%03d", saved_timeval.tv_sec,
-							(int) (saved_timeval.tv_usec / 1000));
+							(int) (saved_timeval.tv_nsec / 1000000));
 
 					if (padding != 0)
 						appendStringInfo(buf, "%*s", padding, strfbuf);
diff --git a/src/backend/utils/misc/pg_rusage.c b/src/backend/utils/misc/pg_rusage.c
index 8781a38..5242dad 100644
--- a/src/backend/utils/misc/pg_rusage.c
+++ b/src/backend/utils/misc/pg_rusage.c
@@ -27,7 +27,7 @@ void
 pg_rusage_init(PGRUsage *ru0)
 {
 	getrusage(RUSAGE_SELF, &ru0->ru);
-	gettimeofday(&ru0->tv, NULL);
+	{
+		struct timespec ts;
+
+		clock_gettime(CLOCK_REALTIME, &ts);
+		ru0->tv.tv_sec = ts.tv_sec;
+		ru0->tv.tv_usec = ts.tv_nsec / 1000;	/* PGRUsage still stores usec */
+	}
 }
 
 /*
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 595eaff..8481ec5 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -17,6 +17,7 @@
 #include <string.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#include <time.h>
 #include <unistd.h>
 
 /* for ntohl/htonl */
@@ -445,14 +446,14 @@ int64
 feGetCurrentTimestamp(void)
 {
 	int64		result;
-	struct timeval tp;
+	struct timespec tp;
 
-	gettimeofday(&tp, NULL);
+	clock_gettime(CLOCK_REALTIME, &tp);
 
 	result = (int64) tp.tv_sec -
 		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
 
-	result = (result * USECS_PER_SEC) + tp.tv_usec;
+	result = (result * USECS_PER_SEC) + tp.tv_nsec / NSECS_PER_USEC;
 
 	return result;
 }
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ba9c276..9fc3a82 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1756,7 +1756,7 @@ dumpTableData_copy(Archive *fout, void *dcontext)
 		 * Further discussion ensued, and the proposal was dropped.
 		 *
 		 * For those people who want this feature, it can be implemented using
-		 * gettimeofday in each loop, calculating the time since last sleep,
+		 * clock_gettime in each loop, calculating the time since last sleep,
 		 * multiplying that by the sleep ratio, then if the result is more
 		 * than a preset 'minimum sleep time' (say 100ms), call the 'select'
 		 * function to sleep for a subsecond period ie.
@@ -1764,7 +1764,7 @@ dumpTableData_copy(Archive *fout, void *dcontext)
 		 * select(0, NULL, NULL, NULL, &tvi);
 		 *
 		 * This will return after the interval specified in the structure tvi.
-		 * Finally, call gettimeofday again to save the 'last sleep time'.
+		 * Finally, call clock_gettime again to save the 'last sleep time'.
 		 * ----------
 		 */
 	}
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 525b82ba..cc9b03e 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -535,7 +535,7 @@ static void
 GuessControlValues(void)
 {
 	uint64		sysidentifier;
-	struct timeval tv;
+	struct timespec tv;
 
 	/*
 	 * Set up a completely default set of pg_control values.
@@ -550,9 +550,9 @@ GuessControlValues(void)
 	 * Create a new unique installation identifier, since we can no longer use
 	 * any old XLOG records.  See notes in xlog.c about the algorithm.
 	 */
-	gettimeofday(&tv, NULL);
+	clock_gettime(CLOCK_REALTIME, &tv);
 	sysidentifier = ((uint64) tv.tv_sec) << 32;
-	sysidentifier |= ((uint64) tv.tv_usec) << 12;
+	sysidentifier |= ((uint64) (tv.tv_nsec / 1000)) << 12;
 	sysidentifier |= getpid() & 0xFFF;
 
 	ControlFile.system_identifier = sysidentifier;
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index c842762..ca8206e 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -26,7 +26,7 @@
 #define LABEL_FORMAT		"        %-30s"
 #define NA_FORMAT			"%20s"
 #define OPS_FORMAT			"%13.3f ops/sec  %6.0f usecs/op"
-#define USECS_SEC			1000000
+#define NSECS_SEC			1000000000
 
 /* These are macros to avoid timing the function call overhead. */
 #ifndef WIN32
@@ -34,7 +34,7 @@
 do { \
 	alarm_triggered = false; \
 	alarm(secs_per_test); \
-	gettimeofday(&start_t, NULL); \
+	clock_gettime(CLOCK_REALTIME, &start_t); \
 } while (0)
 #else
 /* WIN32 doesn't support alarm, so we create a thread and sleep there */
@@ -47,13 +47,13 @@ do { \
 		fprintf(stderr, "Cannot create thread for alarm\n"); \
 		exit(1); \
 	} \
-	gettimeofday(&start_t, NULL); \
+	clock_gettime(CLOCK_REALTIME, &start_t); \
 } while (0)
 #endif
 
 #define STOP_TIMER	\
 do { \
-	gettimeofday(&stop_t, NULL); \
+	clock_gettime(CLOCK_REALTIME, &stop_t); \
 	print_elapse(start_t, stop_t, ops); \
 } while (0)
 
@@ -65,7 +65,7 @@ static int	needs_unlink = 0;
 static char full_buf[XLOG_SEG_SIZE],
 		   *buf,
 		   *filename = FSYNC_FILENAME;
-static struct timeval start_t,
+static struct timespec start_t,
 			stop_t;
 static bool alarm_triggered = false;
 
@@ -89,7 +89,7 @@ static void signal_cleanup(int sig);
 #ifdef HAVE_FSYNC_WRITETHROUGH
 static int	pg_fsync_writethrough(int fd);
 #endif
-static void print_elapse(struct timeval start_t, struct timeval stop_t, int ops);
+static void print_elapse(struct timespec start_t, struct timespec stop_t, int ops);
 static void die(const char *str);
 
 
@@ -568,12 +568,12 @@ pg_fsync_writethrough(int fd)
  * print out the writes per second for tests
  */
 static void
-print_elapse(struct timeval start_t, struct timeval stop_t, int ops)
+print_elapse(struct timespec start_t, struct timespec stop_t, int ops)
 {
 	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
-	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
+	(stop_t.tv_nsec - start_t.tv_nsec) * 0.000000001;
 	double		per_second = ops / total_time;
-	double		avg_op_time_us = (total_time / ops) * USECS_SEC;
+	double		avg_op_time_us = (total_time / ops) * (NSECS_SEC / 1000);
 
 	printf(OPS_FORMAT "\n", per_second, avg_op_time_us);
 }
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 56c37d5..d10fe51 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -1772,7 +1772,7 @@ doCustom(TState *thread, CState *st, StatsData *agg)
 	instr_time	now;
 
 	/*
-	 * gettimeofday() isn't free, so we get the current timestamp lazily the
+	 * clock_gettime() isn't free, so we get the current timestamp lazily the
 	 * first time it's needed, and reuse the same value throughout this
 	 * function after that. This also ensures that e.g. the calculated latency
 	 * reported in the log file and in the totals are the same. Zero means
@@ -2247,11 +2247,11 @@ doLog(TState *thread, CState *st, instr_time *now,
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
 					st->id, st->cnt, st->use_file,
-					(long) now->tv_sec, (long) now->tv_usec);
+					(long) now->tv_sec, (long) (now->tv_nsec / 1000));
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) now->tv_sec, (long) now->tv_usec);
+					(long) now->tv_sec, (long) (now->tv_nsec / 1000));
 #else
 
 		/* On Windows, instr_time doesn't provide a timestamp anyway */
diff --git a/src/include/c.h b/src/include/c.h
index 4ab3f80..020fa30 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -138,6 +138,16 @@
  */
 #define gettext_noop(x) (x)
 
+#if defined(WIN32) || defined(__CYGWIN__)
+typedef struct timespec {
+	long tv_sec;
+	long tv_nsec;
+} timespec;
+
+#define CLOCK_REALTIME 1
+#define CLOCK_REALTIME_COARSE 2
+#endif
+
 
 /* ----------------------------------------------------------------
  *				Section 1: hacks to cope with non-ANSI C compilers
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index 68a41eb..31d39d8 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -105,6 +105,9 @@ typedef struct
 #define USECS_PER_MINUTE INT64CONST(60000000)
 #define USECS_PER_SEC	INT64CONST(1000000)
 
+#define NSECS_PER_SEC	INT64CONST(1000000000)
+#define NSECS_PER_USEC	INT64CONST(1000)
+
 /*
  * We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
  * Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b621ff2..e2efae7 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -231,6 +231,9 @@
 /* Define to 1 if you have the `gettimeofday' function. */
 #undef HAVE_GETTIMEOFDAY
 
+/* Define to 1 if you have the `clock_gettime' function. */
+#undef HAVE_CLOCK_GETTIME
+
 /* Define to 1 if you have the <gssapi/gssapi.h> header file. */
 #undef HAVE_GSSAPI_GSSAPI_H
 
diff --git a/src/include/port.h b/src/include/port.h
index b81fa4a..a8e80dc 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -319,6 +319,11 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
 /* Last parameter not used */
 extern int	gettimeofday(struct timeval * tp, struct timezone * tzp);
 #endif
+/* New versions of MinGW have clock_gettime, but old MinGW and MSVC don't */
+#ifndef HAVE_CLOCK_GETTIME
+/* First parameter is not used */
+extern int	clock_gettime(int clock_id, struct timespec * tp);
+#endif
 #else							/* !WIN32 */
 
 /*
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 16caf6e..ebcc976 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,7 +4,7 @@
  *	  portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
- * interval timing.  On Unix we use gettimeofday(), but on Windows that
+ * interval timing.  On Unix we use clock_gettime(), but on Windows that
  * gives a low-precision result so we must use QueryPerformanceCounter()
  * instead.  These macros also give some breathing room to use other
  * high-precision-timing APIs on yet other platforms.
@@ -54,24 +54,24 @@
 
 #ifndef WIN32
 
-#include <sys/time.h>
+#include <time.h>
 
-typedef struct timeval instr_time;
+typedef struct timespec instr_time;
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_usec == 0 && (t).tv_sec == 0)
+#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_usec = 0)
+#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	gettimeofday(&(t), NULL)
+#define INSTR_TIME_SET_CURRENT(t)	clock_gettime(CLOCK_REALTIME, &(t))
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
 		(x).tv_sec += (y).tv_sec; \
-		(x).tv_usec += (y).tv_usec; \
+		(x).tv_nsec += (y).tv_nsec; \
 		/* Normalize */ \
-		while ((x).tv_usec >= 1000000) \
+		while ((x).tv_nsec >= 1000000000) \
 		{ \
-			(x).tv_usec -= 1000000; \
+			(x).tv_nsec -= 1000000000; \
 			(x).tv_sec++; \
 		} \
 	} while (0)
@@ -79,11 +79,11 @@ typedef struct timeval instr_time;
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
 		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_usec -= (y).tv_usec; \
+		(x).tv_nsec -= (y).tv_nsec; \
 		/* Normalize */ \
-		while ((x).tv_usec < 0) \
+		while ((x).tv_nsec < 0) \
 		{ \
-			(x).tv_usec += 1000000; \
+			(x).tv_nsec += 1000000000; \
 			(x).tv_sec--; \
 		} \
 	} while (0)
@@ -91,28 +91,28 @@ typedef struct timeval instr_time;
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
 		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_usec += (y).tv_usec - (z).tv_usec; \
+		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
 		/* Normalize after each add to avoid overflow/underflow of tv_usec */ \
-		while ((x).tv_usec < 0) \
+		while ((x).tv_nsec < 0) \
 		{ \
-			(x).tv_usec += 1000000; \
+			(x).tv_nsec += 1000000000; \
 			(x).tv_sec--; \
 		} \
-		while ((x).tv_usec >= 1000000) \
+		while ((x).tv_nsec >= 1000000000) \
 		{ \
-			(x).tv_usec -= 1000000; \
+			(x).tv_nsec -= 1000000000; \
 			(x).tv_sec++; \
 		} \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 1000000.0)
+	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_usec) / 1000.0)
+	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
 #else							/* WIN32 */
 
 typedef LARGE_INTEGER instr_time;
diff --git a/src/include/utils/pg_rusage.h b/src/include/utils/pg_rusage.h
index 7c264eb..9ca9436 100644
--- a/src/include/utils/pg_rusage.h
+++ b/src/include/utils/pg_rusage.h
@@ -15,6 +15,7 @@
 #define PG_RUSAGE_H
 
 #include <sys/time.h>
+#include <time.h>
 
 #ifdef HAVE_SYS_RESOURCE_H
 #include <sys/resource.h>
diff --git a/src/port/clock_gettime.c b/src/port/clock_gettime.c
new file mode 100644
index 0000000..05641ad
--- /dev/null
+++ b/src/port/clock_gettime.c
@@ -0,0 +1,117 @@
+/*
+ * clock_gettime.c
+ *	  Win32 clock_gettime() replacement
+ *
+ * src/port/clock_gettime.c
+ *
+ * Copyright (c) 2003 SRA, Inc.
+ * Copyright (c) 2003 SKC, Inc.
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "c.h"
+
+#include <time.h>
+
+/* FILETIME of Jan 1 1970 00:00:00, the PostgreSQL epoch */
+static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
+
+/*
+ * FILETIME represents the number of 100-nanosecond intervals since
+ * January 1, 1601 (UTC).
+ */
+#define FILETIME_UNITS_PER_SEC	10000000L
+#define NSECS_PER_FILETIME_UNIT 100
+
+/*
+ * Both GetSystemTimeAsFileTime and GetSystemTimePreciseAsFileTime share a
+ * signature, so we can just store a pointer to whichever we find. This
+ * is the pointer's type.
+ */
+typedef		VOID(WINAPI * PgGetSystemTimeFn) (LPFILETIME);
+
+/* One-time initializer function, must match that signature. */
+static void WINAPI init_clock_gettime(LPFILETIME lpSystemTimeAsFileTime);
+
+/* Storage for the function we pick at runtime */
+static PgGetSystemTimeFn pg_get_system_time = &init_clock_gettime;
+
+/*
+ * One time initializer.  Determine whether GetSystemTimePreciseAsFileTime
+ * is available and if so, plan to use it; if not, fall back to
+ * GetSystemTimeAsFileTime.
+ */
+static void WINAPI
+init_clock_gettime(LPFILETIME lpSystemTimeAsFileTime)
+{
+	/*
+	 * Because it's guaranteed that kernel32.dll will be linked into our
+	 * address space already, we don't need to LoadLibrary it and worry about
+	 * closing it afterwards, so we're not using Pg's dlopen/dlsym() wrapper.
+	 *
+	 * We'll just look up the address of GetSystemTimePreciseAsFileTime if
+	 * present.
+	 *
+	 * While we could look up the Windows version and skip this on Windows
+	 * versions below Windows 8 / Windows Server 2012 there isn't much point,
+	 * and determining the Windows version is itself somewhat Windows
+	 * version and development SDK specific...
+	 */
+	pg_get_system_time = (PgGetSystemTimeFn) GetProcAddress(
+									   GetModuleHandle(TEXT("kernel32.dll")),
+										   "GetSystemTimePreciseAsFileTime");
+	if (pg_get_system_time == NULL)
+	{
+		/*
+		 * The expected error from GetLastError() is ERROR_PROC_NOT_FOUND, if
+		 * the function isn't present. No other error should occur.
+		 *
+		 * We can't report an error here because this might be running in
+		 * frontend code; and even if we're in the backend, it's too early to
+		 * elog(...) if we get some unexpected error.  Also, it's not a
+		 * serious problem, so just silently fall back to
+		 * GetSystemTimeAsFileTime irrespective of why the failure occurred.
+		 */
+		pg_get_system_time = &GetSystemTimeAsFileTime;
+	}
+
+	(*pg_get_system_time) (lpSystemTimeAsFileTime);
+}
+
+/*
+ * Note: this function is not for Win32 high precision timing purposes. See
+ * elapsed_time().
+ */
+int
+clock_gettime(int clock_id, struct timespec * tp)
+{
+	FILETIME	file_time;
+	ULARGE_INTEGER ularge;
+
+	(*pg_get_system_time) (&file_time);
+	ularge.LowPart = file_time.dwLowDateTime;
+	ularge.HighPart = file_time.dwHighDateTime;
+
+	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / FILETIME_UNITS_PER_SEC);
+	tp->tv_nsec = (long) (((ularge.QuadPart - epoch) % FILETIME_UNITS_PER_SEC)
+						  * NSECS_PER_FILETIME_UNIT);
+
+	return 0;
+}
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index db2b559..d31aef5 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -11,6 +11,7 @@
 #include <windows.h>
 #endif
 #include <sys/time.h>
+#include <time.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -698,7 +699,7 @@ try_complete_step(Step *step, int flags)
 {
 	PGconn	   *conn = conns[1 + step->session];
 	fd_set		read_set;
-	struct timeval start_time;
+	struct timespec start_time;
 	struct timeval timeout;
 	int			sock = PQsocket(conn);
 	int			ret;
@@ -711,7 +712,7 @@ try_complete_step(Step *step, int flags)
 		exit_nicely();
 	}
 
-	gettimeofday(&start_time, NULL);
+	clock_gettime(CLOCK_REALTIME, &start_time);
 	FD_ZERO(&read_set);
 
 	while (PQisBusy(conn))
@@ -730,7 +731,7 @@ try_complete_step(Step *step, int flags)
 		}
 		else if (ret == 0)		/* select() timeout: check for lock wait */
 		{
-			struct timeval current_time;
+			struct timespec current_time;
 			int64		td;
 
 			/* If it's OK for the step to block, check whether it has. */
@@ -762,10 +763,10 @@ try_complete_step(Step *step, int flags)
 			}
 
 			/* Figure out how long we've been waiting for this step. */
-			gettimeofday(&current_time, NULL);
+			clock_gettime(CLOCK_REALTIME, &current_time);
 			td = (int64) current_time.tv_sec - (int64) start_time.tv_sec;
-			td *= USECS_PER_SEC;
-			td += (int64) current_time.tv_usec - (int64) start_time.tv_usec;
+			td *= NSECS_PER_SEC;
+			td += (int64) current_time.tv_nsec - (int64) start_time.tv_nsec;
 
 			/*
 			 * After 60 seconds, try to cancel the query.
@@ -776,7 +777,7 @@ try_complete_step(Step *step, int flags)
 			 * presumably lead to this permutation failing, but remaining
 			 * permutations and tests should still be OK.
 			 */
-			if (td > 60 * USECS_PER_SEC && !canceled)
+			if (td > 60 * NSECS_PER_SEC && !canceled)
 			{
 				PGcancel   *cancel = PQgetCancel(conn);
 
@@ -799,7 +800,7 @@ try_complete_step(Step *step, int flags)
 			 * later tests to fail.  That stinks, but it's better than waiting
 			 * forever for the server to respond to the cancel.
 			 */
-			if (td > 75 * USECS_PER_SEC)
+			if (td > 75 * NSECS_PER_SEC)
 			{
 				fprintf(stderr, "step %s timed out after 75 seconds\n",
 						step->name);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 93dfd24..1b16684 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -88,8 +88,8 @@ sub mkvcbuild
 	$solution = CreateSolution($vsVersion, $config);
 
 	our @pgportfiles = qw(
-	  chklocale.c crypt.c fls.c fseeko.c getrusage.c inet_aton.c random.c
-	  srandom.c getaddrinfo.c gettimeofday.c inet_net_ntop.c kill.c open.c
+	  chklocale.c clock_gettime.c crypt.c fls.c fseeko.c getrusage.c inet_aton.c
+	  random.c srandom.c getaddrinfo.c gettimeofday.c inet_net_ntop.c kill.c open.c
 	  erand48.c snprintf.c strlcat.c strlcpy.c dirmod.c noblock.c path.c
 	  pgcheckdir.c pgmkdirp.c pgsleep.c pgstrcasecmp.c pqsignal.c
 	  mkdtemp.c qsort.c qsort_arg.c quotes.c system.c
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Haribabu Kommi (#22)
Re: gettimeofday is at the end of its usefulness?

Haribabu Kommi <kommi.haribabu@gmail.com> writes:

Attached a patch that replaces most of the gettimeofday function calls,
except timeofday(user callable) and GetCurrentTimestamp functions.
Didn't add any configure checks in case if the clock_gettime function is
not available, the fallback logic to gettimeofday function call.

Well, of course, configure is the hard part.

I got interested in this area again pursuant to a question from Joel
Jacobson, and looked around to see if things had changed any since
2014. One pleasant surprise is that Apple got around to implementing
clock_gettime() as of the current macOS release (10.12 "Sierra").
That means that pretty much all interesting platforms now have
clock_gettime(), which removes one possible objection to starting to
use it. However, it seems like there is not a lot of commonality
to the best "clockid" to use.

In theory, according to the POSIX spec, CLOCK_MONOTONIC would be what we
want to use for time interval measurement (EXPLAIN ANALYZE), since that
would be impervious to possible clock setting changes. But some
implementations might only optimize the more common CLOCK_REALTIME,
and I found that there are a lot of platform-specific clock IDs that
we might want to consider.

On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
all take about 40ns. Of course gettimeofday() only has 1us resolution,
but the other two have perhaps 10ns resolution (I get no duplicate
readings in a tight loop). Other documented clockids include
CLOCK_REALTIME_COARSE: about 10ns to read, but only 1ms resolution
CLOCK_MONOTONIC_COARSE: about 12ns to read, but only 1ms resolution
CLOCK_MONOTONIC_RAW: full resolution but very slow, ~145ns to read
So CLOCK_MONOTONIC seems to be the thing to use here. It won't buy
us anything speed-wise but the extra resolution will be nice.
However, we need to do more research to see if this holds true on
other popular distros.
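
For reference, the no-duplicate-readings test is essentially a tight
loop of this shape (a minimal sketch, not the exact test program):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec prev,
				cur;
	long		dups = 0;
	long		i;

	clock_gettime(CLOCK_MONOTONIC, &prev);
	for (i = 0; i < 100000000; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &cur);
		/* a repeated reading means resolution is coarser than call cost */
		if (cur.tv_sec == prev.tv_sec && cur.tv_nsec == prev.tv_nsec)
			dups++;
		prev = cur;
	}
	printf("%ld duplicate readings\n", dups);
	return 0;
}

Dividing the loop's wall-clock run time by the iteration count gives
the per-call cost.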

On macOS (10.12.2, 2.7GHz x86_64), clock_gettime(CLOCK_REALTIME)
is actually a shade faster than gettimeofday: 40ns versus 46ns.
But it's only giving 1us resolution, no better than gettimeofday.
CLOCK_MONOTONIC is also 1us and it takes 75ns to read. But there's
a CLOCK_MONOTONIC_RAW that takes 44ns to read and seems to offer
full precision -- no duplicate readings in a tight loop. There's
also CLOCK_MONOTONIC_RAW_APPROX which can be read in 23ns but
the resolution is only around half an ms.

I also tried FreeBSD 11.0 on another Mac (2.3GHz x86_64),
and found that gettimeofday as well as basically all their
clock_gettime variants run in 27 to 28 ns; and clock_gettime
reliably delivers full precision, except for CLOCK_SECOND which
is intentionally truncated to 1s precision. So there would be
no need to work with anything but CLOCK_MONOTONIC here.

However, it seems that these impressive results date back only to
June 2012, cf
https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
and at least as of that commit, only x86 and x86_64 had the fast
clock_gettime code. Older FreeBSD, or FreeBSD on another architecture,
is likely to be a lot worse. But I lack an installation to try.

I also tried OpenBSD 6.0 on that same Mac, and got pretty horrid
results: gettimeofday, CLOCK_REALTIME, and CLOCK_MONOTONIC all
take about 613ns to read. Ouch. And so does time(NULL); double
ouch. Evidently there's no optimization on this platform and
what we're seeing is the minimum cost for a kernel call. Still,
we do get better precision from clock_gettime than gettimeofday,
so we might as well switch.

So it seems like the configure support we'd need is to detect
whether clock_gettime is available (note on Linux there's also
a library requirement, -lrt), and we would also need a way to
provide a platform-specific choice of clockid; we at least need
enough smarts to use CLOCK_MONOTONIC_RAW on macOS.
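
For concreteness, the platform-specific choice could be centralized in
a single #if chain, along these lines (PG_INSTR_CLOCK is just a
placeholder name here):

#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK	CLOCK_MONOTONIC_RAW		/* faster and finer on macOS */
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK	CLOCK_MONOTONIC			/* impervious to clock setting */
#else
#define PG_INSTR_CLOCK	CLOCK_REALTIME			/* the only POSIX-required clock */
#endif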

regards, tom lane

#24Greg Stark
stark@mit.edu
In reply to: Tom Lane (#23)
Re: gettimeofday is at the end of its usefulness?

On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

So it seems like the configure support we'd need is to detect
whether clock_gettime is available (note on Linux there's also
a library requirement, -lrt), and we would also need a way to
provide a platform-specific choice of clockid; we at least need
enough smarts to use CLOCK_MONOTONIC_RAW on macOS.

This seems like something that really should be checked at runtime. It's
very specific to the kernel you're running on, not the build
environment, and it can hopefully be measured in only a second or even a
fraction of a second. The only problem would be if other things running
on the system made the test results unpredictable, so that you had a small
chance of getting a very suboptimal choice and re-rolling the dice each time
you restarted...

#25Andres Freund
andres@anarazel.de
In reply to: Greg Stark (#24)
Re: gettimeofday is at the end of its usefulness?

On 2016-12-27 01:35:05 +0000, Greg Stark wrote:

On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

So it seems like the configure support we'd need is to detect
whether clock_gettime is available (note on Linux there's also
a library requirement, -lrt), and we would also need a way to
provide a platform-specific choice of clockid; we at least need
enough smarts to use CLOCK_MONOTONIC_RAW on macOS.

This seems like something that really should be checked at runtime. It's
very specific to the kernel you're running on, not the build
environment, and it can hopefully be measured in only a second or even a
fraction of a second. The only problem would be if other things running
on the system made the test results unpredictable, so that you had a small
chance of getting a very suboptimal choice and re-rolling the dice each time
you restarted...

I'm pretty strongly against doing performance measurements at
startup. Both the delay and the potential for differing test results
seem like pretty bad consequences.

Greetings,

Andres Freund

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#25)
Re: gettimeofday is at the end of its usefulness?

Andres Freund <andres@anarazel.de> writes:

On 2016-12-27 01:35:05 +0000, Greg Stark wrote:

On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

So it seems like the configure support we'd need is to detect
whether clock_gettime is available (note on Linux there's also
a library requirement, -lrt), and we would also need a way to
provide a platform-specific choice of clockid; we at least need
enough smarts to use CLOCK_MONOTONIC_RAW on macOS.

This seems like something that really should be checked at runtime.

I'm pretty strongly against doing performance measurements at
startup. Both the delay and the potential for differing test results
seem like pretty bad consequences.

Yeah, that doesn't sound great to me either. And I don't entirely
see the point, at least not with what we know now. I am a bit concerned
that we'll find out there are popular platforms where clock_gettime
compiles but fails with ENOSYS, or some similarly unhelpful behavior.
But we won't find that out if we don't try.
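
(If that ever proved to be a live problem, a one-time runtime fallback
would be simple enough. A minimal sketch, assuming <time.h> and
<sys/time.h>, with all names illustrative:

static int	clock_gettime_usable = -1;	/* -1 = not yet tested */

static void
pg_instr_current(struct timespec *ts)
{
	if (clock_gettime_usable != 0 &&
		clock_gettime(CLOCK_MONOTONIC, ts) == 0)
		clock_gettime_usable = 1;	/* it works; keep using it */
	else
	{
		struct timeval tv;

		clock_gettime_usable = 0;	/* e.g. ENOSYS: fall back for good */
		gettimeofday(&tv, NULL);
		ts->tv_sec = tv.tv_sec;
		ts->tv_nsec = tv.tv_usec * 1000;
	}
}

But it's not worth building that until some platform demonstrates the
need.)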

regards, tom lane

#27Florian Weimer
fw@deneb.enyo.de
In reply to: Tom Lane (#23)
Re: gettimeofday is at the end of its usefulness?

* Tom Lane:

On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
all take about 40ns. Of course gettimeofday() only has 1us resolution,
but the other two have perhaps 10ns resolution (I get no duplicate
readings in a tight loop). Other documented clockids include
CLOCK_REALTIME_COARSE: about 10ns to read, but only 1ms resolution
CLOCK_MONOTONIC_COARSE: about 12ns to read, but only 1ms resolution
CLOCK_MONOTONIC_RAW: full resolution but very slow, ~145ns to read
So CLOCK_MONOTONIC seems to be the thing to use here. It won't buy
us anything speed-wise but the extra resolution will be nice.
However, we need to do more research to see if this holds true on
other popular distros.

Isn't this very specific to kernel and glibc versions, depending on
things like CONFIG_HZ settings and what level of vDSO support has been
backported?

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Weimer (#27)
Re: gettimeofday is at the end of its usefulness?

Florian Weimer <fw@deneb.enyo.de> writes:

* Tom Lane:

On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
all take about 40ns. Of course gettimeofday() only has 1us resolution,
but the other two have perhaps 10ns resolution (I get no duplicate
readings in a tight loop).

Isn't this very specific to kernel and glibc versions, depending on
things like CONFIG_HZ settings and what level of vDSO support has been
backported?

No doubt, but I have yet to find a platform where clock_gettime() exists
but performs worse than gettimeofday(). Do you know of one?

regards, tom lane

#29Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#23)
Re: gettimeofday is at the end of its usefulness?

On Tue, Dec 27, 2016 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I also tried FreeBSD 11.0 on another Mac (2.3GHz x86_64),
and found that gettimeofday as well as basically all their
clock_gettime variants run in 27 to 28 ns; and clock_gettime
reliably delivers full precision, except for CLOCK_SECOND which
is intentionally truncated to 1s precision. So there would be
no need to work with anything but CLOCK_MONOTONIC here.

However, it seems that these impressive results date back only to
June 2012, cf
https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
and at least as of that commit, only x86 and x86_64 had the fast
clock_gettime code. Older FreeBSD, or FreeBSD on another architecture,
is likely to be a lot worse. But I lack an installation to try.

That commit is in every 'production' and 'legacy' release of
FreeBSD[1], meaning as far back as 9.3 (expected to be EoL in the next
few days), because it landed in 9.2 (EoL). ARM support landed in
FreeBSD 11.0[2]. That leaves the following architectures without
fast-path time functions:

macaque:freebsd munro$ git grep 'trivial-vdso_tc.c'
lib/libc/mips/sys/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpc/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpc64/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpcspe/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/riscv/sys/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/sparc64/Makefile.inc:SRCS+= trivial-vdso_tc.c

[1]: https://www.freebsd.org/releases/
[2]: https://github.com/freebsd/freebsd/commit/80e8626b434515d16b3576174438526755336810

--
Thomas Munro
http://www.enterprisedb.com

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#29)
Re: gettimeofday is at the end of its usefulness?

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Tue, Dec 27, 2016 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, it seems that these impressive results date back only to
June 2012, cf
https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
and at least as of that commit, only x86 and x86_64 had the fast
clock_gettime code. Older FreeBSD, or FreeBSD on another architecture,
is likely to be a lot worse. But I lack an installation to try.

That commit is in every 'production' and 'legacy' release of
FreeBSD[1], meaning as far back as 9.3 (expected to be EoL in the next
few days), because it landed in 9.2 (EoL).

I'm unclear on whether there's any significant number of people running
out-of-support *BSD releases. If it's not something we have to worry
about, fine.

That leaves the following architectures without
fast-path time functions:

macaque:freebsd munro$ git grep 'trivial-vdso_tc.c'
lib/libc/mips/sys/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpc/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpc64/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/powerpcspe/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/riscv/sys/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/sparc64/Makefile.inc:SRCS+= trivial-vdso_tc.c

Yeah, I just finished getting results from FreeBSD 10.3 on PPC
(1.33GHz G4 laptop): gettimeofday takes about 1180 ns and clock_gettime
about 1200 ns. That difference seems to be repeatable, but since it's
only 2% I'm not too fussed about it. Interestingly, it's very easy
to tell that it is entering the kernel, because time(1) shows a
significant fraction of system time:

$ time ./testclock
0 bogus readings
100000000 distinct readings
117.96 real 26.80 user 90.31 sys

The same test on platforms with vDSO support shows zero system time.

regards, tom lane

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Haribabu Kommi (#22)
1 attachment(s)
Re: gettimeofday is at the end of its usefulness?

Haribabu Kommi <kommi.haribabu@gmail.com> writes:

Attached a patch that replaces most of the gettimeofday function calls,
except timeofday(user callable) and GetCurrentTimestamp functions.

I looked at this for awhile and could not convince myself that it's
a good idea. Trying to do s/gettimeofday/clock_gettime/g is not going
to do much for us except create portability headaches. According
to my tests, clock_gettime is not noticeably faster than gettimeofday
on any platform, except that if you use nonstandard clockids like
CLOCK_REALTIME_COARSE then on *some* platforms it's a little bit quicker,
at the cost of being a great deal less precise. But we'd have to research
the existence and effects of nonstandard clockids on every platform.
So AFAICS the only clear advantage to switching is the extra precision
available from clock_gettime.

But ... most of the places you've touched in this patch have neither any
need for sub-microsecond precision nor any great need to worry about
shaving a few ns off the time taken by the call. As far as I can find,
the only place where it's actually worth our trouble to deal with it is
instr_time.h (ie, EXPLAIN ANALYZE and a few other uses).

So I think we should do something more like the attached.

One issue I did not resolve in this WIP patch is what to do with this
gem of abstraction violation in pgbench:

/* no, print raw transactions */
#ifndef WIN32

/* This is more than we really ought to know about instr_time */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
st->id, st->cnt, st->use_file,
(long) now->tv_sec, (long) now->tv_usec);
else
fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
st->id, st->cnt, latency, st->use_file,
(long) now->tv_sec, (long) now->tv_usec);
#else

/* On Windows, instr_time doesn't provide a timestamp anyway */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d 0 0",
st->id, st->cnt, st->use_file);
else
fprintf(logfile, "%d " INT64_FORMAT " %.0f %d 0 0",
st->id, st->cnt, latency, st->use_file);
#endif

We could either rip out the non-Windows code path entirely, or do
something about providing an honest elapsed-time measurement, perhaps
by doing INSTR_TIME_GET_DOUBLE() on the diff from run start to "now".
Given that we're calling fprintf, I doubt that the extra arithmetic
needed for that is a big problem.
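
Roughly like this, say (a sketch only; it assumes the thread's run-start
time is at hand as thread->start_time, of instr_time type):

		instr_time	elapsed = *now;

		INSTR_TIME_SUBTRACT(elapsed, thread->start_time);
		if (skipped)
			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %.6f",
					st->id, st->cnt, st->use_file,
					INSTR_TIME_GET_DOUBLE(elapsed));
		else
			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %.6f",
					st->id, st->cnt, latency, st->use_file,
					INSTR_TIME_GET_DOUBLE(elapsed));

That would behave the same way on Windows, where instr_time is a
performance-counter reading rather than a timestamp.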

regards, tom lane

Attachments:

clock_gettime_for_EXPLAIN_1.patchtext/x-diff; charset=us-ascii; name=clock_gettime_for_EXPLAIN_1.patchDownload
diff --git a/configure b/configure
index 0f143a0..e5dd6fb 100755
*** a/configure
--- b/configure
*************** if test "$ac_res" != no; then :
*** 9055,9060 ****
--- 9055,9116 ----
  
  fi
  
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing clock_gettime" >&5
+ $as_echo_n "checking for library containing clock_gettime... " >&6; }
+ if ${ac_cv_search_clock_gettime+:} false; then :
+   $as_echo_n "(cached) " >&6
+ else
+   ac_func_search_save_LIBS=$LIBS
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ /* end confdefs.h.  */
+ 
+ /* Override any GCC internal prototype to avoid an error.
+    Use char because int might match the return type of a GCC
+    builtin and then its argument prototype would still apply.  */
+ #ifdef __cplusplus
+ extern "C"
+ #endif
+ char clock_gettime ();
+ int
+ main ()
+ {
+ return clock_gettime ();
+   ;
+   return 0;
+ }
+ _ACEOF
+ for ac_lib in '' rt posix4; do
+   if test -z "$ac_lib"; then
+     ac_res="none required"
+   else
+     ac_res=-l$ac_lib
+     LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+   fi
+   if ac_fn_c_try_link "$LINENO"; then :
+   ac_cv_search_clock_gettime=$ac_res
+ fi
+ rm -f core conftest.err conftest.$ac_objext \
+     conftest$ac_exeext
+   if ${ac_cv_search_clock_gettime+:} false; then :
+   break
+ fi
+ done
+ if ${ac_cv_search_clock_gettime+:} false; then :
+ 
+ else
+   ac_cv_search_clock_gettime=no
+ fi
+ rm conftest.$ac_ext
+ LIBS=$ac_func_search_save_LIBS
+ fi
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_clock_gettime" >&5
+ $as_echo "$ac_cv_search_clock_gettime" >&6; }
+ ac_res=$ac_cv_search_clock_gettime
+ if test "$ac_res" != no; then :
+   test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+ 
+ fi
+ 
  # Solaris:
  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
  $as_echo_n "checking for library containing fdatasync... " >&6; }
*************** fi
*** 12520,12526 ****
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
--- 12576,12582 ----
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index b9831bc..77aa3b4 100644
*** a/configure.in
--- b/configure.in
*************** AC_SEARCH_LIBS(getopt_long, [getopt gnug
*** 1016,1021 ****
--- 1016,1022 ----
  AC_SEARCH_LIBS(crypt, crypt)
  AC_SEARCH_LIBS(shm_open, rt)
  AC_SEARCH_LIBS(shm_unlink, rt)
+ AC_SEARCH_LIBS(clock_gettime, [rt posix4])
  # Solaris:
  AC_SEARCH_LIBS(fdatasync, [rt posix4])
  # Required for thread_test.c on Solaris
*************** PGAC_FUNC_WCSTOMBS_L
*** 1415,1421 ****
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
  
  AC_REPLACE_FUNCS(fseeko)
  case $host_os in
--- 1416,1422 ----
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
  
  AC_REPLACE_FUNCS(fseeko)
  case $host_os in
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 42a3fc8..b9dfdd4 100644
*** a/src/include/pg_config.h.in
--- b/src/include/pg_config.h.in
***************
*** 105,110 ****
--- 105,113 ----
  /* Define to 1 if you have the `class' function. */
  #undef HAVE_CLASS
  
+ /* Define to 1 if you have the `clock_gettime' function. */
+ #undef HAVE_CLOCK_GETTIME
+ 
  /* Define to 1 if you have the <crtdefs.h> header file. */
  #undef HAVE_CRTDEFS_H
  
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index ceb8b79..199668c 100644
*** a/src/include/pg_config.h.win32
--- b/src/include/pg_config.h.win32
***************
*** 75,80 ****
--- 75,83 ----
  /* Define to 1 if you have the `class' function. */
  /* #undef HAVE_CLASS */
  
+ /* Define to 1 if you have the `clock_gettime' function. */
+ /* #undef HAVE_CLOCK_GETTIME */
+ 
  /* Define to 1 if you have the `crypt' function. */
  /* #undef HAVE_CRYPT */
  
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 16caf6e..b10d6f0 100644
*** a/src/include/portability/instr_time.h
--- b/src/include/portability/instr_time.h
***************
*** 4,13 ****
   *	  portable high-precision interval timing
   *
   * This file provides an abstraction layer to hide portability issues in
!  * interval timing.  On Unix we use gettimeofday(), but on Windows that
!  * gives a low-precision result so we must use QueryPerformanceCounter()
!  * instead.  These macros also give some breathing room to use other
!  * high-precision-timing APIs on yet other platforms.
   *
   * The basic data type is instr_time, which all callers should treat as an
   * opaque typedef.  instr_time can store either an absolute time (of
--- 4,13 ----
   *	  portable high-precision interval timing
   *
   * This file provides an abstraction layer to hide portability issues in
!  * interval timing.  On Unix we use clock_gettime() if available, else
!  * gettimeofday().  On Windows, gettimeofday() gives a low-precision result
!  * so we must use QueryPerformanceCounter() instead.  These macros also give
!  * some breathing room to use other high-precision-timing APIs.
   *
   * The basic data type is instr_time, which all callers should treat as an
   * opaque typedef.  instr_time can store either an absolute time (of
***************
*** 54,59 ****
--- 54,147 ----
  
  #ifndef WIN32
  
+ #ifdef HAVE_CLOCK_GETTIME
+ 
+ /* Use clock_gettime() */
+ 
+ #include <time.h>
+ 
+ /*
+  * The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
+  * since that will give reliable interval timing even in the face of changes
+  * to the system clock.  However, POSIX doesn't require implementations to
+  * provide anything except CLOCK_REALTIME, so fall back to that if we don't
+  * find CLOCK_MONOTONIC.
+  *
+  * Also, some implementations have nonstandard clockids with better properties
+  * than CLOCK_MONOTONIC.  In particular, as of macOS 10.12, Apple provides
+  * CLOCK_MONOTONIC_RAW which is both faster to read and higher resolution than
+  * their version of CLOCK_MONOTONIC.
+  */
+ #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
+ #define PG_INSTR_CLOCK	CLOCK_MONOTONIC_RAW
+ #elif defined(CLOCK_MONOTONIC)
+ #define PG_INSTR_CLOCK	CLOCK_MONOTONIC
+ #else
+ #define PG_INSTR_CLOCK	CLOCK_REALTIME
+ #endif
+ 
+ typedef struct timespec instr_time;
+ 
+ #define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+ 
+ #define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+ 
+ #define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+ 
+ #define INSTR_TIME_ADD(x,y) \
+ 	do { \
+ 		(x).tv_sec += (y).tv_sec; \
+ 		(x).tv_nsec += (y).tv_nsec; \
+ 		/* Normalize */ \
+ 		while ((x).tv_nsec >= 1000000000) \
+ 		{ \
+ 			(x).tv_nsec -= 1000000000; \
+ 			(x).tv_sec++; \
+ 		} \
+ 	} while (0)
+ 
+ #define INSTR_TIME_SUBTRACT(x,y) \
+ 	do { \
+ 		(x).tv_sec -= (y).tv_sec; \
+ 		(x).tv_nsec -= (y).tv_nsec; \
+ 		/* Normalize */ \
+ 		while ((x).tv_nsec < 0) \
+ 		{ \
+ 			(x).tv_nsec += 1000000000; \
+ 			(x).tv_sec--; \
+ 		} \
+ 	} while (0)
+ 
+ #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
+ 	do { \
+ 		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
+ 		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
+ 		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
+ 		while ((x).tv_nsec < 0) \
+ 		{ \
+ 			(x).tv_nsec += 1000000000; \
+ 			(x).tv_sec--; \
+ 		} \
+ 		while ((x).tv_nsec >= 1000000000) \
+ 		{ \
+ 			(x).tv_nsec -= 1000000000; \
+ 			(x).tv_sec++; \
+ 		} \
+ 	} while (0)
+ 
+ #define INSTR_TIME_GET_DOUBLE(t) \
+ 	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ 
+ #define INSTR_TIME_GET_MILLISEC(t) \
+ 	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ 
+ #define INSTR_TIME_GET_MICROSEC(t) \
+ 	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ 
+ #else							/* !HAVE_CLOCK_GETTIME */
+ 
+ /* Use gettimeofday() */
+ 
  #include <sys/time.h>
  
  typedef struct timeval instr_time;
*************** typedef struct timeval instr_time;
*** 113,120 ****
--- 201,213 ----
  
  #define INSTR_TIME_GET_MICROSEC(t) \
  	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+ 
+ #endif   /* HAVE_CLOCK_GETTIME */
+ 
  #else							/* WIN32 */
  
+ /* Use QueryPerformanceCounter() */
+ 
  typedef LARGE_INTEGER instr_time;
  
  #define INSTR_TIME_IS_ZERO(t)	((t).QuadPart == 0)
*************** GetTimerFrequency(void)
*** 149,154 ****
--- 242,248 ----
  	QueryPerformanceFrequency(&f);
  	return (double) f.QuadPart;
  }
+ 
  #endif   /* WIN32 */
  
  #endif   /* INSTR_TIME_H */
#32Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Tom Lane (#31)
Re: gettimeofday is at the end of its usefulness?

On Fri, Dec 30, 2016 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Haribabu Kommi <kommi.haribabu@gmail.com> writes:

Attached a patch that replaces most of the gettimeofday function calls,
except timeofday(user callable) and GetCurrentTimestamp functions.

I looked at this for awhile and could not convince myself that it's
a good idea. Trying to do s/gettimeofday/clock_gettime/g is not going
to do much for us except create portability headaches. According
to my tests, clock_gettime is not noticeably faster than gettimeofday
on any platform, except that if you use nonstandard clockids like
CLOCK_REALTIME_COARSE then on *some* platforms it's a little bit quicker,
at the cost of being a great deal less precise. But we'd have to research
the existence and effects of nonstandard clockids on every platform.
So AFAICS the only clear advantage to switching is the extra precision
available from clock_gettime.

But ... most of the places you've touched in this patch have neither any
need for sub-microsecond precision nor any great need to worry about
shaving a few ns off the time taken by the call. As far as I can find,
the only place where it's actually worth our trouble to deal with it is
instr_time.h (ie, EXPLAIN ANALYZE and a few other uses).

So I think we should do something more like the attached.

Thanks for your valuable input.

As gettimeofday() is obsolete and any further enhancements are likely
to happen only to clock_gettime(), I changed it in many places.

Yes, I agree that unless clock_gettime() performs faster than
gettimeofday() on all platforms, we can retain gettimeofday().

Regards,
Hari Babu
Fujitsu Australia

#33Florian Weimer
fw@deneb.enyo.de
In reply to: Tom Lane (#28)
Re: gettimeofday is at the end of its usefulness?

* Tom Lane:

Florian Weimer <fw@deneb.enyo.de> writes:

* Tom Lane:

On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
all take about 40ns. Of course gettimeofday() only has 1us resolution,
but the other two have perhaps 10ns resolution (I get no duplicate
readings in a tight loop).

Isn't this very specific to kernel and glibc versions, depending on
things like CONFIG_HZ settings and what level of vDSO support has been
backported?

No doubt, but I have yet to find a platform where clock_gettime() exists
but performs worse than gettimeofday(). Do you know of one?

ppc64le with all the vDSO fixes for clock_gettime?

glibc has some test cases which fail because clock_gettime gives
inconsistent results. This has been fixed in current kernels, but I
don't know if everyone uses them.
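
(That is, time can appear to go backwards between successive reads.
A check of roughly this shape -- sketched here, not glibc's actual
test case --

	struct timespec a,
				b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	clock_gettime(CLOCK_MONOTONIC, &b);
	if (b.tv_sec < a.tv_sec ||
		(b.tv_sec == a.tv_sec && b.tv_nsec < a.tv_nsec))
		fprintf(stderr, "monotonic clock went backwards\n");

would fail intermittently on the affected kernels.)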

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Weimer (#33)
Re: gettimeofday is at the end of its usefulness?

Florian Weimer <fw@deneb.enyo.de> writes:

glibc has some test cases which fail because clock_gettime gives
inconsistent results. This has been fixed in current kernels, but I
don't know if everyone uses them.

Hmm, for which clock IDs?

regards, tom lane
