That EXPLAIN ANALYZE patch still needs work

Started by Tom Lane · almost 20 years ago · 90 messages · hackers
#1Tom Lane
tgl@sss.pgh.pa.us

Just got this rather surprising result:

regression=# \timing
Timing is on.
regression=# explain analyze select * from tenk1 a join tenk1 b on a.unique1 = b.unique2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=825.00..2484.00 rows=10000 width=488) (actual time=250.510..2557.129 rows=10000 loops=1)
Hash Cond: (a.unique1 = b.unique2)
-> Seq Scan on tenk1 a (cost=0.00..458.00 rows=10000 width=244) (actual time=0.084..30.070 rows=10000 loops=1)
-> Hash (cost=458.00..458.00 rows=10000 width=244) (actual time=250.242..250.242 rows=10000 loops=1)
-> Seq Scan on tenk1 b (cost=0.00..458.00 rows=10000 width=244) (actual time=0.019..23.317 rows=10000 loops=1)
Total runtime: 847.855 ms
(6 rows)

Time: 856.179 ms
regression=#

The "Total runtime" is correct AFAICT, which puts the top node's "actual
time" rather far out in left field. This is pretty repeatable on my old
slow HPPA machine. A new Xeon shows less of a discrepancy, but it's
still claiming top node actual > total, which is not right.

I realize that the sampling code can't be expected to be exactly right,
but a factor-of-three error is not acceptable. One of us broke this :-(

regards, tom lane

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#1)
Re: That EXPLAIN ANALYZE patch still needs work

On Mon, Jun 05, 2006 at 11:02:33PM -0400, Tom Lane wrote:

Just got this rather surprising result:

<snip bogus explain output>

The "Total runtime" is correct AFAICT, which puts the top node's "actual
time" rather far out in left field. This is pretty repeatable on my old
slow HPPA machine. A new Xeon shows less of a discrepancy, but it's
still claiming top node actual > total, which is not right.

Weird. Can you get the output of *instr in each call of
InstrEndLoop? Preferably after it does the calculation but before it
clears the values, so we get an idea of the number of samples and what
it guesses. SampleOverhead would be good too.

I know my version produced sensible results on my machine and the
handful of people testing, so I'll try it again with your changes, see
how it looks...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#2)
Re: That EXPLAIN ANALYZE patch still needs work

Martijn van Oosterhout <kleptog@svana.org> writes:

Weird. Can you get the output of *instr in each call of
InstrEndLoop? Preferably after it does the calculation but before it
clears the values, so we get an idea of the number of samples and what
it guesses. SampleOverhead would be good too.

The problem looks to be an underestimation of SampleOverhead, and on
reflection it's clear why: what CalculateSampleOverhead is measuring
isn't the total overhead, but the time between the two gettimeofday
calls. Which is probably about half the true overhead. What we
ought to do is iterate InstrStartNode/InstrStopNode N times, and
*separately* measure the total elapsed time spent.
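For concreteness, a rough standalone sketch of that kind of calibration
(just an illustration, not the actual instrument.c code: the inner
gettimeofday() pair stands in for an InstrStartNode/InstrStopNode pair,
and the outer pair measures the whole loop separately):

	#include <stdio.h>
	#include <sys/time.h>

	#define N 100000

	int
	main(void)
	{
		struct timeval outer_start, outer_end, t1, t2;
		double		elapsed;
		long		sink = 0;
		int			i;

		gettimeofday(&outer_start, NULL);
		for (i = 0; i < N; i++)
		{
			gettimeofday(&t1, NULL);	/* stand-in for InstrStartNode */
			gettimeofday(&t2, NULL);	/* stand-in for InstrStopNode */
			sink += t2.tv_usec - t1.tv_usec;	/* keep calls from being optimized away */
		}
		gettimeofday(&outer_end, NULL);

		/* total elapsed time, measured separately from the inner pairs */
		elapsed = (outer_end.tv_sec - outer_start.tv_sec) +
			(outer_end.tv_usec - outer_start.tv_usec) / 1000000.0;

		printf("per-iteration overhead: %g usec (sink %ld)\n",
			   elapsed * 1e6 / N, sink);
		return 0;
	}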

It occurs to me that what we really want to know is not so much the
total time spent in InstStartNode/InstrStopNode, as the difference in
the time spent when sampling is on vs when it is off. I'm not quite
sure if the time spent when it's off is negligible. Off to do some
measuring ...

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: That EXPLAIN ANALYZE patch still needs work

I wrote:

The problem looks to be an underestimation of SampleOverhead, and on
reflection it's clear why: what CalculateSampleOverhead is measuring
isn't the total overhead, but the time between the two gettimeofday
calls. Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the
point here. The time elapsed during a plan node execution when
sampling can be broken down into three phases:
time before first gettimeofday call
time between gettimeofday calls
time after second gettimeofday call
Only the second part is actually measured by the instrumentation system;
the other parts are overhead that has never been counted by EXPLAIN
ANALYZE, sampling version or no. Moreover, most of the runtime of
InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion
of the second-part runtime that doesn't occur when sampling = false.
Assuming that gettimeofday() has consistent runtime and the actual
time reported is measured at a consistent instant within that runtime,
I believe that we should take the SampleOverhead as just equal to
the runtime of a single gettimeofday() call. The added or removed
second-part time within InstrStartNode is just the tail time of
gettimeofday, and the added or removed second-part time within
InstrStopNode is basically just the head time of gettimeofday. (To make
this as true as possible, we need to change the order of operations so
that gettimeofday is invoked *immediately* after the "if (sampling)"
test, but that's easy.)

So this line of thought leads to the conclusion that
CalculateSampleOverhead is actually overestimating SampleOverhead
a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed. I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/000000
counter = 0/370937
firsttuple = 0.258321
tuplecount = 10000
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 10000
nloops = 1

on a run with an actual elapsed time near 750 msec. Clearly the
sampling adjustment is wrong, but why?
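
(For reference, the printed numbers are consistent with an extrapolation
along the lines of

	per_iter = (counter - firsttuple) / (samplecount - 1) - SampleOverhead
	         = (0.370937 - 0.258321) / 703 - 3.28e-06    ~= 0.000157 s
	total    = counter + per_iter * (itercount - samplecount)
	         = 0.370937 + 0.000157 * 9297                ~= 1.83 s

i.e. each of the ~9300 unsampled iterations gets charged the average cost
of a sampled one, which is where the bogus 1.8 s figure comes from.)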

I have a theory about this, and it's not pleasant at all. What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured. That is, the
runtime of the "second part" of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday(). It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space. After returning from the kernel call, the time to reload
that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to
explain. One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones. That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned. We can't ship a measurement
tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like
measurement method for taking EXPLAIN ANALYZE runtime measurements;
that is, take an interrupt every so often and bump the count for the
currently active plan node. This would spread the TLB-flush overhead
more evenly and thus avoid introducing that bias. There may be too much
platform dependency in this idea, though, and I also wonder if it'd
break the ability to do normal gprof profiling of the backend.

regards, tom lane

#5Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#4)
Re: That EXPLAIN ANALYZE patch still needs work

On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:

I have a theory about this, and it's not pleasant at all. What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured. That is, the
runtime of the "second part" of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday(). It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space. After returning from the kernel call, the time to reload
that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to
explain. One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones. That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

If that's the case, then maybe a more sophisticated method of measuring
the overhead would work. My thought is that on the second call to pull a
tuple from a node (second because the first probably has some anomalies
due to startup), we measure the overhead for that node. This would
probably mean doing the following:

get start time   # I'm not referring to this as gettimeofday, to avoid confusion
gettimeofday()   # this is the gettimeofday call that will happen during normal operation
get end time

Hopefully, there's no caching effect that would come into play from not
actually touching any of the data structures after the gettimeofday()
call. If that's not the case, it makes measuring the overhead more
complex, but I think it should still be doable...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#5)
Re: That EXPLAIN ANALYZE patch still needs work

"Jim C. Nasby" <jnasby@pervasive.com> writes:

If that's the case, then maybe a more sophisticated method of measuring
the overhead would work.

I think you missed the point: the time spent in gettimeofday() itself
is not the major overhead of EXPLAIN ANALYZE. At least it appears that
this is the case on my machine.

I'm thinking that interrupt-driven sampling might work OK though.
My previous worries were based on trying to use the ITIMER_PROF
timer, which might not be portable and would conflict with gprof
anyway. But EXPLAIN ANALYZE has always been interested in real time
rather than CPU time, so the correct interrupt to use is ITIMER_REAL.
That means we only have to share with our own existing usages of that
interrupt, which turns it from a portability issue into just a Small
Matter Of Programming.
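
A minimal standalone sketch of that kind of timer setup (purely
illustrative -- it ignores the question of sharing SIGALRM with the
backend's existing uses of ITIMER_REAL):

	#include <signal.h>
	#include <sys/time.h>

	static volatile sig_atomic_t total_samples = 0;

	static void
	sample_handler(int signo)
	{
		(void) signo;
		/* the real thing would also bump the currently active plan node */
		total_samples++;
	}

	static void
	start_sampling(void)
	{
		struct sigaction sa;
		struct itimerval it;

		sa.sa_handler = sample_handler;
		sigemptyset(&sa.sa_mask);
		sa.sa_flags = SA_RESTART;
		sigaction(SIGALRM, &sa, NULL);

		/* fire every 5 msec of wall-clock (not CPU) time */
		it.it_interval.tv_sec = 0;
		it.it_interval.tv_usec = 5000;
		it.it_value = it.it_interval;
		setitimer(ITIMER_REAL, &it, NULL);
	}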

regards, tom lane

#7Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#4)
Re: That EXPLAIN ANALYZE patch still needs work

On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed. I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/000000
counter = 0/370937
firsttuple = 0.258321
tuplecount = 10000
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 10000
nloops = 1

on a run with an actual elapsed time near 750 msec. Clearly the
sampling adjustment is wrong, but why?

This doesn't make any sense at all. How can a sampling run that only
sampled 7% of the actual tuples end up with an actual measured time
that's more than 50% of the actual final runtime?

Can you get an estimate of the actual overhead (i.e. compare wall clock
time of EXPLAIN ANALYZE vs the actual query). The cost of gettimeofday() is
on the order of a few hundred cycles; I'm not sure about the cost of
TLB flushes (that's probably highly architecture dependent).

To be honest, I wonder about caching effects, but for the disks. The
first few cycles of any plan (like an index scan) is going to incur
costs that won't happen later on. Because we sample much more heavily
at the beginning than at the end, this will bias towards higher
numbers. You should be able to check this by seeing whether queries
that don't require disk access fare better.

That would suggest a much more careful correction method that works for
non-linear timing patterns...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#7)
Re: That EXPLAIN ANALYZE patch still needs work

Martijn van Oosterhout <kleptog@svana.org> writes:

This doesn't make any sense at all. How can a sampling run that only
sampled 7% of the actual tuples end up with an actual measured time
that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact
taking longer than supposedly-equivalent unsampled ones, and by a pretty
good percentage too. I'm growing unsure again about the mechanism
responsible for that, however.

Can you get an estimate of the actual overhead (i.e. compare wall clock
time of EXPLAIN ANALYZE vs the actual query). The cost of gettimeofday() is
on the order of a few hundred cycles; I'm not sure about the cost of
TLB flushes (that's probably highly architecture dependent).

Here's some examples. Keep in mind I've already determined that
gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
-> Limit (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.208..2576.528 rows=10000 loops=1)
-> Hash Join (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.190..2082.577 rows=10000 loops=1)
Hash Cond: (a.unique1 = b.unique2)
-> Seq Scan on tenk1 a (cost=0.00..458.00 rows=10000 width=244) (actual time=0.082..3.718 rows=10000 loops=1)
-> Hash (cost=458.00..458.00 rows=10000 width=244) (actual time=207.933..207.933 rows=10000 loops=1)
-> Seq Scan on tenk1 b (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..3.583 rows=10000 loops=1)
Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
count
-------
10000
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite
a bit, probably because of the random variation we're using in sampling
interval. This particular combination was unusually close. But in any
case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small
here; the problem is that we're incorrectly extrapolating the measured
runtime to the unmeasured executions.

What's especially interesting is that the excess time doesn't seem to
show up if I form the query in a way that doesn't require pushing as
much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
-> Hash Join (cost=483.00..1316.00 rows=10000 width=0) (actual time=88.061..160.886 rows=10000 loops=1)
Hash Cond: (a.unique1 = b.unique2)
-> Seq Scan on tenk1 a (cost=0.00..458.00 rows=10000 width=4) (actual time=0.071..4.068 rows=10000 loops=1)
-> Hash (cost=458.00..458.00 rows=10000 width=4) (actual time=87.862..87.862 rows=10000 loops=1)
-> Seq Scan on tenk1 b (cost=0.00..458.00 rows=10000 width=4) (actual time=0.031..4.780 rows=10000 loops=1)
Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
count
-------
10000
(1 row)

Time: 202.531 ms
regression=#

(Without the OFFSET 0, the planner flattens the subquery and discovers
that it doesn't actually need to fetch any of the non-join-key table
columns.) Note the only plan nodes showing whacked-out timings are the
ones returning wide tuples (large "width" values). I'm not entirely
sure what to make of this. It could be interpreted as evidence for my
theory about TLB reloads during userspace data access being the problem.
But I'm getting a bit disenchanted with that theory after running the
same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loops=1)
-> Limit (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.961..989.272 rows=10000 loops=1)
-> Hash Join (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.946..920.398 rows=10000 loops=1)
Hash Cond: ("outer".unique1 = "inner".unique2)
-> Seq Scan on tenk1 a (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..66.861 rows=10000 loops=1)
-> Hash (cost=458.00..458.00 rows=10000 width=244) (actual time=330.745..330.745 rows=10000 loops=1)
-> Seq Scan on tenk1 b (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..65.890 rows=10000 loops=1)
Total runtime: 1041.293 ms
(8 rows)

Time: 1051.594 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
count
-------
10000
(1 row)

Time: 870.348 ms

Both runtimes are pretty repeatable. That says that on this example,
where we have sampled 40000 node executions (remember HASH is only
called once), the overhead of EXPLAIN ANALYZE is about 4.5 usec per
node, which is not very far from twice the gettimeofday() time;
it certainly doesn't show any evidence for a huge additional overhead
due to TLB flushes.
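(Roughly: (1051.6 ms - 870.3 ms) / 40000 node executions ~= 4.5 usec
apiece.)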

One thing I was wondering about was whether the samplings of upper nodes
might be correlating to a greater-than-random extent with the samplings
of lower nodes. That would result in the observed problem of
overestimating upper nodes. It doesn't explain the apparent dependence
on targetlist width, though. So I'm confused again.

To be honest, I wonder about caching effects, but for the disks.

All of the above are repeated executions that should be fully cached
in RAM.

regards, tom lane

#9Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: That EXPLAIN ANALYZE patch still needs work

Tom Lane <tgl@sss.pgh.pa.us> writes:

And two, that upper plan nodes seem much more affected than lower
ones. That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

I would have expected the opposite effect. If you only execute one instruction
then the cache miss can make it take many times longer than normal. But as the
number of instructions grows the cache gets repopulated and the overhead
levels off and becomes negligible relative to the total time.

The other option aside from gprof-like profiling would be to investigate those
cpu timing instructions again. I know some of them are unsafe on multi-cpu
systems but surely there's a solution out there. It's not like there aren't a
million games, music playing, and other kewl kid toys that depend on accurate
low overhead timing these days.

--
greg

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: That EXPLAIN ANALYZE patch still needs work

Greg Stark <gsstark@mit.edu> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

And two, that upper plan nodes seem much more affected than lower
ones. That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

I would have expected the opposite effect. If you only execute one instruction
then the cache miss can make it take many times longer than normal. But as the
number of instructions grows the cache gets repopulated and the overhead
levels off and becomes negligible relative to the total time.

Well, none of our plan nodes are in the "one instruction" regime ;-).
I was thinking that the total volume of data accessed was the critical
factor. Right at the moment I'm disillusioned with the TLB-access
theory though.

Something I'm noticing right now is that it seems like only hash joins
are really seriously misestimated --- nest and merge joins have some
small issues but only the hash is way out there. What's going on??
Can anyone else reproduce this?

The other option aside from gprof-like profiling would be to
investigate those cpu timing instructions again. I know some of them
are unsafe on multi-cpu systems but surely there's a solution out
there. It's not like there aren't a million games, music playing, and
other kewl kid toys that depend on accurate low overhead timing these
days.

Yeah, and they all work only on Windoze and Intel chips :-(

regards, tom lane

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: That EXPLAIN ANALYZE patch still needs work

On Tue, 2006-06-06 at 16:50 -0400, Tom Lane wrote:

I have a theory about this, and it's not pleasant at all. What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned. We can't ship a measurement
tool that is accurate on some platforms and not others.

Regrettably, I would agree and so conclude that we shouldn't pursue the
sampling idea further. Heisenbugs suck time like no other. Interesting,
though.

That leaves us with a number of possibilities:
0. Do Nothing
1. Option to skip timing altogether on an EXPLAIN ANALYZE
2. Option to produce a partial execution only, to locate problem areas.

Any others?

Option 2 would be harder to interpret, but still useful - originally
discussed in a current thread on -perform.
Option 1 wouldn't be as useful as the original sampling idea, but if it's
not on the table any longer....

I'd revert back to Option 1 as being the best choice for further work.

Do we agree the idea can't go further? What next?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#11)
Re: That EXPLAIN ANALYZE patch still needs work

Simon Riggs <simon@2ndquadrant.com> writes:

Do we agree the idea can't go further? What next?

It still needs investigation; I'm no longer convinced that the TLB-flush
theory is correct. See rest of thread. We may well have to revert the
current patch, but I'd like to be sure we understand why.

If we do have to revert, I'd propose that we pursue the notion of
interrupt-driven sampling like gprof uses.

regards, tom lane

#13Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#12)
Re: That EXPLAIN ANALYZE patch still needs work

On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Do we agree the idea can't go further? What next?

It still needs investigation; I'm no longer convinced that the TLB-flush
theory is correct. See rest of thread. We may well have to revert the
current patch, but I'd like to be sure we understand why.

One thing I'm thinking of trying: instead of assuming we can estimate
the durations of all the nodes by taking the total time divided by the
number of samples, we assume that the duration of tuple X is similar to
that of tuple X+1 but not necessarily the same as all other tuples.

This moves the calculation from EndLoop to StopInstr. Basically in
StopInstr you do the steps:

if (sampling)
{
	x = get time for this tuple
	n = number of tuples skipped

	cumulativetime += x * n
}

This would mean that we wouldn't be assuming that tuples near the end
take as long as tuples near the beginning. Except we're now dealing
with smaller numbers, so I'm worried about error accumulation.
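
Filling that in a little (field names here are hypothetical, not the
actual Instrumentation struct -- just a sketch of the bookkeeping):

	/* inside InstrStopNode, only when this iteration was sampled */
	if (instr->sampling)
	{
		double	x = sample_elapsed;		/* measured time for this tuple */
		long	n = instr->itercount - instr->last_sampled_iter;

		/* charge this iteration, plus the ones skipped since the last
		 * sample, at this sample's rate */
		instr->cumulativetime += x * n;
		instr->last_sampled_iter = instr->itercount;
	}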

If we do have to revert, I'd propose that we pursue the notion of
interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in
each node, but you no longer have any idea about when they were entered
or left.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#13)
Re: That EXPLAIN ANALYZE patch still needs work

Martijn van Oosterhout <kleptog@svana.org> writes:

On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:

If we do have to revert, I'd propose that we pursue the notion of
interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in
each node, but you no longer have any idea about when they were entered
or left.

Hm? It would work the same way gprof works. I'm imagining something
along the lines of

global variable:
	volatile Instrumentation *current_instr = NULL;

also add an "Instrumentation *parent_instr" field to Instrumentation

InstrStartNode does:
	instr->parent_instr = current_instr;
	current_instr = instr;

InstrStopNode restores the previous value of the global:
	current_instr = instr->parent_instr;

timer interrupt routine does this once every few milliseconds:
	total_samples++;
	for (instr = current_instr; instr; instr = instr->parent_instr)
		instr->samples++;

You still count executions and tuples in InstrStartNode/InstrStopNode,
but you never call gettimeofday there. You *do* call gettimeofday at
beginning and end of the whole query to measure the total runtime,
and then you blame a fraction samples/total_samples of that on each
node.

The bubble-up of sample counts to parent nodes could perhaps be done
while printing the results instead of on-the-fly as sketched above, but
the above seems simpler.
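
So at EXPLAIN-printing time each node's runtime would come out as
something like (total_query_seconds here being the gettimeofday-measured
wall clock time for the whole query):

	node_seconds = total_query_seconds * (double) instr->samples / total_samples;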

regards, tom lane

#15Bruce Momjian
bruce@momjian.us
In reply to: Martijn van Oosterhout (#13)
Re: That EXPLAIN ANALYZE patch still needs work

Martijn van Oosterhout <kleptog@svana.org> writes:

This would mean that we wouldn't be assuming that tuples near the end
take as long as tuples near the beginning. Except we're now dealing
with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially. Even when
functioning properly hashes get slower as the buckets fill up and there are
longer lists to traverse. Perhaps the hashes are suffering inordinately from
collisions though. Some of the data type hash functions looked kind of suspect
when I peeked at them a while back.

--
greg

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: That EXPLAIN ANALYZE patch still needs work

Greg Stark <gsstark@mit.edu> writes:

Martijn van Oosterhout <kleptog@svana.org> writes:

This would mean that we wouldn't be assuming that tuples near the end
take as long as tuples near the beginning. Except we're now dealing
with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially. Even when
functioning properly hashes get slower as the buckets fill up and there are
longer lists to traverse.

Nope, that is certainly not the explanation, because the hash table is
loaded in the (single) call of the Hash node at the start of the query.
It is static all through the sampled-and-not executions of the Hash Join
node, which is where our problem is.

I don't see that Martijn's idea responds to the problem anyway, if it is
some sort of TLB-related issue. The assumption we are making is not
"tuples near the end take as long as tuples near the beginning", it is
"tuples we sample take as long as tuples we don't" (both statements of
course meaning "on the average"). If the act of sampling incurs
overhead beyond the gettimeofday() call itself, then we are screwed,
and playing around with which iterations we sample and how we do the
extrapolation won't make the slightest bit of difference.

I'm unsure about the TLB-flush theory because I see no evidence of any
such overhead in the 8.1 timings; but on the other hand it's hard to see
what else could explain the apparent dependence on targetlist width.

regards, tom lane

#17Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: That EXPLAIN ANALYZE patch still needs work

Tom Lane <tgl@sss.pgh.pa.us> writes:

Nope, that is certainly not the explanation, because the hash table is
loaded in the (single) call of the Hash node at the start of the query.
It is static all through the sampled-and-not executions of the Hash Join
node, which is where our problem is.

At the risk of asking a stupid question, it's not perchance including that
hash build in the first sample it takes of the hash join node?

--
greg

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#17)
Re: That EXPLAIN ANALYZE patch still needs work

Greg Stark <gsstark@mit.edu> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Nope, that is certainly not the explanation, because the hash table is
loaded in the (single) call of the Hash node at the start of the query.
It is static all through the sampled-and-not executions of the Hash Join
node, which is where our problem is.

At the risk of asking a stupid question, it's not perchance including that
hash build in the first sample it takes of the hash join node?

Sure. Which is one of the reasons why the first tuple is excluded from
the extrapolation...

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#13)
Re: That EXPLAIN ANALYZE patch still needs work

Ah-hah, I've sussed it. The faulty assumption can actually be stated
as: "all the executions, except maybe the first, will take approximately
the same amount of time". The failing test case I've been looking at
is one where the system decides to use a "batched" hash join, and in
this plan type some iterations take much longer than others. (The
apparent dependence on targetlist width is now easy to understand,
because that affects the estimated hashtable size and hence the decision
whether batching is needed. I'm not sure why I don't see the effect
when running the identical case on my other machine, but since the
other one is a 64-bit machine its space calculations are probably a
bit different.)

I added some printf's to instrument.c to print out the actual time
measurements for each sample, as well as the calculations in
InstrEndLoop. Attached are the printouts that occurred for the HashJoin
node. The thing that is killing the extrapolation is of course the very
large time for iteration 2, which the extrapolation includes in its
average. But there's well over 10:1 variation in the later samples
as well.

On reflection it's easy to imagine other cases where some iterations
take much longer than others in a not-very-predictable way. For
instance, a join where only a subset of the outer tuples have a match
is going to act that way. I don't think there's any good way that we
can be sure we have a representative sample of executions, and so I'm
afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I propose we revert this patch and think about an interrupt-driven
sampling method instead.

regards, tom lane

401489a0 1: 331616 usec in iter 1
401489a0 1: 110338 usec in iter 2
401489a0 1: 54 usec in iter 3
401489a0 1: 99 usec in iter 4
401489a0 1: 77 usec in iter 5
401489a0 1: 145 usec in iter 6
401489a0 1: 117 usec in iter 7
401489a0 1: 33 usec in iter 8
401489a0 1: 97 usec in iter 9
401489a0 1: 98 usec in iter 10
401489a0 1: 52 usec in iter 11
401489a0 1: 33 usec in iter 12
401489a0 1: 51 usec in iter 13
401489a0 1: 83 usec in iter 14
401489a0 1: 153 usec in iter 15
401489a0 1: 115 usec in iter 16
401489a0 1: 52 usec in iter 17
401489a0 1: 242 usec in iter 18
401489a0 1: 48 usec in iter 19
401489a0 1: 87 usec in iter 20
401489a0 1: 23 usec in iter 21
401489a0 1: 80 usec in iter 22
401489a0 1: 57 usec in iter 23
401489a0 1: 17 usec in iter 24
401489a0 1: 51 usec in iter 25
401489a0 1: 18 usec in iter 26
401489a0 1: 16 usec in iter 27
401489a0 1: 100 usec in iter 28
401489a0 1: 45 usec in iter 29
401489a0 1: 174 usec in iter 30
401489a0 1: 131 usec in iter 31
401489a0 1: 17 usec in iter 32
401489a0 1: 45 usec in iter 33
401489a0 1: 16 usec in iter 34
401489a0 1: 120 usec in iter 35
401489a0 1: 15 usec in iter 36
401489a0 1: 17 usec in iter 37
401489a0 1: 15 usec in iter 38
401489a0 1: 48 usec in iter 39
401489a0 1: 127 usec in iter 40
401489a0 1: 36 usec in iter 41
401489a0 1: 41 usec in iter 42
401489a0 1: 69 usec in iter 43
401489a0 1: 50 usec in iter 44
401489a0 1: 104 usec in iter 45
401489a0 1: 22 usec in iter 46
401489a0 1: 50 usec in iter 47
401489a0 1: 17 usec in iter 48
401489a0 1: 47 usec in iter 49
401489a0 1: 54 usec in iter 50
401489a0 1: 46 usec in iter 51
401489a0 1: 20 usec in iter 54
401489a0 1: 38 usec in iter 55
401489a0 1: 68 usec in iter 56
401489a0 1: 17 usec in iter 60
401489a0 1: 16 usec in iter 61
401489a0 1: 15 usec in iter 67
401489a0 1: 31 usec in iter 68
401489a0 1: 15 usec in iter 70
401489a0 1: 61 usec in iter 78
401489a0 1: 143 usec in iter 85
401489a0 1: 21 usec in iter 89
401489a0 1: 14 usec in iter 96
401489a0 1: 21 usec in iter 104
401489a0 1: 21 usec in iter 107
401489a0 1: 16 usec in iter 116
401489a0 1: 194 usec in iter 118
401489a0 1: 136 usec in iter 122
401489a0 1: 34 usec in iter 127
401489a0 1: 46 usec in iter 131
401489a0 1: 15 usec in iter 133
401489a0 1: 15 usec in iter 135
401489a0 1: 34 usec in iter 137
401489a0 1: 54 usec in iter 142
401489a0 1: 206 usec in iter 151
401489a0 1: 75 usec in iter 162
401489a0 1: 20 usec in iter 172
401489a0 1: 66 usec in iter 177
401489a0 1: 21 usec in iter 181
401489a0 1: 69 usec in iter 186
401489a0 1: 16 usec in iter 193
401489a0 1: 46 usec in iter 201
401489a0 1: 33 usec in iter 210
401489a0 1: 50 usec in iter 216
401489a0 1: 21 usec in iter 222
401489a0 1: 18 usec in iter 224
401489a0 1: 33 usec in iter 229
401489a0 1: 20 usec in iter 232
401489a0 1: 44 usec in iter 236
401489a0 1: 29 usec in iter 239
401489a0 1: 34 usec in iter 240
401489a0 1: 31 usec in iter 241
401489a0 1: 27 usec in iter 254
401489a0 1: 45 usec in iter 257
401489a0 1: 147 usec in iter 259
401489a0 1: 15 usec in iter 269
401489a0 1: 16 usec in iter 278
401489a0 1: 14 usec in iter 279
401489a0 1: 58 usec in iter 290
401489a0 1: 15 usec in iter 291
401489a0 1: 53 usec in iter 295
401489a0 1: 15 usec in iter 306
401489a0 1: 16 usec in iter 318
401489a0 1: 34 usec in iter 328
401489a0 1: 37 usec in iter 339
401489a0 1: 28 usec in iter 346
401489a0 1: 21 usec in iter 353
401489a0 1: 15 usec in iter 361
401489a0 1: 49 usec in iter 370
401489a0 1: 61 usec in iter 380
401489a0 1: 21 usec in iter 385
401489a0 1: 50 usec in iter 397
401489a0 1: 36 usec in iter 411
401489a0 1: 44 usec in iter 417
401489a0 1: 55 usec in iter 420
401489a0 1: 15 usec in iter 427
401489a0 1: 74 usec in iter 429
401489a0 1: 178 usec in iter 433
401489a0 1: 182 usec in iter 440
401489a0 1: 41 usec in iter 454
401489a0 1: 39 usec in iter 455
401489a0 1: 28 usec in iter 470
401489a0 1: 57 usec in iter 473
401489a0 1: 15 usec in iter 480
401489a0 1: 21 usec in iter 489
401489a0 1: 76 usec in iter 498
401489a0 1: 38 usec in iter 513
401489a0 1: 21 usec in iter 523
401489a0 1: 33 usec in iter 529
401489a0 1: 39 usec in iter 545
401489a0 1: 25 usec in iter 559
401489a0 1: 27 usec in iter 566
401489a0 1: 15 usec in iter 574
401489a0 1: 25 usec in iter 590
401489a0 1: 62 usec in iter 606
401489a0 1: 27 usec in iter 610
401489a0 1: 16 usec in iter 625
401489a0 1: 49 usec in iter 641
401489a0 1: 27 usec in iter 654
401489a0 1: 22 usec in iter 665
401489a0 1: 17 usec in iter 669
401489a0 1: 39 usec in iter 682
401489a0 1: 27 usec in iter 684
401489a0 1: 68 usec in iter 697
401489a0 1: 15 usec in iter 700
401489a0 1: 16 usec in iter 712
401489a0 1: 15 usec in iter 728
401489a0 1: 22 usec in iter 746
401489a0 1: 16 usec in iter 752
401489a0 1: 45 usec in iter 766
401489a0 1: 44 usec in iter 777
401489a0 1: 33 usec in iter 785
401489a0 1: 21 usec in iter 802
401489a0 1: 23 usec in iter 814
401489a0 1: 15 usec in iter 820
401489a0 1: 21 usec in iter 839
401489a0 1: 23 usec in iter 840
401489a0 1: 39 usec in iter 858
401489a0 1: 94 usec in iter 867
401489a0 1: 33 usec in iter 877
401489a0 1: 56 usec in iter 895
401489a0 1: 14 usec in iter 898
401489a0 1: 22 usec in iter 900
401489a0 1: 14 usec in iter 910
401489a0 1: 20 usec in iter 918
401489a0 1: 14 usec in iter 937
401489a0 1: 28 usec in iter 942
401489a0 1: 29 usec in iter 944
401489a0 1: 21 usec in iter 947
401489a0 1: 68 usec in iter 957
401489a0 1: 46 usec in iter 969
401489a0 1: 15 usec in iter 971
401489a0 1: 16 usec in iter 987
401489a0 1: 22 usec in iter 1000
401489a0 1: 36 usec in iter 1001
401489a0 1: 22 usec in iter 1017
401489a0 1: 17 usec in iter 1031
401489a0 1: 36 usec in iter 1042
401489a0 1: 21 usec in iter 1062
401489a0 1: 15 usec in iter 1077
401489a0 1: 21 usec in iter 1097
401489a0 1: 88 usec in iter 1114
401489a0 1: 62 usec in iter 1123
401489a0 1: 16 usec in iter 1131
401489a0 1: 21 usec in iter 1149
401489a0 1: 15 usec in iter 1156
401489a0 1: 33 usec in iter 1177
401489a0 1: 22 usec in iter 1190
401489a0 1: 83 usec in iter 1201
401489a0 1: 16 usec in iter 1209
401489a0 1: 184 usec in iter 1224
401489a0 1: 33 usec in iter 1234
401489a0 1: 39 usec in iter 1236
401489a0 1: 21 usec in iter 1247
401489a0 1: 33 usec in iter 1267
401489a0 1: 16 usec in iter 1277
401489a0 1: 15 usec in iter 1297
401489a0 1: 14 usec in iter 1319
401489a0 1: 21 usec in iter 1329
401489a0 1: 15 usec in iter 1342
401489a0 1: 16 usec in iter 1357
401489a0 1: 57 usec in iter 1378
401489a0 1: 16 usec in iter 1400
401489a0 1: 16 usec in iter 1410
401489a0 1: 113 usec in iter 1417
401489a0 1: 16 usec in iter 1418
401489a0 1: 15 usec in iter 1433
401489a0 1: 23 usec in iter 1434
401489a0 1: 96 usec in iter 1442
401489a0 1: 30 usec in iter 1449
401489a0 1: 121 usec in iter 1454
401489a0 1: 16 usec in iter 1462
401489a0 1: 29 usec in iter 1482
401489a0 1: 48 usec in iter 1490
401489a0 1: 40 usec in iter 1512
401489a0 1: 20 usec in iter 1529
401489a0 1: 48 usec in iter 1545
401489a0 1: 45 usec in iter 1547
401489a0 1: 271 usec in iter 1557
401489a0 1: 15 usec in iter 1558
401489a0 1: 44 usec in iter 1576
401489a0 1: 15 usec in iter 1592
401489a0 1: 148 usec in iter 1616
401489a0 1: 15 usec in iter 1640
401489a0 1: 34 usec in iter 1650
401489a0 1: 27 usec in iter 1659
401489a0 1: 15 usec in iter 1673
401489a0 1: 59 usec in iter 1686
401489a0 1: 22 usec in iter 1689
401489a0 1: 16 usec in iter 1696
401489a0 1: 27 usec in iter 1715
401489a0 1: 23 usec in iter 1736
401489a0 1: 16 usec in iter 1749
401489a0 1: 27 usec in iter 1757
401489a0 1: 15 usec in iter 1777
401489a0 1: 15 usec in iter 1801
401489a0 1: 16 usec in iter 1804
401489a0 1: 227 usec in iter 1829
401489a0 1: 27 usec in iter 1848
401489a0 1: 16 usec in iter 1869
401489a0 1: 33 usec in iter 1881
401489a0 1: 57 usec in iter 1902
401489a0 1: 29 usec in iter 1926
401489a0 1: 15 usec in iter 1932
401489a0 1: 58 usec in iter 1955
401489a0 1: 15 usec in iter 1980
401489a0 1: 22 usec in iter 1992
401489a0 1: 14 usec in iter 2001
401489a0 1: 56 usec in iter 2010
401489a0 1: 22 usec in iter 2018
401489a0 1: 16 usec in iter 2043
401489a0 1: 15 usec in iter 2046
401489a0 1: 50 usec in iter 2051
401489a0 1: 16 usec in iter 2053
401489a0 1: 22 usec in iter 2078
401489a0 1: 89 usec in iter 2094
401489a0 1: 69 usec in iter 2103
401489a0 1: 33 usec in iter 2121
401489a0 1: 28 usec in iter 2124
401489a0 1: 34 usec in iter 2135
401489a0 1: 22 usec in iter 2148
401489a0 1: 22 usec in iter 2167
401489a0 1: 15 usec in iter 2186
401489a0 1: 40 usec in iter 2199
401489a0 1: 23 usec in iter 2219
401489a0 1: 21 usec in iter 2223
401489a0 1: 27 usec in iter 2235
401489a0 1: 14 usec in iter 2238
401489a0 1: 22 usec in iter 2250
401489a0 1: 31 usec in iter 2252
401489a0 1: 36 usec in iter 2259
401489a0 1: 15 usec in iter 2282
401489a0 1: 259 usec in iter 2300
401489a0 1: 37 usec in iter 2318
401489a0 1: 21 usec in iter 2331
401489a0 1: 16 usec in iter 2345
401489a0 1: 15 usec in iter 2350
401489a0 1: 16 usec in iter 2361
401489a0 1: 69 usec in iter 2364
401489a0 1: 29 usec in iter 2374
401489a0 1: 33 usec in iter 2383
401489a0 1: 116 usec in iter 2400
401489a0 1: 37 usec in iter 2420
401489a0 1: 39 usec in iter 2447
401489a0 1: 29 usec in iter 2455
401489a0 1: 114 usec in iter 2476
401489a0 1: 33 usec in iter 2488
401489a0 1: 22 usec in iter 2493
401489a0 1: 53 usec in iter 2512
401489a0 1: 22 usec in iter 2539
401489a0 1: 57 usec in iter 2563
401489a0 1: 32 usec in iter 2564
401489a0 1: 39 usec in iter 2572
401489a0 1: 22 usec in iter 2574
401489a0 1: 14 usec in iter 2597
401489a0 1: 74 usec in iter 2610
401489a0 1: 23 usec in iter 2618
401489a0 1: 21 usec in iter 2626
401489a0 1: 16 usec in iter 2627
401489a0 1: 45 usec in iter 2653
401489a0 1: 28 usec in iter 2662
401489a0 1: 20 usec in iter 2679
401489a0 1: 27 usec in iter 2694
401489a0 1: 17 usec in iter 2696
401489a0 1: 14 usec in iter 2706
401489a0 1: 22 usec in iter 2720
401489a0 1: 36 usec in iter 2742
401489a0 1: 20 usec in iter 2753
401489a0 1: 20 usec in iter 2761
401489a0 1: 16 usec in iter 2788
401489a0 1: 22 usec in iter 2795
401489a0 1: 16 usec in iter 2804
401489a0 1: 144 usec in iter 2812
401489a0 1: 16 usec in iter 2817
401489a0 1: 16 usec in iter 2834
401489a0 1: 16 usec in iter 2853
401489a0 1: 53 usec in iter 2872
401489a0 1: 22 usec in iter 2882
401489a0 1: 28 usec in iter 2907
401489a0 1: 38 usec in iter 2922
401489a0 1: 56 usec in iter 2935
401489a0 1: 188 usec in iter 2954
401489a0 1: 15 usec in iter 2964
401489a0 1: 22 usec in iter 2984
401489a0 1: 23 usec in iter 3007
401489a0 1: 27 usec in iter 3028
401489a0 1: 16 usec in iter 3051
401489a0 1: 178 usec in iter 3064
401489a0 1: 15 usec in iter 3082
401489a0 1: 20 usec in iter 3083
401489a0 1: 15 usec in iter 3097
401489a0 1: 14 usec in iter 3120
401489a0 1: 16 usec in iter 3126
401489a0 1: 15 usec in iter 3145
401489a0 1: 15 usec in iter 3171
401489a0 1: 15 usec in iter 3200
401489a0 1: 15 usec in iter 3218
401489a0 1: 15 usec in iter 3242
401489a0 1: 50 usec in iter 3263
401489a0 1: 16 usec in iter 3267
401489a0 1: 15 usec in iter 3284
401489a0 1: 14 usec in iter 3300
401489a0 1: 16 usec in iter 3317
401489a0 1: 15 usec in iter 3342
401489a0 1: 14 usec in iter 3372
401489a0 1: 15 usec in iter 3377
401489a0 1: 14 usec in iter 3379
401489a0 1: 14 usec in iter 3389
401489a0 1: 14 usec in iter 3417
401489a0 1: 16 usec in iter 3419
401489a0 1: 16 usec in iter 3446
401489a0 1: 16 usec in iter 3461
401489a0 1: 15 usec in iter 3475
401489a0 1: 16 usec in iter 3484
401489a0 1: 16 usec in iter 3489
401489a0 1: 14 usec in iter 3511
401489a0 1: 15 usec in iter 3519
401489a0 1: 15 usec in iter 3527
401489a0 1: 15 usec in iter 3552
401489a0 1: 15 usec in iter 3572
401489a0 1: 16 usec in iter 3574
401489a0 1: 14 usec in iter 3584
401489a0 1: 15 usec in iter 3586
401489a0 1: 16 usec in iter 3610
401489a0 1: 14 usec in iter 3619
401489a0 1: 15 usec in iter 3629
401489a0 1: 15 usec in iter 3658
401489a0 1: 14 usec in iter 3682
401489a0 1: 14 usec in iter 3683
401489a0 1: 15 usec in iter 3692
401489a0 1: 15 usec in iter 3718
401489a0 1: 15 usec in iter 3749
401489a0 1: 15 usec in iter 3758
401489a0 1: 16 usec in iter 3783
401489a0 1: 15 usec in iter 3796
401489a0 1: 15 usec in iter 3812
401489a0 1: 15 usec in iter 3818
401489a0 1: 15 usec in iter 3826
401489a0 1: 15 usec in iter 3836
401489a0 1: 15 usec in iter 3851
401489a0 1: 15 usec in iter 3875
401489a0 1: 57 usec in iter 3891
401489a0 1: 15 usec in iter 3922
401489a0 1: 15 usec in iter 3925
401489a0 1: 15 usec in iter 3933
401489a0 1: 16 usec in iter 3942
401489a0 1: 15 usec in iter 3967
401489a0 1: 14 usec in iter 3994
401489a0 1: 14 usec in iter 4015
401489a0 1: 15 usec in iter 4025
401489a0 1: 16 usec in iter 4056
401489a0 1: 15 usec in iter 4063
401489a0 1: 15 usec in iter 4070
401489a0 1: 15 usec in iter 4079
401489a0 1: 15 usec in iter 4100
401489a0 1: 15 usec in iter 4122
401489a0 1: 14 usec in iter 4149
401489a0 1: 14 usec in iter 4171
401489a0 1: 14 usec in iter 4186
401489a0 1: 16 usec in iter 4208
401489a0 1: 15 usec in iter 4231
401489a0 1: 15 usec in iter 4262
401489a0 1: 14 usec in iter 4280
401489a0 1: 14 usec in iter 4293
401489a0 1: 15 usec in iter 4300
401489a0 1: 14 usec in iter 4323
401489a0 1: 34 usec in iter 4336
401489a0 1: 15 usec in iter 4359
401489a0 1: 15 usec in iter 4378
401489a0 1: 15 usec in iter 4384
401489a0 1: 15 usec in iter 4388
401489a0 1: 15 usec in iter 4406
401489a0 1: 14 usec in iter 4432
401489a0 1: 15 usec in iter 4452
401489a0 1: 15 usec in iter 4460
401489a0 1: 15 usec in iter 4469
401489a0 1: 54 usec in iter 4492
401489a0 1: 15 usec in iter 4493
401489a0 1: 16 usec in iter 4505
401489a0 1: 16 usec in iter 4527
401489a0 1: 16 usec in iter 4557
401489a0 1: 15 usec in iter 4569
401489a0 1: 16 usec in iter 4588
401489a0 1: 15 usec in iter 4598
401489a0 1: 15 usec in iter 4607
401489a0 1: 55 usec in iter 4629
401489a0 1: 14 usec in iter 4644
401489a0 1: 14 usec in iter 4652
401489a0 1: 14 usec in iter 4680
401489a0 1: 14 usec in iter 4692
401489a0 1: 16 usec in iter 4696
401489a0 1: 15 usec in iter 4706
401489a0 1: 15 usec in iter 4718
401489a0 1: 15 usec in iter 4722
401489a0 1: 14 usec in iter 4730
401489a0 1: 16 usec in iter 4731
401489a0 1: 15 usec in iter 4743
401489a0 1: 15 usec in iter 4754
401489a0 1: 16 usec in iter 4785
401489a0 1: 14 usec in iter 4787
401489a0 1: 14 usec in iter 4808
401489a0 1: 14 usec in iter 4827
401489a0 1: 15 usec in iter 4861
401489a0 1: 15 usec in iter 4869
401489a0 1: 14 usec in iter 4890
401489a0 1: 15 usec in iter 4910
401489a0 1: 15 usec in iter 4943
401489a0 1: 14 usec in iter 4967
401489a0 1: 15 usec in iter 4969
401489a0 1: 15 usec in iter 4980
401489a0 1: 15 usec in iter 4985
401489a0 1: 16 usec in iter 5008
401489a0 1: 16 usec in iter 5042
401489a0 1: 15 usec in iter 5049
401489a0 1: 15 usec in iter 5055
401489a0 1: 15 usec in iter 5064
401489a0 1: 16 usec in iter 5071
401489a0 1: 16 usec in iter 5091
401489a0 1: 15 usec in iter 5117
401489a0 1: 15 usec in iter 5136
401489a0 1: 15 usec in iter 5162
401489a0 1: 15 usec in iter 5167
401489a0 1: 14 usec in iter 5180
401489a0 1: 15 usec in iter 5182
401489a0 1: 15 usec in iter 5193
401489a0 1: 14 usec in iter 5196
401489a0 1: 16 usec in iter 5206
401489a0 1: 14 usec in iter 5211
401489a0 1: 16 usec in iter 5221
401489a0 1: 15 usec in iter 5241
401489a0 1: 15 usec in iter 5272
401489a0 1: 54 usec in iter 5284
401489a0 1: 16 usec in iter 5288
401489a0 1: 15 usec in iter 5307
401489a0 1: 16 usec in iter 5337
401489a0 1: 15 usec in iter 5351
401489a0 1: 15 usec in iter 5352
401489a0 1: 16 usec in iter 5374
401489a0 1: 14 usec in iter 5409
401489a0 1: 15 usec in iter 5410
401489a0 1: 14 usec in iter 5442
401489a0 1: 14 usec in iter 5446
401489a0 1: 15 usec in iter 5465
401489a0 1: 14 usec in iter 5489
401489a0 1: 14 usec in iter 5518
401489a0 1: 14 usec in iter 5536
401489a0 1: 15 usec in iter 5563
401489a0 1: 53 usec in iter 5594
401489a0 1: 16 usec in iter 5610
401489a0 1: 15 usec in iter 5616
401489a0 1: 14 usec in iter 5626
401489a0 1: 15 usec in iter 5630
401489a0 1: 15 usec in iter 5662
401489a0 1: 15 usec in iter 5693
401489a0 1: 14 usec in iter 5699
401489a0 1: 14 usec in iter 5730
401489a0 1: 16 usec in iter 5754
401489a0 1: 15 usec in iter 5787
401489a0 1: 14 usec in iter 5789
401489a0 1: 15 usec in iter 5808
401489a0 1: 16 usec in iter 5809
401489a0 1: 16 usec in iter 5841
401489a0 1: 14 usec in iter 5849
401489a0 1: 14 usec in iter 5873
401489a0 1: 15 usec in iter 5887
401489a0 1: 14 usec in iter 5908
401489a0 1: 16 usec in iter 5937
401489a0 1: 15 usec in iter 5961
401489a0 1: 15 usec in iter 5982
401489a0 1: 15 usec in iter 5984
401489a0 1: 59 usec in iter 6004
401489a0 1: 14 usec in iter 6032
401489a0 1: 15 usec in iter 6051
401489a0 1: 15 usec in iter 6081
401489a0 1: 15 usec in iter 6112
401489a0 1: 15 usec in iter 6119
401489a0 1: 15 usec in iter 6136
401489a0 1: 60 usec in iter 6140
401489a0 1: 14 usec in iter 6167
401489a0 1: 56 usec in iter 6168
401489a0 1: 14 usec in iter 6179
401489a0 1: 13 usec in iter 6209
401489a0 1: 15 usec in iter 6227
401489a0 1: 14 usec in iter 6249
401489a0 1: 61 usec in iter 6250
401489a0 1: 14 usec in iter 6270
401489a0 1: 14 usec in iter 6291
401489a0 1: 13 usec in iter 6312
401489a0 1: 16 usec in iter 6331
401489a0 1: 14 usec in iter 6358
401489a0 1: 53 usec in iter 6359
401489a0 1: 15 usec in iter 6368
401489a0 1: 14 usec in iter 6373
401489a0 1: 13 usec in iter 6391
401489a0 1: 15 usec in iter 6411
401489a0 1: 15 usec in iter 6432
401489a0 1: 15 usec in iter 6454
401489a0 1: 15 usec in iter 6476
401489a0 1: 15 usec in iter 6508
401489a0 1: 15 usec in iter 6545
401489a0 1: 14 usec in iter 6564
401489a0 1: 15 usec in iter 6581
401489a0 1: 15 usec in iter 6611
401489a0 1: 14 usec in iter 6641
401489a0 1: 14 usec in iter 6661
401489a0 1: 15 usec in iter 6680
401489a0 1: 16 usec in iter 6700
401489a0 1: 15 usec in iter 6728
401489a0 1: 14 usec in iter 6738
401489a0 1: 15 usec in iter 6758
401489a0 1: 14 usec in iter 6777
401489a0 1: 15 usec in iter 6782
401489a0 1: 15 usec in iter 6784
401489a0 1: 15 usec in iter 6799
401489a0 1: 15 usec in iter 6803
401489a0 1: 16 usec in iter 6817
401489a0 1: 15 usec in iter 6831
401489a0 1: 14 usec in iter 6840
401489a0 1: 15 usec in iter 6861
401489a0 1: 14 usec in iter 6893
401489a0 1: 15 usec in iter 6911
401489a0 1: 14 usec in iter 6935
401489a0 1: 16 usec in iter 6950
401489a0 1: 15 usec in iter 6976
401489a0 1: 54 usec in iter 6987
401489a0 1: 16 usec in iter 6989
401489a0 1: 15 usec in iter 7020
401489a0 1: 14 usec in iter 7043
401489a0 1: 15 usec in iter 7065
401489a0 1: 14 usec in iter 7097
401489a0 1: 14 usec in iter 7110
401489a0 1: 16 usec in iter 7117
401489a0 1: 15 usec in iter 7126
401489a0 1: 15 usec in iter 7142
401489a0 1: 14 usec in iter 7153
401489a0 1: 14 usec in iter 7157
401489a0 1: 15 usec in iter 7160
401489a0 1: 15 usec in iter 7187
401489a0 1: 15 usec in iter 7223
401489a0 1: 15 usec in iter 7251
401489a0 1: 15 usec in iter 7271
401489a0 1: 15 usec in iter 7277
401489a0 1: 15 usec in iter 7298
401489a0 1: 14 usec in iter 7301
401489a0 1: 15 usec in iter 7334
401489a0 1: 16 usec in iter 7359
401489a0 1: 15 usec in iter 7378
401489a0 1: 15 usec in iter 7404
401489a0 1: 17 usec in iter 7417
401489a0 1: 17 usec in iter 7420
401489a0 1: 15 usec in iter 7431
401489a0 1: 16 usec in iter 7437
401489a0 1: 16 usec in iter 7451
401489a0 1: 16 usec in iter 7478
401489a0 1: 15 usec in iter 7502
401489a0 1: 15 usec in iter 7503
401489a0 1: 14 usec in iter 7535
401489a0 1: 15 usec in iter 7556
401489a0 1: 15 usec in iter 7558
401489a0 1: 52 usec in iter 7578
401489a0 1: 16 usec in iter 7583
401489a0 1: 13 usec in iter 7596
401489a0 1: 15 usec in iter 7618
401489a0 1: 14 usec in iter 7623
401489a0 1: 16 usec in iter 7624
401489a0 1: 15 usec in iter 7626
401489a0 1: 15 usec in iter 7636
401489a0 1: 16 usec in iter 7638
401489a0 1: 16 usec in iter 7661
401489a0 1: 16 usec in iter 7690
401489a0 1: 14 usec in iter 7716
401489a0 1: 16 usec in iter 7739
401489a0 1: 15 usec in iter 7767
401489a0 1: 15 usec in iter 7776
401489a0 1: 15 usec in iter 7815
401489a0 1: 16 usec in iter 7844
401489a0 1: 15 usec in iter 7875
401489a0 1: 15 usec in iter 7891
401489a0 1: 14 usec in iter 7893
401489a0 1: 14 usec in iter 7930
401489a0 1: 15 usec in iter 7937
401489a0 1: 14 usec in iter 7975
401489a0 1: 16 usec in iter 8001
401489a0 1: 14 usec in iter 8039
401489a0 1: 14 usec in iter 8044
401489a0 1: 15 usec in iter 8054
401489a0 1: 15 usec in iter 8080
401489a0 1: 15 usec in iter 8113
401489a0 1: 16 usec in iter 8120
401489a0 1: 15 usec in iter 8159
401489a0 1: 15 usec in iter 8174
401489a0 1: 16 usec in iter 8210
401489a0 1: 14 usec in iter 8226
401489a0 1: 55 usec in iter 8261
401489a0 1: 16 usec in iter 8291
401489a0 1: 15 usec in iter 8319
401489a0 1: 15 usec in iter 8322
401489a0 1: 14 usec in iter 8348
401489a0 1: 14 usec in iter 8359
401489a0 1: 14 usec in iter 8375
401489a0 1: 15 usec in iter 8401
401489a0 1: 15 usec in iter 8422
401489a0 1: 16 usec in iter 8437
401489a0 1: 15 usec in iter 8457
401489a0 1: 15 usec in iter 8485
401489a0 1: 14 usec in iter 8511
401489a0 1: 15 usec in iter 8512
401489a0 1: 16 usec in iter 8536
401489a0 1: 15 usec in iter 8540
401489a0 1: 15 usec in iter 8570
401489a0 1: 15 usec in iter 8609
401489a0 1: 15 usec in iter 8637
401489a0 1: 15 usec in iter 8648
401489a0 1: 15 usec in iter 8670
401489a0 1: 14 usec in iter 8673
401489a0 1: 16 usec in iter 8714
401489a0 1: 15 usec in iter 8729
401489a0 1: 14 usec in iter 8750
401489a0 1: 15 usec in iter 8787
401489a0 1: 14 usec in iter 8803
401489a0 1: 16 usec in iter 8816
401489a0 1: 15 usec in iter 8838
401489a0 1: 14 usec in iter 8843
401489a0 1: 15 usec in iter 8844
401489a0 1: 16 usec in iter 8865
401489a0 1: 17 usec in iter 8870
401489a0 1: 16 usec in iter 8879
401489a0 1: 16 usec in iter 8896
401489a0 1: 15 usec in iter 8937
401489a0 1: 15 usec in iter 8949
401489a0 1: 15 usec in iter 8960
401489a0 1: 15 usec in iter 8994
401489a0 1: 16 usec in iter 9027
401489a0 1: 17 usec in iter 9040
401489a0 1: 15 usec in iter 9046
401489a0 1: 15 usec in iter 9050
401489a0 1: 14 usec in iter 9057
401489a0 1: 16 usec in iter 9058
401489a0 1: 14 usec in iter 9075
401489a0 1: 15 usec in iter 9076
401489a0 1: 16 usec in iter 9099
401489a0 1: 14 usec in iter 9139
401489a0 1: 15 usec in iter 9158
401489a0 1: 15 usec in iter 9186
401489a0 1: 14 usec in iter 9212
401489a0 1: 15 usec in iter 9238
401489a0 1: 16 usec in iter 9269
401489a0 1: 17 usec in iter 9275
401489a0 1: 14 usec in iter 9298
401489a0 1: 16 usec in iter 9312
401489a0 1: 14 usec in iter 9325
401489a0 1: 15 usec in iter 9345
401489a0 1: 16 usec in iter 9362
401489a0 1: 16 usec in iter 9382
401489a0 1: 16 usec in iter 9387
401489a0 1: 15 usec in iter 9424
401489a0 1: 52 usec in iter 9435
401489a0 1: 15 usec in iter 9438
401489a0 1: 15 usec in iter 9481
401489a0 1: 15 usec in iter 9491
401489a0 1: 15 usec in iter 9499
401489a0 1: 16 usec in iter 9535
401489a0 1: 15 usec in iter 9554
401489a0 1: 14 usec in iter 9561
401489a0 1: 14 usec in iter 9576
401489a0 1: 15 usec in iter 9614
401489a0 1: 15 usec in iter 9640
401489a0 1: 14 usec in iter 9668
401489a0 1: 15 usec in iter 9700
401489a0 1: 16 usec in iter 9705
401489a0 1: 15 usec in iter 9729
401489a0 1: 15 usec in iter 9738
401489a0 1: 15 usec in iter 9777
401489a0 1: 15 usec in iter 9816
401489a0 1: 16 usec in iter 9822
401489a0 1: 15 usec in iter 9828
401489a0 1: 17 usec in iter 9870
401489a0 1: 15 usec in iter 9911
401489a0 1: 15 usec in iter 9916
401489a0 1: 15 usec in iter 9956
401489a0 1: 15 usec in iter 9991
401489a0 1: raw totaltime = 0.462843
401489a0 1: per_iter = 0.00017897, SampleOverhead = 3.29e-06
401489a0 1: adj totaltime = 2.12368
401489a0 1: sampling = 0
401489a0 1: starttime = 0/000000
401489a0 1: counter = 0/462843
401489a0 1: firsttuple = 0.331616
401489a0 1: tuplecount = 10000
401489a0 1: itercount = 10001
401489a0 1: samplecount = 721
401489a0 1: nextsample = 10018.7
401489a0 1: startup = 0.331616
401489a0 1: total = 2.12368
401489a0 1: ntuples = 10000

#20Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#19)
Re: That EXPLAIN ANALYZE patch still needs work

On Wed, Jun 07, 2006 at 03:32:35PM -0400, Tom Lane wrote:

On reflection it's easy to imagine other cases where some iterations
take much longer than others in a not-very-predictable way. For
instance, a join where only a subset of the outer tuples have a match
is going to act that way. I don't think there's any good way that we
can be sure we have a representative sample of executions, and so I'm
afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I don't think we ever assumed it would never be a problem. We just
assumed that the sampling would cancel the effect out to give a decent
average.

Thing is, I never expected to get a three-orders-of-magnitude difference
between samples. That's just far too much to be corrected in any way.
The random sampling should counter most effects, and I didn't consider
the one tuple in a million that takes much longer to be a particularly
common case.

Anyway, as a test, if you take the approach that the measurement at
item X only applies to the tuples immediately preceding it, then for the
data you posted you get a result of 0.681148 seconds. How long did the
query that produced that data actually run?

(The bit of perl I used is:

cat data | perl -lne 'BEGIN { $last=0; $sum =0 }
/: (\d+) usec in iter (\d+)/ && do { $sum += ($2-$last)*$1; $last=$2 };
END { print "$sum\n" }'

I propose we revert this patch and think about an interrupt-driven
sampling method instead.

That's another possibility, of course...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#20)
#22Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#14)
#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#19)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#22)
#26Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#21)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#24)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#27)
#29Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#25)
#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#30)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#28)
#35Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#31)
#36Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#34)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#36)
#38A.M.
agentm@themactionfaction.com
In reply to: Tom Lane (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Mayer (#35)
#40Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#37)
#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#34)
#42Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#34)
#43Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#33)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#43)
#45Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#43)
#46Larry Rosenman
ler@lerctr.org
In reply to: Tom Lane (#44)
#47Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#44)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#47)
#49Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#44)
#50Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#48)
#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#49)
#52Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jim Nasby (#49)
#53Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#51)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#45)
#55Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Tom Lane (#54)
#56A.M.
agentm@themactionfaction.com
In reply to: Mark Kirkwood (#55)
#57Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Alvaro Herrera (#41)
#58Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#48)
#59Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#47)
#60Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#58)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#59)
#62Larry Rosenman
ler@lerctr.org
In reply to: Tom Lane (#61)
#63Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#61)
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Larry Rosenman (#62)
#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#64)
#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#65)
#67Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#64)
#68A.M.
agentm@themactionfaction.com
In reply to: Martijn van Oosterhout (#67)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#67)
#70Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#66)
#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#57)
#72Martijn van Oosterhout
kleptog@svana.org
In reply to: A.M. (#68)
#73Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#66)
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#73)
#75Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#74)
#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#75)
#77Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#76)
#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#77)
#79Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#78)
#80Benny Amorsen
benny+usenet@amorsen.dk
In reply to: Larry Rosenman (#62)
#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#72)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#79)
#83Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#71)
#84Luke Lonergan
llonergan@greenplum.com
In reply to: Ron Mayer (#35)
#85Martijn van Oosterhout
kleptog@svana.org
In reply to: Luke Lonergan (#84)
#86Luke Lonergan
llonergan@greenplum.com
In reply to: Martijn van Oosterhout (#85)
#87Martijn van Oosterhout
kleptog@svana.org
In reply to: Luke Lonergan (#86)
#88Luke Lonergan
llonergan@greenplum.com
In reply to: Martijn van Oosterhout (#87)
#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Luke Lonergan (#84)
#90Simon Riggs
simon@2ndQuadrant.com
In reply to: Martijn van Oosterhout (#87)