Where does the time go?

Started by Kevin Grittner · 26 messages
#1 Kevin Grittner
Kevin.Grittner@wicourts.gov

I have some odd results from timing two versions of an update query, and
was hoping to get a better handle on how to interpret this. The query
does an update of one table. One version does three NOT IN tests
against three related tables. The other version does the logically
equivalent NOT EXISTS tests. Here are the timings:

(a) NOT IN EXPLAIN ANALYZE reports: 150 ms.

(b) Difference between SELECT CURRENT_TIMESTAMP values before and
after the NOT IN EXPLAIN ANALYZE: 171 ms.

(c) Difference between SELECT CURRENT_TIMESTAMP values before and
after NOT IN query without EXPLAIN ANALYZE: 140 ms.

(d) NOT EXISTS EXPLAIN ANALYZE reports: 9 ms.

(e) Difference between SELECT CURRENT_TIMESTAMP values before and
after the NOT EXISTS EXPLAIN ANALYZE: 62 ms.

(f) Difference between SELECT CURRENT_TIMESTAMP values before and
after NOT EXISTS query without EXPLAIN ANALYZE: 62 ms.

I repeated these tests many times. After each test I updated the table
back to its original state and ran CLUSTER and VACUUM ANALYZE, then
allowed a minute of settling time. The timings are remarkably
consistent from one try to the next with an occasional outlier on the
high side.
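
The same numbers in tabular form (letters refer to the measurements
above):

                 EXPLAIN ANALYZE    elapsed, with EA    elapsed, without EA
  NOT IN         (a) 150 ms         (b) 171 ms          (c) 140 ms
  NOT EXISTS     (d)   9 ms         (e)  62 ms          (f)  62 ms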

It seems to me that these results indicate that EXPLAIN ANALYZE
distorts the timing of the NOT IN query (b - c = 31 ms), but not that
of the NOT EXISTS query (e == f). The time to bracket the UPDATE with the
SELECT CURRENT_TIMESTAMP queries can't exceed 21 ms (b - a). That seems
like a long time, though, so I suspect that some of that time is going
to the same place that most of the 53 ms (e - d) goes on the NOT EXISTS
query. Where would that be? Parse and plan phases? Is there a way to
pin that down better?

I was planning on posting the queries with plans and timings to
illustrate a costing issue, but when I saw these timings I figured I'd
ask about this first. It raises what seems like a more fundamental
question, which would have an impact on collecting all of the
interesting information. If execution time is 9 ms, but there is
something else adding several times that, perhaps that something else is
interesting in itself.

Thanks for any info.

-Kevin

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#1)
Re: Where does the time go?

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I have some odd results from timing two versions of an update query, and
was hoping to get a better handle on how to interpret this.

You didn't show us the explain analyze results, but I'm betting that a
big part of your issue is that the EXPLAIN ANALYZE instrumentation
overhead is (1) significant and (2) different for the two query plans.
The instrumentation overhead should be about the same for any one plan
node execution, but the two plans could involve very different numbers
of plan node executions ...

regards, tom lane

#3 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#2)
Re: Where does the time go?

On Wed, Mar 22, 2006 at 8:59 pm, in message
<27386.1143082792@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
You didn't show us the explain analyze results,

The below is cut & paste directly from a psql run without editing.

bigbird=> UPDATE "User" SET "isActive" = TRUE WHERE "countyNo" = 13 and
"isActive" = FALSE;
UPDATE 0
bigbird=> cluster "User";
CLUSTER
bigbird=> vacuum analyze "User";
VACUUM
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:55.875-06
(1 row)

bigbird=> explain analyze
bigbird-> UPDATE "User"
bigbird-> SET "isActive" = FALSE
bigbird-> WHERE "countyNo" = 13
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserAuthority")
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserDivision")
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserCtofcAuthority");
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using "User_pkey" on "User" (cost=2365.43..2499.34 rows=44
width=126) (actual time=145.968..147.061 rows=153 loops=1)
Index Cond: (("countyNo")::smallint = 13)
Filter: ((NOT (hashed subplan)) AND (NOT (hashed subplan)) AND (NOT
(hashed subplan)))
SubPlan
-> Seq Scan on "UserCtofcAuthority" (cost=0.00..332.65
rows=15565 width=14) (actual time=0.003..10.105 rows=15565 loops=1)
-> Seq Scan on "UserDivision" (cost=0.00..326.65 rows=15765
width=14) (actual time=0.003..10.409 rows=15764 loops=1)
-> Seq Scan on "UserAuthority" (cost=0.00..1451.24 rows=70624
width=14) (actual time=0.007..45.823 rows=70626 loops=1)
Total runtime: 150.340 ms
(8 rows)

bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.046-06
(1 row)

bigbird=> UPDATE "User" SET "isActive" = TRUE WHERE "countyNo" = 13 and
"isActive" = FALSE;
UPDATE 153
bigbird=> cluster "User";
CLUSTER
bigbird=> vacuum analyze "User";
VACUUM
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.203-06
(1 row)

bigbird=> UPDATE "User"
bigbird-> SET "isActive" = FALSE
bigbird-> WHERE "countyNo" = 13
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserAuthority")
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserDivision")
bigbird-> AND ("userId", "countyNo") NOT IN (SELECT "userId",
"countyNo" FROM "UserCtofcAuthority");
UPDATE 153
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.343-06
(1 row)

bigbird=> UPDATE "User" SET "isActive" = TRUE WHERE "countyNo" = 13 and
"isActive" = FALSE;
UPDATE 153
bigbird=> cluster "User";
CLUSTER
bigbird=> vacuum analyze "User";
VACUUM
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.484-06
(1 row)

bigbird=> explain analyze
bigbird-> UPDATE "User"
bigbird-> SET "isActive" = FALSE
bigbird-> WHERE "countyNo" = 13
bigbird-> AND NOT EXISTS (SELECT * FROM "UserAuthority" a where
a."countyNo" = "User"."countyNo" and a."userId" = "User"."userId")
bigbird-> AND NOT EXISTS (SELECT * FROM "UserDivision" b where
b."countyNo" = "User"."countyNo" and b."userId" = "User"."userId")
bigbird-> AND NOT EXISTS (SELECT * FROM "UserCtofcAuthority" c
where c."countyNo" = "User"."countyNo" and c."userId" =
"User"."userId");

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using "User_pkey" on "User" (cost=0.00..3650.67 rows=42
width=111) (actual time=0.057..5.722 rows=153 loops=1)
Index Cond: (("countyNo")::smallint = 13)
Filter: ((NOT (subplan)) AND (NOT (subplan)) AND (NOT (subplan)))
SubPlan
-> Index Scan using "UserCtofcAuthority_pkey" on
"UserCtofcAuthority" c (cost=0.00..3.48 rows=1 width=50) (actual
time=0.004..0.004 rows=0 loops=153)
Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND
(("userId")::bpchar = ($1)::bpchar))
-> Index Scan using "UserDivision_pkey" on "UserDivision" b
(cost=0.00..3.53 rows=1 width=42) (actual time=0.006..0.006 rows=0
loops=156)
Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND
(("userId")::bpchar = ($1)::bpchar))
-> Index Scan using "UserAuthority_pkey" on "UserAuthority" a
(cost=0.00..3.60 rows=1 width=42) (actual time=0.007..0.007 rows=1
loops=341)
Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND
(("userId")::bpchar = ($1)::bpchar))
Total runtime: 9.136 ms
(11 rows)

bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.546-06
(1 row)

bigbird=> UPDATE "User" SET "isActive" = TRUE WHERE "countyNo" = 13 and
"isActive" = FALSE;
UPDATE 153
bigbird=> cluster "User";
CLUSTER
bigbird=> vacuum analyze "User";
VACUUM
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.703-06
(1 row)

bigbird=> UPDATE "User"
bigbird-> SET "isActive" = FALSE
bigbird-> WHERE "countyNo" = 13
bigbird-> AND NOT EXISTS (SELECT * FROM "UserAuthority" a where
a."countyNo" = "User"."countyNo" and a."userId" = "User"."userId")
bigbird-> AND NOT EXISTS (SELECT * FROM "UserDivision" b where
b."countyNo" = "User"."countyNo" and b."userId" = "User"."userId")
bigbird-> AND NOT EXISTS (SELECT * FROM "UserCtofcAuthority" c
where c."countyNo" = "User"."countyNo" and c."userId" =
"User"."userId");
UPDATE 153
bigbird=> select current_timestamp;
now
----------------------------
2006-03-22 16:30:56.765-06
(1 row)

I'm betting that a big part of your issue is that the EXPLAIN ANALYZE
instrumentation overhead is (1) significant and (2) different for the
two query plans.

The run time of the NOT IN query, as measured by elapsed time between
SELECT CURRENT_TIMESTAMP executions, increased by 31 ms. The run time
of the NOT EXISTS did not increase measurably. I factored out the
EXPLAIN ANALYZE time in my original post for this topic. My question
was about where the REST of the time goes.

On this topic, I'm interested in the difference between the time
reported by the EXPLAIN ANALYZE and the elapsed time to run the query.
I've run a series of tests to determine the minimum time for a query
bracketed by SELECT CURRENT_TIMESTAMP statements in this environment,
and it is 12 ms on average (the clock advances in 15 ms to 16 ms steps,
which helps explain the odd consistency of the timings). That leaves an
unaccounted-for difference between the time reported by EXPLAIN ANALYZE
and the timestamp elapsed time of, on average, 9 ms for the NOT IN form
of the query and 41 ms for the NOT EXISTS form of the query. (In the
run shown above, it's higher.) I'm guessing that this is time spent
parsing and planning the query. Since for one query it is many times
the actual run time reported by EXPLAIN ANALYZE, it has piqued my
interest. It's also interesting that this time would be so much higher
for one query than the other.

What is the best way to see where this time is going? (Or, if it is
obvious to those more familiar with the internals, please share.)

-Kevin

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#3)
Re: Where does the time go?

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

You didn't show us the explain analyze results,

The below is cut & paste directly from a psql run without editing.

OK, so the two plans do indeed have much different node execution
counts. The EXPLAIN ANALYZE instrumentation overhead is basically
proportional to (rows+1)*loops summed over all the nodes in the plan,
so I count about 102112 node executions in the NOT IN plan versus
1145 in the NOT EXISTS plan --- in other words, 100x more overhead for
the former.
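
For reference, that sum worked out from the rows and loops values in
the plans upthread:

  NOT IN:     (153+1)*1 + (15565+1)*1 + (15764+1)*1 + (70626+1)*1 = 102112
  NOT EXISTS: (153+1)*1 + (0+1)*153 + (0+1)*156 + (1+1)*341 = 1145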

The run time of the NOT IN query, as measured by elapsed time between
SELECT CURRENT_TIMESTAMP executions, increased by 31 ms.

Works out to about 30 microsec per node execution, which seems a bit
high for modern machines ... and the coarse quantization of the
CURRENT_TIMESTAMP results is odd too. What platform is this on exactly?

That leaves an unaccounted difference between the time
reported by EXPLAIN ANALYZE and the timestamp elapsed time of (on
average) 9 ms for the NOT IN form of the query, and 41 ms for the NOT
EXISTS for of the query. (In the run shown above, it's higher.) I'm
guessing that this is the time spent in parsing and planning the query.

Parse/plan time is one component, and another is the time spent by
EXPLAIN preparing its output display, which is not an area we've spent
any time at all optimizing --- I wouldn't be surprised if it's kinda
slow. However, these plans are relatively similar in terms of the
complexity of the display, so it is odd that there'd be so much
difference.

What is the best way to see where this time is going?

Profiling with gprof or some such tool might be educational.

regards, tom lane

#5 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#4)
Re: Where does the time go?

On Thu, Mar 23, 2006 at 11:27 am, in message
<10223.1143134839@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The run time of the NOT IN query, as measured by elapsed time between
SELECT CURRENT_TIMESTAMP executions, increased by 31 ms.

Works out to about 30 microsec per node execution, which seems a bit
high for modern machines ... and the coarse quantization of the
CURRENT_TIMESTAMP results is odd too. What platform is this on exactly?

This is a smaller machine with a copy of the full production database.
A single 3.6 GHz Xeon with 4 GB RAM running Windows Server 2003. It was
being used to test update scripts before applying them to the production
machines. I stumbled across a costing issue I thought worth posting,
and in the course of gathering data noticed this time difference I
didn't understand.

What is the best way to see where this time is going?

Profiling with gprof or some such tool might be educational.

Our builds are all done with --enable-debug, but this machine doesn't
even have msys installed. I'll try to put together some way to profile
it on this machine or some other. (It might be easier to move it to a
Linux machine and confirm the problem there, then profile.)

Thanks,

-Kevin

#6 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#4)
Re: Where does the time go?

On Thu, Mar 23, 2006 at 11:27 am, in message
<10223.1143134839@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Profiling with gprof or some such tool might be educational.

I've never used gprof before, and from a quick scan of the info, it
appears that I need to compile and link a special version of the
software to generate the file that gprof needs. Is this correct? Does
it work on a Windows build, or will I need to use Linux? Any tips?

Thanks, all.

-Kevin

#7 Martijn van Oosterhout
kleptog@svana.org
In reply to: Kevin Grittner (#5)
Re: Where does the time go?

On Thu, Mar 23, 2006 at 12:29:27PM -0600, Kevin Grittner wrote:

Works out to about 30 microsec per node execution, which seems a
bit high for modern machines ... and the coarse quantization of the
CURRENT_TIMESTAMP results is odd too. What platform is this on
exactly?

This is a smaller machine with a copy of the full production database.
A single 3.6 GHz Xeon with 4 GB RAM running Windows Server 2003. It was
being used to test update scripts before applying them to the production
machines. I stumbled across a costing issue I thought worth posting,
and in the course of gathering data noticed this time difference I
didn't understand.

This may be the cause of the problem (Windows). On UNIX platforms the
gettimeofday() call is used to calculate the timings in both cases. On
Windows, EXPLAIN ANALYZE measures time in a different way, using the
CPU performance counters. That interface will run into issues if the
CPU speed is not properly calculated or there is drift between the
different CPUs.

Here's one person who claims that the performance counter frequency is
often wrong:

http://archives.postgresql.org/pgsql-hackers-win32/2005-03/msg00063.php

It's also been pointed out before that the code actually divides by the
wrong number (it uses GetTimerFrequency() rather than
QueryPerformanceFrequency()). If you can find the values of these two
functions on your machine, see how they compare to your actual clock
speed.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#6)
Re: Where does the time go?

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I've never used gprof before, and from a quick scan of the info, it
appears that I need to compile and link a special version of the
software to generate the file that gprof needs. Is this correct? Does
it work on a Windows build, or will I need to use Linux? Any tips?

I dunno anything about profiling on Windows. If you don't mind moving
the test case to Linux it's pretty easy:

./configure --enable-debug --whatever-other-options
make PROFILE="-pg -DLINUX_PROFILE"
install postgres executable

(On non-Linux Unixen you can omit that -D flag, but otherwise the recipe
is the same.)

Run the test case (you'll want a test script that does the same thing over
and over, enough times to build up a decent set of statistics; I like to
have about a minute's worth of accumulated CPU time in a profile run).
Exit the session --- the gmon.out file will only be dumped at backend
process exit. Then do

gprof /path/to/postgres-executable $PGDATA/gmon.out >outfile

BTW, in 8.1 you want to be sure to do this with autovacuum off, else
exit of the autovacuum process might clobber the gmon.out file before
you can run gprof.

regards, tom lane

#9 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Where does the time go?

On Wed, 2006-03-22 at 21:59 -0500, Tom Lane wrote:

I'm betting that a
big part of your issue is that the EXPLAIN ANALYZE instrumentation
overhead is (1) significant

I would like to implement an additional mode for EXPLAIN ANALYZE that
does no timing instrumentation at all. Most of the time just looking for
differences between estimated and actual row counts is all you need.

For long queries, the additional time can make the execution infeasible,
yet frequently they are the ones you want to see the output for.

EXPLAIN ANALYZE NOTIMING sounds a little dry, any ideas?

Best Regards, Simon Riggs

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#9)
Re: Where does the time go?

Simon Riggs <simon@2ndquadrant.com> writes:

I would like to implement an additional mode for EXPLAIN ANALYZE that
does no timing instrumentation at all. Most of the time just looking for
differences between estimated and actual row counts is all you need.

I don't really agree with that premise ... without timings, you cannot
for instance tell if the planner has over/underestimated the cost of an
index fetch.

regards, tom lane

#11 Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#10)
Re: Where does the time go?

On Sat, Mar 25, 2006 at 10:00:51AM -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

I would like to implement an additional mode for EXPLAIN ANALYZE that
does no timing instrumentation at all. Most of the time just looking for
differences between estimated and actual row counts is all you need.

I don't really agree with that premise ... without timings, you cannot
for instance tell if the planner has over/underestimated the cost of an
index fetch.

I agree. However, if it's the overhead of calling gettimeofday() that
slows everything down, perhaps we should tackle that end. For example,
have a sampling mode that only times say 5% of the executed nodes.

EXPLAIN ANALYZE SAMPLE blah;

And then in InstrStart have a quick test that sometimes skips the
gettimeofday for this iteration. You'd probably need some heuristics,
because you always want to catch the first iteration, but after the
10,000th tuple in an indexscan you're probably not going to learn
anything new.

You could do a non-random sampling fairly easily:

    if( ntuples < 16 )
        yes
    else if( ntuples < 16*16 && (ntuples % 16) == 0 )
        yes
    else if( ntuples < 16*16*16 && (ntuples % (16*16)) == 0 )
        yes
    else
        etc etc etc

This means that the more often a node is executed, the less often you
actually time it. Note that we store ntuples as a double, so the mod
operation won't work...
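
In C that test might look something like this (a minimal sketch with a
hypothetical helper, not actual instrument.c code; fmod() stands in for
% since ntuples is a double):

    #include <math.h>
    #include <stdbool.h>

    /* Decide whether to time this execution of a node; ntuples is a
     * double, so use fmod() rather than the integer % operator. */
    static bool
    sample_this_iteration(double ntuples)
    {
        if (ntuples < 16)
            return true;                          /* always time early executions */
        if (ntuples < 16 * 16)
            return fmod(ntuples, 16) == 0;        /* then every 16th */
        if (ntuples < 16 * 16 * 16)
            return fmod(ntuples, 16 * 16) == 0;   /* then every 256th */
        return fmod(ntuples, 16 * 16 * 16) == 0;  /* and so on */
    }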

How does this sound?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#12 Martijn van Oosterhout
kleptog@svana.org
In reply to: Martijn van Oosterhout (#11)
Re: Where does the time go?

On Sat, Mar 25, 2006 at 04:24:05PM +0100, Martijn van Oosterhout wrote:

I agree. However, if it's the overhead of calling gettimeofday() that
slows everything down, perhaps we should tackle that end. For example,
have a sampling mode that only times say 5% of the executed nodes.

EXPLAIN ANALYZE SAMPLE blah;

<snip>

You could do a non-random sampling fairly easily:

Actually, I thought of a way to control the sampling that's probably
better than the crude if-then-else structure I gave, and which also
takes advantage of the fact that the numbers are floating point:

    InstrInit:
        next_sample = 0;

    InstrStart:
        if( ntuples < 16 )
            dosample = yes;
        else if( ntuples > next_sample )
        {
            dosample = yes;
            next_sample += log2(ntuples);
        }
        else
            dosample = no;

This will sample approximately log2 of the actual executions. You could
use log10(), sqrt() or any other function you find reflects the
sampling you want.
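
As a rough C rendering of that (a sketch only, following the pseudocode
above; struct and function names are hypothetical):

    #include <math.h>
    #include <stdbool.h>

    typedef struct
    {
        double ntuples;     /* tuples seen so far this cycle */
        double next_sample; /* tuple count at which to sample next */
    } SampleState;

    static bool
    instr_should_sample(SampleState *s)
    {
        if (s->ntuples < 16)
            return true;    /* always time the first executions */
        if (s->ntuples > s->next_sample)
        {
            /* space later samples roughly log2(ntuples) apart */
            s->next_sample += log2(s->ntuples);
            return true;
        }
        return false;
    }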

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#13 Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#11)
Re: Where does the time go?

On Sat, 2006-03-25 at 16:24 +0100, Martijn van Oosterhout wrote:

I agree. However, if it's the overhead of calling gettimeofday() that
slows everything down, perhaps we should tackle that end. For example,
have a sampling mode that only times say 5% of the executed nodes.

EXPLAIN ANALYZE SAMPLE blah;

I like this idea. Why not do this all the time? I'd say we don't need
the SAMPLE clause at all, just do this for all EXPLAIN ANALYZEs.

And then in InstrStart have a quick test that sometimes skips the
gettimeofday for this iteration. You'd probably need some heuristics,
because you always want to catch the first iteration, but after the
10,000th tuple in an indexscan you're probably not going to learn
anything new.

How does this sound?

Something even simpler? First 40 plus 5% random sample after that? I'd
prefer a random sample so we have the highest level of trust in the
numbers produced. Otherwise we might accidentally introduce bias from
systematic effects such as nested loops queries speeding up towards the
end of their run. (I know we would do that at the start, but we are
stuck because we don't know the population size ahead of time and we
know we need a reasonable number of data points).
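
In code, that rule would be about one line (a sketch only; the 40 and
5% are the numbers suggested above, and plain rand() is used just for
illustration):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Time the first 40 executions, then a ~5% uniform random sample. */
    static bool
    want_sample(double ntuples)
    {
        return ntuples < 40 || rand() < RAND_MAX / 20;
    }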

Best Regards, Simon Riggs

#14 Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#13)
Re: Where does the time go?

On Sat, Mar 25, 2006 at 05:38:26PM +0000, Simon Riggs wrote:

On Sat, 2006-03-25 at 16:24 +0100, Martijn van Oosterhout wrote:

I agree. However, if it's the overhead of calling gettimeofday() that
slows everything down, perhaps we should tackle that end. For example,
have a sampling mode that only times say 5% of the executed nodes.

EXPLAIN ANALYZE SAMPLE blah;

I like this idea. Why not do this all the time? I'd say we don't need
the SAMPLE clause at all, just do this for all EXPLAIN ANALYZEs.

I was wondering about that. But then you may run into weird results if
a subselect takes a long time for just a few values. But maybe it
should be the default, with a FULL mode to say you want to measure
everything.

Something even simpler? First 40 plus 5% random sample after that? I'd
prefer a random sample so we have the highest level of trust in the
numbers produced. Otherwise we might accidentally introduce bias from
systematic effects such as nested loops queries speeding up towards the
end of their run. (I know we would do that at the start, but we are
stuck because we don't know the population size ahead of time and we
know we need a reasonable number of data points).

Well, I was wondering if a fixed percentage was appropriate. 5% of 10
million is still a lot for possibly not a lot of benefit. The follow-up
email suggested sampling that happens less and less often as the number
of tuples increases, in a logarithmic way. But we could add some
randomness; that'd be cool. The question is, what's the overhead of
calling random()?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#15 Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#13)
1 attachment(s)
[PATCH] Improve EXPLAIN ANALYZE overhead by sampling

This was a suggestion made back in March that would dramatically reduce
the overhead of EXPLAIN ANALYZE on queries that loop continuously over
the same nodes.

http://archives.postgresql.org/pgsql-hackers/2006-03/msg01114.php

What it does is behave normally for the first 50 tuples of any node,
but after that it starts sampling at ever-increasing intervals, the
intervals controlled by a power function. So for a node iterating over
1 million tuples it takes around 15,000 samples. The result is that
EXPLAIN ANALYZE has a much reduced effect on the total execution time.
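
The 15,000 figure can be checked against the interval function used in
the patch: with successive samples spaced f(x) = x^(1/3) apart, the
number of samples over N tuples is roughly

    integral(1/f(x), x=0..N) = (3/2) * N^(2/3)

so for N = 1,000,000 that is 1.5 * 10^4 = 15,000, matching the table in
the patch's comments.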

Without EXPLAIN ANALYZE:

postgres=# select count(*) from generate_series(1,1000000);
count
---------
1000000
(1 row)

Time: 2303.599 ms

EXPLAIN ANALYZE without patch:

postgres=# explain analyze select count(*) from generate_series(1,1000000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15.00..15.01 rows=1 width=0) (actual time=8022.070..8022.073 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=0) (actual time=1381.762..4873.369 rows=1000000 loops=1)
Total runtime: 8042.472 ms
(3 rows)

Time: 8043.401 ms

EXPLAIN ANALYZE with patch:

postgres=# explain analyze select count(*) from generate_series(1,1000000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15.00..15.01 rows=1 width=0) (actual time=2469.491..2469.494 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=0) (actual time=1405.002..2187.417 rows=1000000 loops=1)
Total runtime: 2496.529 ms
(3 rows)

Time: 2497.488 ms

As you can see, the overhead goes from 5.7 seconds to 0.2 seconds.
Obviously this is an extreme case, but it will probably help in a lot
of other cases people have been complaining about.

- To get this close it needs to get an estimate of the sampling overhead.
It does this by a little calibration loop that is run once per backend.
If you don't do this, you end up assuming all tuples take the same time
as tuples with the overhead, resulting in nodes apparently taking
longer than their parent nodes. Incidentally, I measured the overhead
to be about 3.6us per tuple per node on my (admittedly slightly old)
machine.

Note that the resulting times still include the overhead actually
incurred, I didn't filter it out. I want the times to remain reflecting
reality as closely as possible.

- I also removed InstrStopNodeMulti and made InstrStopNode take a tuple
count parameter instead. This is much clearer all round.

- I also didn't make it optional. I'm unsure about whether it should be
optional or not, given the number of cases where it will make a
difference to be very few.

- The tuple counter for sampling restarts every loop. Thus a node that is
called repeatedly only returning one value each time will still measure
every tuple, though its parent node won't. We'll need some field
testing to see if that remains a significant effect.

- I don't let the user know anywhere how many samples it took. Is this
something users should care about?

Any comments?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

Attachments:

explain.patch (text/plain; charset=us-ascii)
Index: src/backend/commands/trigger.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/trigger.c,v
retrieving revision 1.201
diff -c -r1.201 trigger.c
*** src/backend/commands/trigger.c	27 Apr 2006 00:33:41 -0000	1.201
--- src/backend/commands/trigger.c	9 May 2006 20:35:07 -0000
***************
*** 1306,1312 ****
  	 * one "tuple returned" (really the number of firings).
  	 */
  	if (instr)
! 		InstrStopNode(instr + tgindx, true);
  
  	return (HeapTuple) DatumGetPointer(result);
  }
--- 1306,1312 ----
  	 * one "tuple returned" (really the number of firings).
  	 */
  	if (instr)
! 		InstrStopNode(instr + tgindx, 1);
  
  	return (HeapTuple) DatumGetPointer(result);
  }
***************
*** 2154,2160 ****
  	 * one "tuple returned" (really the number of firings).
  	 */
  	if (instr)
! 		InstrStopNode(instr + tgindx, true);
  }
  
  
--- 2154,2160 ----
  	 * one "tuple returned" (really the number of firings).
  	 */
  	if (instr)
! 		InstrStopNode(instr + tgindx, 1);
  }
  
  
Index: src/backend/executor/execProcnode.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/execProcnode.c,v
retrieving revision 1.54
diff -c -r1.54 execProcnode.c
*** src/backend/executor/execProcnode.c	5 Mar 2006 15:58:26 -0000	1.54
--- src/backend/executor/execProcnode.c	9 May 2006 20:35:07 -0000
***************
*** 423,429 ****
  	}
  
  	if (node->instrument)
! 		InstrStopNode(node->instrument, !TupIsNull(result));
  
  	return result;
  }
--- 423,429 ----
  	}
  
  	if (node->instrument)
! 		InstrStopNode(node->instrument, TupIsNull(result) ? 0 : 1 );
  
  	return result;
  }
Index: src/backend/executor/instrument.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/instrument.c,v
retrieving revision 1.14
diff -c -r1.14 instrument.c
*** src/backend/executor/instrument.c	5 Mar 2006 15:58:26 -0000	1.14
--- src/backend/executor/instrument.c	9 May 2006 20:35:07 -0000
***************
*** 14,22 ****
--- 14,84 ----
  #include "postgres.h"
  
  #include <unistd.h>
+ #include <math.h>
  
  #include "executor/instrument.h"
  
+ /* This is the function that is used to determine the sampling intervals. In
+  * general, if the function is f(x), then for N tuples we will take on the
+  * order of integral(1/f(x), x=0..N) samples. Some examples follow, with the
+  * number of samples that would be collected over 1,000,000 tuples.
+ 
+   f(x) = x         =>   log2(N)                        20
+   f(x) = x^(1/2)   =>   2 * N^(1/2)                  2000
+   f(x) = x^(1/3)   =>   1.5 * N^(2/3)               15000
+   
+  * I've chosen the last one as it seems to provide a good compromise between
+  * low overhead but still getting a meaningful number of samples. However,
+  * not all machines have the cbrt() function so on those we substitute
+  * sqrt(). The difference is not very significant in the tests I made.
+ */ 
+ #ifdef HAVE_CBRT
+ #define SampleFunc cbrt
+ #else
+ #define SampleFunc sqrt
+ #endif
+ 
+ #define SAMPLE_THRESHOLD 50
+ 
+ static double SampleOverhead;
+ static bool SampleOverheadCalculated;
+ 
+ static void
+ CalculateSampleOverhead()
+ {
+ 	Instrumentation instr;
+ 	int i;
+ 	
+ 	/* We want to determine the sampling overhead, to correct
+ 	 * calculations later. This only needs to be done once per backend.
+ 	 * Is this the place? A wrong value here (due to a mistimed
+ 	 * task-switch) will cause bad calculations later.
+ 	 *
+ 	 * To minimize the risk we do it a few times and take the lowest.
+ 	 */
+ 	
+ 	SampleOverhead = 1.0e6;
+ 	
+ 	for( i = 0; i<5; i++ )
+ 	{
+ 		int j;
+ 		double overhead;
+ 		
+ 		memset( &instr, 0, sizeof(instr) );
+ 	
+ 		/* Loop SAMPLE_THRESHOLD times or 100 microseconds, whichever is faster */
+ 		for( j=0; j<SAMPLE_THRESHOLD && INSTR_TIME_GET_DOUBLE(instr.counter) < 100e-6; i++ )
+ 		{
+ 			InstrStartNode( &instr );
+ 			InstrStopNode( &instr, 1 );
+ 		}
+ 		overhead = INSTR_TIME_GET_DOUBLE(instr.counter) / instr.samplecount;
+ 		if( overhead < SampleOverhead )
+ 			SampleOverhead = overhead;
+ 	}
+ 		
+ 	SampleOverheadCalculated = true;
+ }
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
***************
*** 25,31 ****
  	Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
  
  	/* we don't need to do any initialization except zero 'em */
! 
  	return instr;
  }
  
--- 87,96 ----
  	Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
  
  	/* we don't need to do any initialization except zero 'em */
! 	
! 	/* Calculate overhead, if not done yet */
! 	if( !SampleOverheadCalculated )
! 		CalculateSampleOverhead();
  	return instr;
  }
  
***************
*** 34,82 ****
  InstrStartNode(Instrumentation *instr)
  {
  	if (INSTR_TIME_IS_ZERO(instr->starttime))
! 		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
  }
  
  /* Exit from a plan node */
  void
! InstrStopNode(Instrumentation *instr, bool returnedTuple)
  {
  	instr_time	endtime;
  
  	/* count the returned tuples */
! 	if (returnedTuple)
! 		instr->tuplecount += 1;
  
! 	if (INSTR_TIME_IS_ZERO(instr->starttime))
  	{
! 		elog(DEBUG2, "InstrStopNode called without start");
! 		return;
! 	}
  
! 	INSTR_TIME_SET_CURRENT(endtime);
  
  #ifndef WIN32
! 	instr->counter.tv_sec += endtime.tv_sec - instr->starttime.tv_sec;
! 	instr->counter.tv_usec += endtime.tv_usec - instr->starttime.tv_usec;
  
! 	/* Normalize after each add to avoid overflow/underflow of tv_usec */
! 	while (instr->counter.tv_usec < 0)
! 	{
! 		instr->counter.tv_usec += 1000000;
! 		instr->counter.tv_sec--;
! 	}
! 	while (instr->counter.tv_usec >= 1000000)
! 	{
! 		instr->counter.tv_usec -= 1000000;
! 		instr->counter.tv_sec++;
! 	}
  #else							/* WIN32 */
! 	instr->counter.QuadPart += (endtime.QuadPart - instr->starttime.QuadPart);
  #endif
  
! 	INSTR_TIME_SET_ZERO(instr->starttime);
  
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
--- 99,167 ----
  InstrStartNode(Instrumentation *instr)
  {
  	if (INSTR_TIME_IS_ZERO(instr->starttime))
! 	{
! 		/* We always sample the first SAMPLE_THRESHOLD tuples, so small nodes are always accurate */
! 		if (instr->tuplecount < SAMPLE_THRESHOLD)
! 			instr->sampling = true;
! 		else
! 		{
! 			/* Otherwise we go to sampling, see the comments on SampleFunc at the top of the file */
! 			if( instr->tuplecount > instr->nextsample )
! 			{
! 				instr->sampling = true;
! 				/* The doubling is so the random will average 1 over time */
! 				instr->nextsample += 2.0 * SampleFunc(instr->tuplecount) * (double)rand() / (double)RAND_MAX;
! 			}
! 		}
! 		if (instr->sampling)
! 			INSTR_TIME_SET_CURRENT(instr->starttime);
! 	}
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
  }
  
  /* Exit from a plan node */
  void
! InstrStopNode(Instrumentation *instr, double nTuples)
  {
  	instr_time	endtime;
  
  	/* count the returned tuples */
! 	instr->tuplecount += nTuples;
  
! 	if (instr->sampling)
  	{
! 		if (INSTR_TIME_IS_ZERO(instr->starttime))
! 		{
! 			elog(DEBUG2, "InstrStopNode called without start");
! 			return;
! 		}
  
! 		INSTR_TIME_SET_CURRENT(endtime);
  
  #ifndef WIN32
! 		instr->counter.tv_sec += endtime.tv_sec - instr->starttime.tv_sec;
! 		instr->counter.tv_usec += endtime.tv_usec - instr->starttime.tv_usec;
  
! 		/* Normalize after each add to avoid overflow/underflow of tv_usec */
! 		while (instr->counter.tv_usec < 0)
! 		{
! 			instr->counter.tv_usec += 1000000;
! 			instr->counter.tv_sec--;
! 		}
! 		while (instr->counter.tv_usec >= 1000000)
! 		{
! 			instr->counter.tv_usec -= 1000000;
! 			instr->counter.tv_sec++;
! 		}
  #else							/* WIN32 */
! 		instr->counter.QuadPart += (endtime.QuadPart - instr->starttime.QuadPart);
  #endif
  
! 		INSTR_TIME_SET_ZERO(instr->starttime);
! 		instr->samplecount += nTuples;
! 		instr->sampling = false;
! 	}
  
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
***************
*** 86,102 ****
  	}
  }
  
- /* As above, but count multiple tuples returned at once */
- void
- InstrStopNodeMulti(Instrumentation *instr, double nTuples)
- {
- 	/* count the returned tuples */
- 	instr->tuplecount += nTuples;
- 
- 	/* delegate the rest */
- 	InstrStopNode(instr, false);
- }
- 
  /* Finish a run cycle for a plan node */
  void
  InstrEndLoop(Instrumentation *instr)
--- 171,176 ----
***************
*** 114,121 ****
  	totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
  
  	instr->startup += instr->firsttuple;
! 	instr->total += totaltime;
  	instr->ntuples += instr->tuplecount;
  	instr->nloops += 1;
  
  	/* Reset for next cycle (if any) */
--- 188,211 ----
  	totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
  
  	instr->startup += instr->firsttuple;
! 
! 	/* Here we take into account sampling effects. Doing it naively ends
! 	 * up assuming the sampling overhead applies to all tuples, even the
! 	 * ones we didn't measure. We've calculated an overhead, so we
! 	 * subtract that for all samples we didn't measure. The first tuple
! 	 * is also special cased, because it usually takes longer. */
! 
! 	if( instr->samplecount < instr->tuplecount )
! 	{
! 		double pertuple = (totaltime - instr->firsttuple) / (instr->samplecount - 1);
! 		instr->total += instr->firsttuple + (pertuple * (instr->samplecount - 1))
! 		                                  + ((pertuple - SampleOverhead) * (instr->tuplecount - instr->samplecount));
! 	}
! 	else
! 		instr->total += totaltime;
! 		
  	instr->ntuples += instr->tuplecount;
+ 	instr->nsamples += instr->samplecount;
  	instr->nloops += 1;
  
  	/* Reset for next cycle (if any) */
***************
*** 123,127 ****
--- 213,218 ----
  	INSTR_TIME_SET_ZERO(instr->starttime);
  	INSTR_TIME_SET_ZERO(instr->counter);
  	instr->firsttuple = 0;
+ 	instr->samplecount = 0;
  	instr->tuplecount = 0;
  }
Index: src/backend/executor/nodeBitmapAnd.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapAnd.c,v
retrieving revision 1.6
diff -c -r1.6 nodeBitmapAnd.c
*** src/backend/executor/nodeBitmapAnd.c	5 Mar 2006 15:58:26 -0000	1.6
--- src/backend/executor/nodeBitmapAnd.c	9 May 2006 20:35:07 -0000
***************
*** 163,169 ****
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNodeMulti(node->ps.instrument, 0 /* XXX */ );
  
  	return (Node *) result;
  }
--- 163,169 ----
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
  
  	return (Node *) result;
  }
Index: src/backend/executor/nodeBitmapIndexscan.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapIndexscan.c,v
retrieving revision 1.17
diff -c -r1.17 nodeBitmapIndexscan.c
*** src/backend/executor/nodeBitmapIndexscan.c	5 Mar 2006 15:58:26 -0000	1.17
--- src/backend/executor/nodeBitmapIndexscan.c	9 May 2006 20:35:07 -0000
***************
*** 112,118 ****
  
  	/* must provide our own instrumentation support */
  	if (node->ss.ps.instrument)
! 		InstrStopNodeMulti(node->ss.ps.instrument, nTuples);
  
  	return (Node *) tbm;
  }
--- 112,118 ----
  
  	/* must provide our own instrumentation support */
  	if (node->ss.ps.instrument)
! 		InstrStopNode(node->ss.ps.instrument, nTuples);
  
  	return (Node *) tbm;
  }
Index: src/backend/executor/nodeBitmapOr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapOr.c,v
retrieving revision 1.5
diff -c -r1.5 nodeBitmapOr.c
*** src/backend/executor/nodeBitmapOr.c	5 Mar 2006 15:58:26 -0000	1.5
--- src/backend/executor/nodeBitmapOr.c	9 May 2006 20:35:07 -0000
***************
*** 179,185 ****
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNodeMulti(node->ps.instrument, 0 /* XXX */ );
  
  	return (Node *) result;
  }
--- 179,185 ----
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
  
  	return (Node *) result;
  }
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.101
diff -c -r1.101 nodeHash.c
*** src/backend/executor/nodeHash.c	5 Mar 2006 15:58:26 -0000	1.101
--- src/backend/executor/nodeHash.c	9 May 2006 20:35:07 -0000
***************
*** 97,103 ****
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNodeMulti(node->ps.instrument, hashtable->totalTuples);
  
  	/*
  	 * We do not return the hash table directly because it's not a subtype of
--- 97,103 ----
  
  	/* must provide our own instrumentation support */
  	if (node->ps.instrument)
! 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
  
  	/*
  	 * We do not return the hash table directly because it's not a subtype of
Index: src/include/executor/instrument.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/instrument.h,v
retrieving revision 1.13
diff -c -r1.13 instrument.h
*** src/include/executor/instrument.h	5 Mar 2006 15:58:56 -0000	1.13
--- src/include/executor/instrument.h	9 May 2006 20:35:09 -0000
***************
*** 61,77 ****
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
  } Instrumentation;
  
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
! extern void InstrStopNode(Instrumentation *instr, bool returnedTuple);
! extern void InstrStopNodeMulti(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
  
  #endif   /* INSTRUMENT_H */
--- 61,81 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	double		samplecount;		/* Samples collected this cycle */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	double		nsamples;		/* # of samples taken */
+ 	/* Tracking for sampling */
+ 	bool		sampling;		/* Are we sampling this iteration */
+ 	double		nextsample;		/* The next tuplecount we're going to sample */
  } Instrumentation;
  
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
! extern void InstrStopNode(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
  
  #endif   /* INSTRUMENT_H */
#16 Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#15)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Tue, 2006-05-09 at 22:37 +0200, Martijn van Oosterhout wrote:

This was a suggestion made back in March that would dramatically reduce
the overhead of EXPLAIN ANALYZE on queries that loop continuously over
the same nodes.

http://archives.postgresql.org/pgsql-hackers/2006-03/msg01114.php

As you can see, the overhead goes from 5.7 seconds to 0.2 seconds.
Obviously this is an extreme case, but it will probably help in a lot
of other cases people have been complaining about.

This seems much more useful behaviour than the current one. Running an
EXPLAIN ANALYZE for large queries can be a real pain, especially on a
production box which is in live use - so tuning a test tool has a
meaningful effect on other users' performance too.

There's a lot of thought gone in here, so I'd vote yes, though without
having done a detailed code review.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#17 Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#15)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Tue, May 09, 2006 at 10:37:04PM +0200, Martijn van Oosterhout wrote:

Note that the resulting times still include the overhead actually
incurred, I didn't filter it out. I want the times to remain reflecting
reality as closely as possible.

If we actually know the overhead I think it'd be very useful at times to
be able to remove it, especially if you're actually trying to compare to
the planner estimates. Maybe worth adding an option to the command?

- I also didn't make it optional. I'm unsure about whether it should be
optional or not, given the number of cases where it will make a
difference to be very few.

The real question is how important it is to have the real data in the
cases where it would make a difference, and I suspect we can't answer
that until this is out in the field. It *might* be worth a #define or
some other way to disable it that doesn't require patching code, but
probably not.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#18 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Kevin Grittner (#1)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

"Martijn van Oosterhout" <kleptog@svana.org> wrote

What it does is behave normally for the first 50 tuples of any node,
but after that it starts sampling at ever-increasing intervals, the
intervals controlled by a power function.

I got two questions after scanning the patch:

(1) For a node with 50 loops and another one with 50+10^3 loops, the
first one will be measured 50 times and the second one will be measured
50+10 times? I am not sure this is rational.

(2) Will this patch instrument a multinode without any sampling
interval? This is because we always use ntuples=0 for a multinode, so
the tuplecount will not change.

Maybe another way is to measure the cost of timing, then subtract it
from the result - but this is only hand-waving so far ...

Regards,
Qingqing

#19 Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#17)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Thu, May 11, 2006 at 06:37:03PM -0500, Jim C. Nasby wrote:

On Tue, May 09, 2006 at 10:37:04PM +0200, Martijn van Oosterhout wrote:

Note that the resulting times still include the overhead actually
incurred, I didn't filter it out. I want the times to remain reflecting
reality as closely as possible.

If we actually know the overhead I think it'd be very useful at times to
be able to remove it, especially if you're actually trying to compare to
the planner estimates. Maybe worth adding an option to the command?

It's not quite as easy as that, unfortunately. Each node can estimate
how much overhead was incurred on that node. However, each node also
includes as part of its timing the overhead of all its descendant
nodes. So to really remove the overhead, the top level would have to
recurse through the whole tree to decide what to remove.
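
A sketch of what that recursion could look like in C, with hypothetical
structures (this is not code from the patch):

    typedef struct PlanTimings
    {
        double total;       /* measured inclusive time, in seconds */
        double samplecount; /* instrumented executions of this node */
        int    nchildren;
        struct PlanTimings **children;
    } PlanTimings;

    /* Subtract estimated instrumentation overhead from a node and all
     * of its descendants; returns the overhead removed from this
     * subtree, since a parent's inclusive time also contains its
     * children's overhead. */
    static double
    subtract_overhead(PlanTimings *node, double per_sample_overhead)
    {
        double removed = node->samplecount * per_sample_overhead;
        int    i;

        for (i = 0; i < node->nchildren; i++)
            removed += subtract_overhead(node->children[i],
                                         per_sample_overhead);
        node->total -= removed;
        return removed;
    }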

What I'm hoping is that this patch will make the overhead so low in
normal operation that we don't need to go to that kind of effort.

- I also didn't make it optional. I'm unsure about whether it should be
optional or not, given the number of cases where it will make a
difference to be very few.

The real question is how important it is to have the real data in the
cases where it would make a difference, and I suspect we can't answer
that until this is out in the field. It *might* be worth a #define or
some other way to disable it that doesn't require patching code, but
probably not.

A #define is doable, though messy. The code isn't all that long anyway
so a few #ifdefs might make it confusing. But I'll see what I can do.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#20 Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#19)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Fri, 2006-05-12 at 12:22 +0200, Martijn van Oosterhout wrote:

On Thu, May 11, 2006 at 06:37:03PM -0500, Jim C. Nasby wrote:

On Tue, May 09, 2006 at 10:37:04PM +0200, Martijn van Oosterhout wrote:

Note that the resulting times still include the overhead actually
incurred, I didn't filter it out. I want the times to remain reflecting
reality as closely as possible.

If we actually know the overhead I think it'd be very useful at times to
be able to remove it, especially if you're actually trying to compare to
the planner estimates. Maybe worth adding an option to the command?

It's not quite as easy as that, unfortunately. Each node can estimate
how much overhead was incurred on that node. However, each node also
includes as part of its timing the overhead of all its descendant
nodes. So to really remove the overhead, the top level would have to
recurse through the whole tree to decide what to remove.

Agreed

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#21 Martijn van Oosterhout
kleptog@svana.org
In reply to: Qingqing Zhou (#18)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

[Sorry for the delay; I'm not subscribed, so I didn't see your message
till I checked the archive. Please CC me for a quicker response.]

I got two questions after scanning the patch:

(1) For a node with 50 loops and another one with 50+10^3 loops, the
first one will be measured 50 times and the second one will be measured
50+10 times? I am not sure this is rational.

You're miscalculating. For N tuples it samples approximately
1.5*N^(2/3), so that would be a bit less than 50+150 samples (my little
script suggests 197 samples).

$ perl -MMath::Complex -e '
    for $i (1..1050) {
        if( $i < 50 ) { $s++ }
        else {
            if( $i > $t ) { $s++; $t += cbrt($i); }
        }
    };
    print "$s\n";'
197

(2) Will this patch instrument a multinode without any sampling
interval? This is because we always use ntuples=0 for a multinode, so
the tuplecount will not change.

Well, if the tuple count always stays under 50 then it will always
sample. At the time it decides whether to sample or not (the beginning
of the node) it obviously has no idea what will be returned.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#22 Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#19)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Fri, May 12, 2006 at 12:22:54PM +0200, Martijn van Oosterhout wrote:

- I also didn't make it optional. I'm unsure about whether it should be
optional or not, given the number of cases where it will make a
difference to be very few.

The real question is how important it is to have the real data in the
cases where it would make a difference, and I suspect we can't answer
that until this is out in the field. It *might* be worth a #define or
some other way to disable it that doesn't require patching code, but
probably not.

A #define is doable, though messy. The code isn't all that long anyway
so a few #ifdefs might make it confusing. But I'll see what I can do.

If it proves messy, it's probably not worth doing. Presumably anyone
able to tweak a #define could probably apply a patch as well. If you are
going to go through the effort it probably makes the most sense to just
add the remaining syntax to make it dynamic.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#23 Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#22)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Mon, May 15, 2006 at 12:09:37AM -0500, Jim C. Nasby wrote:

On Fri, May 12, 2006 at 12:22:54PM +0200, Martijn van Oosterhout wrote:

A #define is doable, though messy. The code isn't all that long anyway
so a few #ifdefs might make it confusing. But I'll see what I can do.

If it proves messy, it's probably not worth doing. Presumably anyone
able to tweak a #define could probably apply a patch as well. If you are
going to go through the effort it probably makes the most sense to just
add the remaining syntax to make it dynamic.

Making it configurable via a GUC would be much easier than making it
optional at compile time, because then you just need to skip the tests
for 'to sample or not'. To make it optional at compile time you'd need
to actually take out all the code relating to sampling.

Maybe:

enable_explain_sample (default: yes)

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#24 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Martijn van Oosterhout (#15)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

Patch applied. Thanks.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#25 Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#24)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

On Tue, May 30, 2006 at 10:01:49AM -0400, Bruce Momjian wrote:

Patch applied. Thanks.

I note Tom made some changes to this patch after it went in. For the
record, it was always my intention that samplecount count the number of
_tuples_ returned while sampling, rather than the number of
_iterations_. I'll admit the comment in the header was wrong.

While my original patch had a small error in the case of multiple
tuples returned, it would've been correctable by counting the actual
number of samples. The way it is now, it will show a bias if the number
of tuples returned increases after the first sampled 50 tuples.
However, my knowledge of statistics isn't good enough to determine
whether this is an actual problem or not, since the way it is now will
sample more initially...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#25)
Re: [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

Martijn van Oosterhout <kleptog@svana.org> writes:

I note Tom made some changes to this patch after it went in. For the
record, it was always my intention that samplecount count the number of
_tuples_ returned while sampling, rather than the number of
_iterations_. I'll admit the comment in the header was wrong.

While my original patch had a small error in the case of multiple
tuples returned, it would've been correctable by counting the actual
number of samples. The way it is now, it will show a bias if the number
of tuples returned increases after the first sampled 50 tuples.

How so? The number of tuples doesn't enter into it at all. What the
code is now assuming is that the time per node iteration is constant.
More importantly, it's subtracting off an overhead estimate that's
measured per iteration. In the math you had before, the overhead was
effectively assumed to be per tuple, which is clearly wrong.

For nodes that return a variable number of tuples, it might be sensible
to presume that the node iteration time is roughly linear in the number
of tuples returned, but I find that debatable. In any case the sampling
overhead is certainly not dependent on how many tuples an iteration
returns.

This is all really moot at the moment, since we have only two kinds of
nodes: those that always return 1 tuple (until done) and those that
return all their tuples in a single iteration. If we ever get into
nodes that return varying numbers of tuples per iteration --- say,
exposing btree's page-at-a-time behavior at the plan node level ---
we'd have to rethink this. But AFAICS we'd need to count both tuples
and iterations to have a model that made any sense at all, so the
extra counter I added is needed anyway.

regards, tom lane