Re: slower merge join on sorted data chosen over

Started by Kevin Grittner over 20 years ago · 48 messages · hackers
#1Kevin Grittner
Kevin.Grittner@wicourts.gov

I am absolutely sure that I never changed any of the enable_xxx
settings other than enable_mergejoin during these tests. The only
other setting I played with was random_page_cost, but the nested
loop query always chose the plain index scan in my tests.

There was one odd thing. I started testing on one machine, then
we dumped it and loaded it onto a new machine, to try a slightly
different configuration for comparison. I re-ran the tests and
noticed that on both boxes, with no enable_xxx options turned off,
when I reduced the random_page_cost to 1.2 the merge case went
to a plain index scan rather than the bitmap scan. What was odd was
that on the original box, the bitmap scan version was consistently
faster, while on the new box it was slightly slower. I repeated the
runs a few times because I found that surprising, but it seemed to
hold. That seems more likely to be related to the density of the
index pages following the restore than to the difference in
hardware or OS. It wasn't a very dramatic difference either way.

I notice that on the fully cached runs, the actual time for the plain
index was (barely) faster than the bitmap heap scan:

-> Bitmap Heap Scan on "DbTranRepository" dtr (cost=297.07..47081.47 rows=25067 width=17) (actual time=30.432..427.463 rows=39690 loops=1)

-> Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr (cost=0.00..49419.45 rows=25067 width=17) (actual time=0.083..412.268 rows=39690 loops=1)

However, it doesn't seem to be predicting that result when you look
at the cost numbers, so it is a mystery. Is there anything I can set to
tell it to spit out the cost info for every path it considers, so we can
see why it is making this choice? It doesn't bias toward a lower
starting cost when it is going to use a nested index scan, does it?
(OK, "starting cost" is probably not the right argot -- I'm talking about
the "0.00" in "cost=0.00..49419.45".)

As for expectations -- an experienced programmer who knew the
schema, the data, and had just written the queries to retrieve data,
was testing the application with various selection criteria, and
saying something like "This one should be sub-second. Check.
This one should take a few seconds. Check. Here's a big one, this'll
probably take two or three minutes. Dang -- it took seven and a half
minutes. Could you look at this and see why it's so slow? It seems
like it should be able to use this index and get the job done faster."
Some might call this "gut feel"; others might describe it as putting
forward a series of hypotheses, testing each, and investigating
unexpected results. ;-)

The good news is that we don't think this sort of query will be
chosen by the users very often -- a few times a month, perhaps,
among the roughly 200 management-level people who will have
access to this application. An extra five minutes per query isn't a
deal breaker, although it would obviously be nice if it ran faster.

I'll review the archives for prior discussions of the problem.

-Kevin

Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

In both the 8.1beta2 and using a build from this morning's
dev snapshot, this query ran slower than expected:

There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.

However, something else caught my eye:

-> Bitmap Heap Scan on "DbTranRepository" dtr (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)

-> Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)

I don't understand why the second case chose a plain index scan when
there was no need for sorted output; the bitmap scan is faster both
per cost estimate and in reality. Are you sure you turned off only
enable_mergejoin and not enable_bitmapscan?

Also, when you say "slower than expected", what is setting your
expectation?

regards, tom lane

#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#1)

Thanks, Tom.

I spent a few hours trying different searches in the archives, and
found three very interesting threads on the topic. All were from
2003. Should I keep digging for more recent threads, or would
these probably represent the current state of the issue?

These left me somewhat concerned, since people were reporting
queries which ran orders of magnitude slower using merge joins
than when they forced them to nested loop index scans. In our
first brush with the issue it caused our query to run only two to
three times slower than it would if the planner could more
accurately cost the nested index scans, and it was in an area with
minimal impact to the organization, so this one is relatively benign.

We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list. Is there any sort of
estimate for how much programming work would be involved?
Any chance of an incremental improvement, such as only
counting leaf access in a nested select? (While that's not
perfect, it seems likely to be closer, and therefore beneficial
overall.)

Thanks,

-Kevin

Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>

There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.

#3Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#2)

Hmmm... With that much direction, I really have no excuse
not to try a change and provide some test cases, do I?

A couple questions occur to me, though. I'm not clear on why
ceil is called -- do we need to eliminate the fraction here?
It seems to me that it wouldn't matter much except when the
index is small, in which case the odds are greater that pages
would be cached. It seems like it would be a less radical change
to eliminate the ceil call than the increment. As a first pass, I'm
inclined to try that in combination with a reduction in the
configured cpu_index_tuple_cost. Does that seem reasonable?

Also, to be clear, I thought someone said that only the leaf level
was counted -- it looks like the other levels are being factored in,
but not as heavily as they would be accessed with logical reads,
because we're taking the fraction of tuples for the index multiplied
by the total number of index pages except for the metadata page.
This should bias the cost slightly higher for additional index levels,
shouldn't it?

I toyed with the idea of eliminating both the ceil and the increment,
and just having a minimum of 1.0, but that seemed extreme.

-Kevin

Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 4:23 PM >>>

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>

There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.

I spent a few hours trying different searches in the archives, and
found three very interesting threads on the topic. All were from
2003. Should I keep digging for more recent threads, or would
these probably represent the current state of the issue?

No, I don't recall that anything much has happened with it.

The place where the rubber hits the road is this passage in
genericcostestimate():

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index
     * is a metadata page, and we should figure on fetching that plus a
     * pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;
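For readers following along, the arithmetic in that passage can be mirrored in a short Python sketch (a paraphrase of the C logic above, not the actual planner code; the page and tuple counts here are invented):

```python
import math

def generic_index_pages(index_pages, index_tuples, num_index_tuples):
    """Mirror of genericcostestimate's page estimate: a pro-rated
    fraction of the non-metadata pages, plus the metapage, rounded up."""
    if index_pages > 1 and index_tuples > 0:
        pages = (num_index_tuples / index_tuples) * (index_pages - 1)
        pages += 1          # count the metapage too
        return math.ceil(pages)
    return 1.0

# Fetching 1% of the tuples from a 100-page index:
# (100/10000) * 99 + 1 = 1.99, which ceil() rounds up to 2 pages.
print(generic_index_pages(100, 10000, 100))  # -> 2
```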

An important point here: in all the currently supported index types,
that last assumption is ridiculously optimistic. Except maybe in a
freshly-built btree, logically consecutive index pages won't be
physically adjacent, and so an honest cost accounting would require
charging random_page_cost per page. However, that's not the direction
we want the estimate to go in :-(

We could do something based on effective_cache_size to estimate the
probability that the index page is already in cache and hence need not
be re-read. The main trouble with this is estimating what fraction of
the effective_cache_size can fairly be assumed to be populated by each
index --- without some global knowledge about what other indexes are in
heavy use, it's difficult to see how to get there from here. (OTOH,
one could also level the same charge against the existing uses of
effective_cache_size... maybe we could redefine it as cache available
per query, or something like that, to put part of the problem on the
DBA's shoulders.)

One simple tweak we could make is to not count the index metapage in the
pages-to-be-fetched estimate, which could be justified on the grounds
that the metapage is almost certain to be in cache. (Note that the
existing estimation technique is already logically wrong for btrees,
because it's considering only the metapage and the leaf page(s) that
need to be visited, and not the internal pages that need to be descended
through to get to the right leaf. However, again we can say that the
upper internal pages are probably in cache and so it's fair not to count
them as needing to be read.)
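A hypothetical before/after of that metapage tweak, again as a Python paraphrase with invented page counts (the real change would be in genericcostestimate's C code):

```python
import math

def pages_current(index_pages, index_tuples, ntuples):
    # Existing rule: pro-rated leaf fraction + metapage, rounded up.
    return math.ceil((ntuples / index_tuples) * (index_pages - 1) + 1)

def pages_no_metapage(index_pages, index_tuples, ntuples):
    # Hypothetical tweak: assume the metapage is already in cache.
    return math.ceil((ntuples / index_tuples) * (index_pages - 1))

# A lookup touching 0.5% of the tuples in a 201-page index:
# current: ceil(0.005 * 200 + 1) = 2, tweaked: ceil(0.005 * 200) = 1
print(pages_current(201, 20000, 100), pages_no_metapage(201, 20000, 100))
```

For small inner-index scans the tweak halves the page charge, which is the direction that would help nestloop plans.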

Whether this tweak would make the overall behavior better or worse is
really hard to say. It'd be interesting to look at some test cases
though.

regards, tom lane

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#2)

On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:

We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list.

Is there any sort of
estimate for how much programming work would be involved?

The main work here is actually performance testing, not programming. The
cost model is built around an understanding of the timings and costs
involved in the execution.

Once we have timings to cover a sufficiently large range of cases, we
can derive the cost model. Once derived, we can program it. Discussing
improvements to the cost model without test results is never likely to
convince people. Everybody knows the cost models can be improved; the
only questions are in what cases, and in what ways.

So deriving the cost model needs lots of trustworthy test results that
can be assessed and discussed, so we know how to improve things. [...and
I don't mean 5 minutes with pg_bench...]

Detailed analysis such as that is time consuming and also needs to be
done in a sufficiently reproducible manner that we can rely on it.

Your help would be greatly appreciated in that area. You and your team
clearly have an eye for the fine detail of these issues.

...IIRC there is a TODO item relating to that.

Perhaps we should put a more general call out on the TODO list towards
detailed, complete, accurate and reproducible performance test results?

Best Regards, Simon Riggs

#5Ilia Kantor
ilia@obnovlenie.ru
In reply to: Simon Riggs (#4)
Re: slow IN() clause for many cases

When an IN clause becomes large enough (more than 20-30 items),
it is much better to use the "join" way of processing.

I mean,
"SELECT * FROM table WHERE field IN (1,2...30)" will be slower than
"SELECT * FROM table JOIN (SRF returning 1...30) USING(field)"

I'm not quite sure where the difference starts, but sometimes I need to
make selects with 30 or more items by primary key, and I get a significant
speed-up from this transform:

CREATE OR REPLACE FUNCTION array2table(arr int[]) RETURNS SETOF int

select * from persons join (select array2table as id from
array2table(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])) a using(id);

I'm sure the backend could do this in a faster and more elegant fashion.
Bitmap-OR is nice, but for many IN arguments it is still much slower than
a join.

#6Jonah H. Harris
jonah.harris@gmail.com
In reply to: Ilia Kantor (#5)
Re: slow IN() clause for many cases

Please post an explain analyze on your query with a 20-30 item IN clause so
that we can see what plan is being generated.

On 10/11/05, Ilia Kantor <ilia@obnovlenie.ru> wrote:

When in clause becomes large enough (>20-30 cases),
It is much better to use "join" way of processing.. [...]

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#7Ilia Kantor
ilia@obnovlenie.ru
In reply to: Jonah H. Harris (#6)
Re: slow IN() clause for many cases

Please post an explain analyze on your query with a 20-30 item IN clause so

that we can see what plan is being generated.

It is bitmap-OR on multiple index(PK) lookups.

#8Josh Berkus
josh@agliodbs.com
In reply to: Ilia Kantor (#7)
Re: slow IN() clause for many cases

Ilia,

It is bitmap-OR on multiple index(PK) lookups.

Describing it doesn't help. We need an *actual* EXPLAIN ANALYZE.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#9Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#4)

You don't get to read part of a page, but you may be dealing with
probabilities. For example, consider a case where there are ten data
pages, and you are going to read 15% of the tuples. There is a 50%
chance that your scan will start in the first half of a leaf page and
only need two leaf pages. There is a 50% chance that it will start in
the second half and need to access three leaf pages. Wouldn't it be
better to cost this at 2.5 pages rather than forcing it to either of the
possible values?
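That expectation can be checked numerically; a small sketch (hypothetical, not planner code) averaging the pages touched over uniformly spaced starting offsets, for a scan spanning 1.5 pages' worth of tuples:

```python
import math

def pages_touched(start_frac, span_pages):
    """Leaf pages touched by a contiguous scan covering `span_pages`
    pages' worth of tuples, starting at fractional offset `start_frac`
    (in [0, 1)) within the first page."""
    return math.floor(start_frac + span_pages) - math.floor(start_frac) + 1

# Average over many uniformly spaced starting offsets: half the starts
# fit in 2 leaf pages, half need 3, so the expectation is 2.5.
n = 100000
avg = sum(pages_touched(i / n, 1.5) for i in range(n)) / n
print(avg)  # -> 2.5
```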

That said, a minimum of 1.0 is appropriate; that is why I was thinking
of adding the one and not using ceil.

I have gotten approval from my client to work on this costing issue.
Before I try modifying any code, however, I want to make sure I'm going
down the right path. The more I think about it, the more it seems like
fudging the fundamental cost of an index scan down so that it isn't so
far off when used in a nested context (which is, as you've pointed out,
what we're talking about here) should be avoided if at all possible.
The thought occurs to me that I could do a heck of a lot better
solution if the genericcostestimate function included an iteration
count parameter. I could generate a fairly accurate cost for the whole
set of iterations, then divide by the iteration count, so that the
value coming out could still be used as it has been. (The iteration
count would essentially be a "clue" to help determine how many physical
I/Os might be needed.) It is not immediately obvious to me how this
context info could be passed in, which is probably why it hasn't been
done already -- but if you have any suggestions in this regard, I'd be
happy to hear them.

Also, I am wondering if there is any chance that the sort/merge
technique is somehow UNDERestimated. My experience, and what I've seen
in the archives, establishes that the relative cost of the nested index
scan can be too high compared to the cost of the scan/sort/mergejoin
technique; but I haven't seen totally convincing evidence yet regarding
which one is off. You've stated that the caching for the upper index
levels isn't factored in with the nested technique, but when running
with everything totally cached for both techniques, our real-life app
runs three times as fast with the nested index scan versus only twice
as fast when neither technique starts with anything cached. Something
isn't sitting right for me when I match those results to the
explanation. What am I missing here?

Clearly, a good set of repeatable tests is critical. I'm still
thinking about that. Any suggestions on this welcome.

-Kevin

Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 8:10 PM >>>

Well, you don't get to read part of a page. In particular, fetching 1.0
index tuples requires fetching 1.0 pages, not (say) 0.01 page. We
consistently use ceil() in these sorts of estimates, and I think it's
correct.

#10Ilia Kantor
ilia@obnovlenie.ru
In reply to: Josh Berkus (#8)
Re: slow IN() clause for many cases

It is bitmap-OR on multiple index(PK) lookups.

Describing it doesn't help. We need an *actual* EXPLAIN ANALYZE.

Sure, why not..

6ms for

Bitmap Heap Scan on objects_hier (cost=60.29..179.57 rows=80 width=600)
(actual time=0.835..1.115 rows=138 loops=1)
Recheck Cond: ((id = 1) OR (id = 2) OR (id = 3) OR (id = 4) OR (id = 5)
OR (id = 6) OR (id = 7) OR (id = 8) OR (id = 9)
OR (id = 10) OR (id = 11) OR (id = 12) OR (id = 13) OR (id = 14) OR (id =
15) OR (id = 16) OR (id = 17) OR (id = 18) OR (
id = 19) OR (id = 20) OR (id = 21) OR (id = 22) OR (id = 23) OR (id = 24) OR
(id = 25) OR (id = 26) OR (id = 27) OR (id =
28) OR (id = 29) OR (id = 30))
-> BitmapOr (cost=60.29..60.29 rows=82 width=0) (actual
time=0.553..0.553 rows=0 loops=1)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.036..0.036 rows=6 loops=1)
Index Cond: (id = 1)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.044..0.044 rows=6 loops=1)
Index Cond: (id = 2)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 3)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 4)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 5)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 6)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 7)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 8)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 9)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 10)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 11)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 12)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 13)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 14)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 15)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 16)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 17)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 18)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 19)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 20)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 21)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 22)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 23)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 24)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 25)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 26)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 27)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 28)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 29)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 30)

4ms for

explain analyze select * from objects_hier join (select array2table as id from
array2table(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])) a using(id);
QUERY PLAN

----------------------------------------------------------------------------
------------------------------------------------------
Merge Join (cost=62.33..576.80 rows=1117 width=600) (actual
time=0.542..2.898 rows=138 loops=1)
Merge Cond: ("outer".id = "inner".array2table)
-> Index Scan using lkjk on objects_hier (cost=0.00..493.52 rows=1680
width=600) (actual time=0.025..1.248 rows=139 loops=1)
-> Sort (cost=62.33..64.83 rows=1000 width=4) (actual time=0.510..0.799
rows=145 loops=1)
Sort Key: array2table.array2table
-> Function Scan on array2table (cost=0.00..12.50 rows=1000
width=4) (actual time=0.081..0.141 rows=30 loops=1)

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ilia Kantor (#10)
Re: slow IN() clause for many cases

"Ilia Kantor" <ilia@obnovlenie.ru> writes:

Bitmap Heap Scan on objects_hier (cost=60.29..179.57 rows=80 width=600)
(actual time=0.835..1.115 rows=138 loops=1)

vs

Merge Join (cost=62.33..576.80 rows=1117 width=600) (actual
time=0.542..2.898 rows=138 loops=1)

Hmm, sure looks from here like the bitmap plan is faster.

regards, tom lane

#12Jonah H. Harris
jonah.harris@gmail.com
In reply to: Tom Lane (#11)
Re: slow IN() clause for many cases

true dat :)

On 10/12/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Ilia Kantor" <ilia@obnovlenie.ru> writes:

Bitmap Heap Scan on objects_hier (cost=60.29..179.57 rows=80 width=600)
(actual time=0.835..1.115 rows=138 loops=1)

vs

Merge Join (cost=62.33..576.80 rows=1117 width=600) (actual
time=0.542..2.898 rows=138 loops=1)

Hmm, sure looks from here like the bitmap plan is faster.

regards, tom lane


--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#13Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Simon Riggs (#4)
Re: slow IN() clause for many cases

On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:

When in clause becomes large enough (>20-30 cases),
It is much better to use "join" way of processing..

or even a different way of writing the IN clause.

This one is one I've used after considerable research:

select * from table
where field in (select (some_array_of_N_items)[i]
from generate_series(1,N) as s(i));

This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the inner
path.

It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.
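The prepare-once point can be illustrated by generating the two statement texts; a toy sketch (hypothetical helper names, and it only builds SQL strings rather than talking to a server; passing the item count as a second parameter is a slight variation on the literal generate_series(1,1000) above):

```python
def in_list_sql(n):
    """IN form: the statement text changes with every item count,
    so each distinct n needs its own prepared plan."""
    placeholders = ", ".join(f"${i}" for i in range(1, n + 1))
    return f"SELECT * FROM test WHERE id IN ({placeholders})"

def array_form_sql(n):
    """Array form: the item count only affects the parameter values,
    never the SQL text, so one prepared plan serves any n
    (n is deliberately unused here)."""
    return ("SELECT * FROM test WHERE id IN "
            "(SELECT ($1::int[])[i] FROM generate_series(1, $2) s(i))")

print(in_list_sql(3))                              # text grows with n
print(array_form_sql(3) == array_form_sql(1000))   # -> True
```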

As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:

test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms

compare:

test=# prepare tstplan2 as select * from test where id in
(select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms

(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#14Ilia Kantor
ilia@obnovlenie.ru
In reply to: Andrew - Supernews (#13)
Re: slow IN() clause for many cases

1) merge join is faster for me than hash join (6 vs 3-4 ms):

explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])[i] from generate_series(1,30) s(i));

Hash Join (cost=17.50..162.93 rows=223 width=600) (actual
time=1.148..32.654 rows=138 loops=1)
Hash Cond: ("outer".id =
('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30}'::integer[])["inner".i])
-> Seq Scan on objects_hier (cost=0.00..134.80 rows=1680 width=600)
(actual time=0.235..12.271 rows=1680 loops=1)
-> Hash (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
rows=30 loops=1)
-> HashAggregate (cost=15.00..17.00 rows=200 width=4) (actual
time=0.161..0.830 rows=30 loops=1)
-> Function Scan on generate_series s (cost=0.00..12.50
rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)

2) Is it possible to make the planner plan that better, e.g. use Bitmap-OR
for <20 numbers and some other method for >20 numbers (the actual number
depends on costs)?

Maybe that should be added to the TODO?

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Andrew - Supernews
Sent: Wednesday, October 12, 2005 1:41 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] slow IN() clause for many cases

or even a different way of writing the IN clause. [...]

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew - Supernews (#13)
Re: slow IN() clause for many cases

Andrew - Supernews <andrew+nonews@supernews.com> writes:

As the number of items in the IN clause increases, the planning time grows
rather radically.

I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".
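The quadratic behavior Tom describes is the usual cost of pairwise duplicate elimination; a generic illustration (not the actual create_bitmap_subplan code) of the O(N^2) shape versus a hashed O(N) alternative:

```python
def dedup_quadratic(quals):
    """O(N^2): compare each qual against everything kept so far --
    the shape of pairwise duplicate elimination."""
    kept = []
    for q in quals:
        if all(q != k for k in kept):
            kept.append(q)
    return kept

def dedup_hashed(quals):
    """O(N): one set-membership test per qual, preserving order."""
    seen, kept = set(), []
    for q in quals:
        if q not in seen:
            seen.add(q)
            kept.append(q)
    return kept

# 30 distinct quals plus one duplicate, like a large IN list:
quals = [f"id = {i}" for i in range(30)] + ["id = 5"]
assert dedup_quadratic(quals) == dedup_hashed(quals)
```

With an IN list of N literals, the pairwise version does on the order of N^2/2 comparisons during planning, which matches the radical growth in prepare time Andrew measured.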

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

Yeah, this isn't all that surprising because of the larger number of
plan nodes involved in the bitmap plan.

regards, tom lane

#16Jonah H. Harris
jonah.harris@gmail.com
In reply to: Ilia Kantor (#14)
Re: slow IN() clause for many cases

IMHO, we should first look at whether the O(N^2) behavior is needed.

On 10/12/05, Ilia Kantor <ilia@obnovlenie.ru> wrote:

1) merge join is faster for me than hash join (6 vs 3-4 ms):

explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])[i]
from generate_series(1,30) s(i));

Hash Join  (cost=17.50..162.93 rows=223 width=600) (actual time=1.148..32.654 rows=138 loops=1)
  Hash Cond: ("outer".id = ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30}'::integer[])["inner".i])
  ->  Seq Scan on objects_hier  (cost=0.00..134.80 rows=1680 width=600) (actual time=0.235..12.271 rows=1680 loops=1)
  ->  Hash  (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893 rows=30 loops=1)
        ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=0.161..0.830 rows=30 loops=1)
              ->  Function Scan on generate_series s  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)

2) Is it possible to make the planner plan that better, e.g. use Bitmap-OR
for <20 numbers and some other method for >20 numbers (the actual cutoff
depending on costs)?

Maybe that should be added to TODO ?

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Andrew -
Supernews
Sent: Wednesday, October 12, 2005 1:41 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] slow IN() clause for many cases

On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:

When the IN clause becomes large enough (>20-30 values),
it is much better to use the "join" way of processing.

or even a different way of writing the IN clause.

This one is one I've used after considerable research:

select * from table
where field in (select (some_array_of_N_items)[i]
from generate_series(1,N) as s(i));

This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the
inner path.

It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.

As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:

test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms

compare:

test=# prepare tstplan2 as select * from test where id in
(select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms

(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#17Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Simon Riggs (#4)
Re: slow IN() clause for many cases

On 2005-10-12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew - Supernews <andrew+nonews@supernews.com> writes:

As the number of items in the IN clause increases, the planning time grows
rather radically.

I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".

Well, the behaviour is definitely bad.

For comparison, on 8.0 the IN (list of 1000 items) version plans only about
four times slower than IN (select array...), and again the execution times
are comparable. But for the case of a real-world app that uses IN () a lot
(dspam), I've had reports that the performance improvements from switching
to the array method are even more substantial than my raw timings suggest.

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

Yeah, this isn't all that surprising because of the larger number of
plan nodes involved in the bitmap plan.

I think you have this backwards - the instrumentation overhead is _lower_
for the bitmap-OR plan than for the nestloop. This was also true in 8.0;
the instrumentation overhead of an index OR scan plan is lower than the
nestloop plan, though not by nearly as much.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#18Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Simon Riggs (#4)
Re: slow IN() clause for many cases

On 2005-10-12, Andrew - Supernews <andrew+nonews@supernews.com> wrote:

On 2005-10-12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew - Supernews <andrew+nonews@supernews.com> writes:

As the number of items in the IN clause increases, the planning time grows
rather radically.

I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".

Well, the behaviour is definitely bad.

For comparison, on 8.0 the IN (list of 1000 items) version plans only about
four times slower than IN (select array...), and again the execution times
are comparable. But for the case of a real-world app that uses IN () a lot
(dspam), I've had reports that the performance improvements from switching
to the array method are even more substantial than my raw timings suggest.

Trying this again, on the same data, with the latest planner change shows
that the plan time for IN (list of 1000 items) is now much better, 47ms
rather than nearly five seconds, but that's still substantially longer than
the execution time in the case where the data is in cache. I'm not sure why,
but the execution time for that query has improved too; it's now about 20%
faster for the bitmap-OR than for the array method.

With everything in cache, selecting 1000 random rows from a 200k row table,
I get:

for IN (list): planning time 47ms, execution time 27ms
array/nestloop: planning time 8ms, execution time 34ms

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew - Supernews (#18)
Re: slow IN() clause for many cases

Andrew - Supernews <andrew+nonews@supernews.com> writes:

With everything in cache, selecting 1000 random rows from a 200k row table,
I get:

for IN (list): planning time 47ms, execution time 27ms
array/nestloop: planning time 8ms, execution time 34ms

The reason for the slow planning time is that IN (list) is currently
converted (by the grammar) into an OR structure:

(x = val1) OR (x = val2) OR (x = val3) OR (x = val4) OR ...

which means that the planner has to consider each subclause separately:
discover that it can be matched to an index, build the indexscan plan,
etc. Even just looking up all the "=" operators during parsing takes a
fair amount of time :-(  The large number of plan nodes also implies
relatively slow executor startup.

It strikes me that now that we have the bitmap indexscan mechanism,
it wouldn't be hard to do better. I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie

x = ANY (ARRAY[val1,val2,val3,val4,...])

and then we could treat both OpExpr and ScalarArrayOpExpr as matching
an index when the LHS is the index key and the operator is in the
index's opclass. This wouldn't fit comfortably into a plain indexscan,
but a bitmap indexscan has already got the mechanism for ORing together
the results of several primitive indexscans (one per array element).
You just do the scans and insert all the results into your output
bitmap.
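[Editor's note: a toy Python model of that ORing step, with a hypothetical value-to-TID mapping standing in for the index; this is not executor code.]

```python
# Toy model of a bitmap index scan ORing primitive scans (one per array
# element): every matching TID is inserted into one shared bitmap, so
# duplicates across scans collapse automatically. Data is hypothetical.

toy_index = {1: [10, 42], 2: [7], 3: [42, 99]}  # value -> row TIDs

def bitmap_or_scan(index, keys):
    bitmap = set()                       # stand-in for the TID bitmap
    for k in keys:                       # one primitive indexscan per element
        bitmap.update(index.get(k, []))
    return sorted(bitmap)                # heap is then visited in TID order
```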

This approach would reduce the planner and executor-startup costs of
even a long IN list to be pretty comparable to a single index key
comparison. The actual runtime of the plan wouldn't change much,
I think, but it's the overhead that's hurting us here.

This also solves the problem mentioned earlier in the thread that
you can't make a prepared plan for varying lengths of IN-lists:
instead of using IN, you do something like
x = ANY($1::int[])
and then ship your lookup keys as a single array parameter instead of
multiple scalars.
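[Editor's note: a hedged client-side sketch of shipping the keys as one array parameter; the helper name is hypothetical, and real drivers typically do this binding for you.]

```python
# Hypothetical helper: render a Python list of ints as a PostgreSQL
# int[] literal, suitable as the single array parameter of a prepared
# "where x = any($1::int[])" query.

def pg_int_array(values):
    return "{" + ",".join(str(int(v)) for v in values) + "}"
```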

The main objection I can see is that this technique requires the
elements of the IN list to be converted to a common datatype (ie, the
element datatype of the ARRAY construct). Historically our code has
made no such assumption. However, I believe that the SQL standard
expects that conversion to happen: "x IN (list)" is defined as
"x = ANY (VALUES list)" and <table value constructor> expects all its
rows to be converted to a common rowtype. So we could point to the
standard if anyone complains that the behavior changed.

Obviously this is not happening for 8.1, but it seems very do-able for
8.2.

Comments?

regards, tom lane

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#19)
Re: slow IN() clause for many cases

I wrote:

I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie

x = ANY (ARRAY[val1,val2,val3,val4,...])

Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not got
support for arrays with null elements. So we'd have to fix that first.
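[Editor's note: a Python sketch of the SQL three-valued semantics that make NULL list elements tricky, with None modeling SQL NULL; the function is illustrative, not part of any proposal.]

```python
# Sketch of SQL three-valued logic for "x IN (v1, v2, ...)": if nothing
# matches but some element is NULL (None here), the result is unknown
# (None), not false -- which is why the ARRAY rewrite needs proper
# null-element support.

def sql_in(x, items):
    if x is None:
        return None if items else False   # empty set -> false even for NULL x
    if any(i is not None and i == x for i in items):
        return True                       # a definite match wins
    return None if any(i is None for i in items) else False
```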

regards, tom lane

#21David Fetter
david@fetter.org
In reply to: Tom Lane (#20)
#22Mark Mielke
mark@mark.mielke.cc
In reply to: Tom Lane (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Mielke (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#24)
#26Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#26)
#28Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#27)
#29Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#29)
#32Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#33Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#32)
#34Bruce Momjian
bruce@momjian.us
In reply to: Martijn van Oosterhout (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#34)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#20)
#37Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Simon Riggs (#4)
#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Jim Nasby (#37)
#39Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Simon Riggs (#38)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#36)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#40)
#42Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#41)
#43Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#41)
#44Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#43)
#45Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#43)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#44)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Martijn van Oosterhout (#44)
#48Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#47)