Re: slower merge join on sorted data chosen over
I am absolutely sure that I never changed any of the enable_xxx
settings other than enable_mergejoin during these tests. The only
other setting I played with was random_page_cost, but the nested
loop query always chose the plain index scan in my tests.
There was one odd thing. I started testing on one machine, then
we dumped it and loaded it onto a new machine, to try a slightly
different configuration for comparison. I re-ran the tests and
noticed that on both boxes, with no enable_xxx options turned off,
when I reduced the random_page_cost to 1.2 the merge case went
to a plain index scan rather than the bitmap scan. What was odd was
that on the original box, the bitmap scan version was consistently
faster, while on the new box it was slightly slower. I repeated the
runs a few times because I found that surprising, but it seemed to
hold. That seems more likely to be related to the density of the
index pages following the restore than to the difference in
hardware or OS. It wasn't a very dramatic difference either way.
I notice that on the fully cached runs, the actual time for the plain
index was (barely) faster than the bitmap heap scan:
-> Bitmap Heap Scan on "DbTranRepository" dtr (cost=297.07..47081.47 rows=25067 width=17) (actual time=30.432..427.463 rows=39690 loops=1)
-> Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr (cost=0.00..49419.45 rows=25067 width=17) (actual time=0.083..412.268 rows=39690 loops=1)
However, it doesn't seem to be predicting that result when you look
at the cost numbers, so it is a mystery. Is there anything I can set to
tell it to spit out the cost info for every path it considers, so we can
see why it is making this choice? It doesn't bias toward a lower
starting cost when it is going to use a nested index scan, does it?
(OK, "starting cost" is probably not the right argot -- I'm talking about
the "0.00" in "cost=0.00..49419.45".)
As for expectations -- an experienced programmer who knew the
schema, the data, and had just written the queries to retrieve data,
was testing the application with various selection criteria, and
saying something like "This one should be sub-second. Check.
This one should take a few seconds. Check. Here's a big one, this'll
probably take two or three minutes. Dang -- it took seven and a half
minutes. Could you look at this and see why it's so slow? It seems
like it should be able to use this index and get the job done faster."
Some might call this "gut feel"; others might describe it as putting
forward a series of hypotheses, testing each, and investigating
unexpected results. ;-)
The good news is that we don't think this sort of query will be
chosen by the users very often -- a few times a month, perhaps,
among the roughly 200 management-level people who will have
access to this application. An extra five minutes per query isn't a
deal breaker, although it would obviously be nice if it ran faster.
I'll review the archives for prior discussions of the problem.
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Under both 8.1beta2 and a build from this morning's
dev snapshot, this query ran slower than expected:
There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.
However, something else caught my eye:
-> Bitmap Heap Scan on "DbTranRepository" dtr (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)
-> Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)
I don't understand why the second case chose a plain index scan when
there was no need for sorted output; the bitmap scan is faster both
per cost estimate and in reality. Are you sure you turned off only
enable_mergejoin and not enable_bitmapscan?
Also, when you say "slower than expected", what is setting your
expectation?
regards, tom lane
Thanks, Tom.
I spent a few hours trying different searches in the archives, and
found three very interesting threads on the topic. All were from
2003. Should I keep digging for more recent threads, or would
these probably represent the current state of the issue?
These left me somewhat concerned, since people were reporting
queries which ran orders of magnitude slower using merge joins
than when they forced them to nested loop index scans. In our
first brush with the issue it caused our query to run only two to
three times slower than it would if the planner could more
accurately cost the nested index scans, and it was in an area with
minimal impact to the organization, so this one is relatively benign.
We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list. Is there any sort of
estimate for how much programming work would be involved?
Any chance of an incremental improvement, such as only
counting leaf access in a nested select? (While that's not
perfect, it seems likely to be closer, and therefore beneficial
overall.)
Thanks,
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.
Hmmm... With that much direction, I really have no excuse
not to try a change and provide some test cases, do I?
A couple questions occur to me, though. I'm not clear on why
ceil is called -- do we need to eliminate the fraction here?
It seems to me that it wouldn't matter much except when the
index is small, in which case the odds are greater that pages
would be cached. It seems like it would be a less radical change
to eliminate the ceil call than the increment. As a first pass, I'm
inclined to try that in combination with a reduction in the
configured cpu_index_tuple_cost. Does that seem reasonable?
Also, to be clear, I thought someone said that only the leaf level
was counted -- it looks like the other levels are being factored in,
but not as heavily as they would be accessed with logical reads,
because we're taking the fraction of tuples for the index multiplied
by the total number of index pages except for the metadata page.
This should bias the cost slightly higher for additional index levels,
shouldn't it?
I toyed with the idea of eliminating both the ceil and the increment,
and just having a minimum of 1.0, but that seemed extreme.
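For concreteness, the variants being weighed here can be put side by side as
plain arithmetic (a Python sketch with made-up names, not planner code; `sel`
is the fraction of index tuples fetched):

```python
import math

def pages_current(sel, index_pages):
    """Current estimate: pro-rated fraction of the non-meta pages,
    plus the metapage increment, rounded up with ceil()."""
    return math.ceil(sel * (index_pages - 1) + 1)

def pages_no_ceil(sel, index_pages):
    """First-pass idea: keep the metapage increment but drop the
    ceil(), so small fractions stay fractional."""
    return sel * (index_pages - 1) + 1

def pages_min_one(sel, index_pages):
    """The 'extreme' variant: drop both ceil() and the increment,
    with a floor of 1.0 page."""
    return max(1.0, sel * (index_pages - 1))

# Fetching 0.15% of the tuples from a 1001-page index (1.5 pages
# pro-rated): the current code charges 3 pages, no-ceil charges 2.5,
# and the minimum-of-one variant charges 1.5.
```

The gap between the variants only matters when the pro-rated page count is
small, which is exactly the small-index case where cached pages are most
likely.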
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 4:23 PM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.
I spent a few hours trying different searches in the archives, and
found three very interesting threads on the topic. All were from
2003. Should I keep digging for more recent threads, or would
these probably represent the current state of the issue?
No, I don't recall that anything much has happened with it.
The place where the rubber hits the road is this passage in
genericcostestimate():
    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index
     * is a metadata page, and we should figure on fetching that plus a
     * pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;
An important point here: in all the currently supported index types,
that last assumption is ridiculously optimistic. Except maybe in a
freshly-built btree, logically consecutive index pages won't be
physically adjacent, and so an honest cost accounting would require
charging random_page_cost per page. However, that's not the direction
we want the estimate to go in :-(
We could do something based on effective_cache_size to estimate the
probability that the index page is already in cache and hence need not
be re-read. The main trouble with this is estimating what fraction of
the effective_cache_size can fairly be assumed to be populated by each
index --- without some global knowledge about what other indexes are in
heavy use, it's difficult to see how to get there from here. (OTOH,
one could also level the same charge against the existing uses of
effective_cache_size... maybe we could redefine it as cache available
per query, or something like that, to put part of the problem on the
DBA's shoulders.)
One simple tweak we could make is to not count the index metapage in the
pages-to-be-fetched estimate, which could be justified on the grounds
that the metapage is almost certain to be in cache. (Note that the
existing estimation technique is already logically wrong for btrees,
because it's considering only the metapage and the leaf page(s) that
need to be visited, and not the internal pages that need to be descended
through to get to the right leaf. However, again we can say that the
upper internal pages are probably in cache and so it's fair not to count
them as needing to be read.)
Whether this tweak would make the overall behavior better or worse is
really hard to say. It'd be interesting to look at some test cases
though.
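To make the numbers concrete, the quoted estimate and the proposed
no-metapage tweak can be sketched as arithmetic (a Python sketch with
assumed names, not the actual C code):

```python
import math

def index_pages_current(num_index_tuples, index_tuples, index_pages):
    """The genericcostestimate() page count quoted above: a pro-rated
    fraction of the non-meta pages, plus the metapage, rounded up."""
    if index_pages > 1 and index_tuples > 0:
        pages = (num_index_tuples / index_tuples) * (index_pages - 1)
        pages += 1.0              # count the metapage too
        return math.ceil(pages)
    return 1.0

def index_pages_no_metapage(num_index_tuples, index_tuples, index_pages):
    """The simple tweak floated here: stop counting the metapage, on
    the grounds that it is almost certainly in cache."""
    if index_pages > 1 and index_tuples > 0:
        pages = (num_index_tuples / index_tuples) * (index_pages - 1)
        return math.ceil(pages)
    return 1.0

# Fetching 50 of 1,000,000 tuples from a 5,000-page index pro-rates to
# about 0.25 pages: the current code charges 2 pages, the tweak 1.
```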
regards, tom lane
On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:
We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list.
Is there any sort of
estimate for how much programming work would be involved?
The main work here is actually performance testing, not programming. The
cost model is built around an understanding of the timings and costs
involved in the execution.
Once we have timings to cover a sufficiently large range of cases, we
can derive the cost model. Once derived, we can program it. Discussing
improvements to the cost model without test results is never likely to
convince people. Everybody knows the cost models can be improved, the
only question is in what cases? and in what ways?
So deriving the cost model needs lots of trustworthy test results that
can be assessed and discussed, so we know how to improve things. [...and
I don't mean 5 minutes with pg_bench...]
Detailed analysis such as that is time consuming and also needs to be
done in a sufficiently reproducible manner that we can rely on it.
Your help would be greatly appreciated in that area. You and your team
clearly have an eye for the fine detail of these issues.
...IIRC there is a TODO item relating to that.
Perhaps we should put a more general call out on the TODO list towards
detailed, complete, accurate and reproducible performance test results?
Best Regards, Simon Riggs
When the IN clause becomes large enough (more than 20-30 items),
it is much better to use the "join" way of processing.
I mean,
"SELECT * FROM table WHERE field IN (1,2...30)" will be slower than
"SELECT * FROM table JOIN (SRF returning 1...30) USING(field)"
I'm not quite sure where the difference starts, but sometimes I need to
make selects with 30 or more items by primary key, and I get a significant
speed-up from this transform:
CREATE OR REPLACE FUNCTION array2table(arr int[]) RETURNS SETOF int
select * from persons join (select array2table as id from
array2table(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,2
3,24,25,26,27,28,29,30])) a using(id);
I'm sure that the backend could do that in a much faster and more elegant
fashion. Bitmap-OR is nice, but for many IN arguments it is still much
slower than a join.
Please post an explain analyze on your query with a 20-30 item IN clause so
that we can see what plan is being generated.
--
Respectfully,
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/
Please post an explain analyze on your query with a 20-30 item IN clause so
that we can see what plan is being generated.
It is bitmap-OR on multiple index(PK) lookups.
Ilia,
It is bitmap-OR on multiple index(PK) lookups.
Describing it doesn't help. We need an *actual* EXPLAIN ANALYZE.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
You don't get to read part of a page, but you may be dealing with
probabilities. For example, consider a case where there are ten data
pages, and you are going to read 15% of the tuples. There is a 50%
chance that your scan will start in the first half of a leaf page and
only need two leaf pages. There is a 50% chance that it will start in
the second half and need to access three leaf pages. Wouldn't it be
better to cost this at 2.5 pages rather than forcing it to either of the
possible values?
That said, a minimum of 1.0 is appropriate; that is why I was thinking
of adding the one and not using ceil.
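Under the uniform-starting-offset model in that example, the expected page
count has a simple closed form: a run of L pages' worth of entries touches
L + 1 pages on average. A quick sketch of both the closed form and a Monte
Carlo check (an illustration of the model only, not planner code):

```python
import random

def expected_pages_closed_form(run_len):
    """Expected pages touched by a contiguous run of run_len pages'
    worth of entries starting at a uniform offset within a page."""
    return run_len + 1

def expected_pages_simulated(run_len, trials=100_000, seed=42):
    """Monte Carlo check of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        start = rng.random()                # offset within the first page
        # int(start + run_len) counts page boundaries crossed by the
        # interval [start, start + run_len]; add 1 for the first page.
        total += int(start + run_len) + 1
    return total / trials

# The example above: 10 leaf pages, 15% of the tuples -> a 1.5-page
# run, which needs 2 or 3 leaf pages with equal probability: 2.5 on
# average.
```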
I have gotten approval from my client to work on this costing issue.
Before I try modifying any code, however, I want to make sure I'm going
down the right path. The more I think about it, the more it seems like
fudging the fundamental cost of an index scan down so that it isn't so
far off when used in a nested context (which is, as you've pointed out,
what we're talking about here) should be avoided if at all possible.
The thought occurs to me that I could do a heck of a lot better
solution if the genericcostestimate function included an iteration
count parameter. I could generate a fairly accurate cost for the whole
set of iterations, then divide by the iteration count, so that the
value coming out could still be used as it has been. (The iteration
count would essentially be a "clue" to help determine how many physical
I/O might be needed.) It is not immediately obvious to me how this
context info could be passed in, which is probably why it hasn't been
done already -- but if you have any suggestions in this regard, I'd be
happy to hear them.
Also, I am wondering if there is any chance that the sort/merge
technique is somehow UNDERestimated. My experience, and what I've seen
in the archives, establishes that the relative cost of the nested index
scan can be too high compared to the cost of the scan/sort/mergejoin
technique; but I haven't seen totally convincing evidence yet regarding
which one is off. You've stated that the caching for the upper index
levels isn't factored in with the nested technique, but when running
with everything totally cached for both techniques, our real-life app
runs three times as fast with the nested index scan versus only twice
as fast when neither technique starts with anything cached. Something
isn't sitting right for me when I match those results to the
explanation. What am I missing here?
Clearly, a good set of repeatable tests is critical. I'm still
thinking about that. Any suggestions on this welcome.
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 8:10 PM >>>
Well, you don't get to read part of a page. In particular, fetching 1.0
index tuples requires fetching 1.0 pages, not (say) 0.01 page. We
consistently use ceil() in these sorts of estimates, and I think it's
correct.
It is bitmap-OR on multiple index(PK) lookups.
Describing it doesn't help. We need an *actual* EXPLAIN ANALYZE.
Sure, why not..
6ms for
Bitmap Heap Scan on objects_hier (cost=60.29..179.57 rows=80 width=600)
(actual time=0.835..1.115 rows=138 loops=1)
Recheck Cond: ((id = 1) OR (id = 2) OR (id = 3) OR (id = 4) OR (id = 5)
OR (id = 6) OR (id = 7) OR (id = 8) OR (id = 9)
OR (id = 10) OR (id = 11) OR (id = 12) OR (id = 13) OR (id = 14) OR (id =
15) OR (id = 16) OR (id = 17) OR (id = 18) OR (
id = 19) OR (id = 20) OR (id = 21) OR (id = 22) OR (id = 23) OR (id = 24) OR
(id = 25) OR (id = 26) OR (id = 27) OR (id =
28) OR (id = 29) OR (id = 30))
-> BitmapOr (cost=60.29..60.29 rows=82 width=0) (actual
time=0.553..0.553 rows=0 loops=1)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.036..0.036 rows=6 loops=1)
Index Cond: (id = 1)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.044..0.044 rows=6 loops=1)
Index Cond: (id = 2)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 3)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 4)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 5)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 6)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 7)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 8)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 9)
-> Bitmap Index Scan on lkjk (cost=0.00..2.02 rows=6 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 10)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 11)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 12)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 13)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 14)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 15)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 16)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 17)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 18)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 19)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 20)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 21)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 22)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=6 loops=1)
Index Cond: (id = 23)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 24)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 25)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 26)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 27)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 28)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 29)
-> Bitmap Index Scan on lkjk (cost=0.00..2.00 rows=1 width=0)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id = 30)
4ms for
explain analyze select * from objects_hier join (select array2table as id
from
array2table(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,2
3,24,25,26,27,28,29,30])) a using(id);
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Merge Join (cost=62.33..576.80 rows=1117 width=600) (actual
time=0.542..2.898 rows=138 loops=1)
Merge Cond: ("outer".id = "inner".array2table)
-> Index Scan using lkjk on objects_hier (cost=0.00..493.52 rows=1680
width=600) (actual time=0.025..1.248 rows=139 loops=1)
-> Sort (cost=62.33..64.83 rows=1000 width=4) (actual time=0.510..0.799
rows=145 loops=1)
Sort Key: array2table.array2table
-> Function Scan on array2table (cost=0.00..12.50 rows=1000
width=4) (actual time=0.081..0.141 rows=30 loops=1)
"Ilia Kantor" <ilia@obnovlenie.ru> writes:
Bitmap Heap Scan on objects_hier (cost=60.29..179.57 rows=80 width=600)
(actual time=0.835..1.115 rows=138 loops=1)
vs
Merge Join (cost=62.33..576.80 rows=1117 width=600) (actual
time=0.542..2.898 rows=138 loops=1)
Hmm, sure looks from here like the bitmap plan is faster.
regards, tom lane
true dat :)
--
Respectfully,
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/
On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:
When in clause becomes large enough (>20-30 cases),
It is much better to use "join" way of processing..
or even a different way of writing the IN clause.
This one is one I've used after considerable research:
select * from table
where field in (select (some_array_of_N_items)[i]
from generate_series(1,N) as s(i));
This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the inner
path.
It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.
As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:
test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms
compare:
test=# prepare tstplan2 as select * from test where id in
(select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms
(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)
The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.
What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
1) merge join is faster for me than hash join (6 vs 3-4 ms):
explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
27,28,29,30])[i] from generate_series(1,30) s(i));
Hash Join (cost=17.50..162.93 rows=223 width=600) (actual
time=1.148..32.654 rows=138 loops=1)
Hash Cond: ("outer".id =
('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,2
8,29,30}'::integer[])["inner".i])
-> Seq Scan on objects_hier (cost=0.00..134.80 rows=1680 width=600)
(actual time=0.235..12.271 rows=1680 loops=1)
-> Hash (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
rows=30 loops=1)
-> HashAggregate (cost=15.00..17.00 rows=200 width=4) (actual
time=0.161..0.830 rows=30 loops=1)
-> Function Scan on generate_series s (cost=0.00..12.50
rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)
2) Is it possible to make the planner handle this better, e.g. use bitmap-OR
for fewer than ~20 numbers and some other method above that (the actual
cutoff depending on costs)?
Maybe that should be added to the TODO?
Andrew - Supernews <andrew+nonews@supernews.com> writes:
As the number of items in the IN clause increases, the planning time grows
rather radically.
I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".
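For illustration only, the quadratic pattern described here versus a linear
alternative can be sketched like this (a Python sketch over plain integers;
the real code deals with qual expression nodes, which are harder to hash):

```python
def dedup_quadratic(quals):
    """Remove duplicates with a linear membership scan per item --
    O(N^2) overall, the shape of the behavior described above."""
    out = []
    for q in quals:
        if q not in out:          # scans the output list every time
            out.append(q)
    return out

def dedup_linear(quals):
    """Same result with a hash-set membership test, O(N) overall
    (assumes the items are hashable)."""
    seen, out = set(), []
    for q in quals:
        if q not in seen:
            seen.add(q)
            out.append(q)
    return out
```

Of course, the suggestion above is cheaper still: skip the deduplication
entirely.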
The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.
Yeah, this isn't all that surprising because of the larger number of
plan nodes involved in the bitmap plan.
regards, tom lane
IMHO, we should first look at whether the O(N^2) behavior is needed.
On 10/12/05, Ilia Kantor <ilia@obnovlenie.ru> wrote:
1) merge join is faster for me then hash join (6 vs 3-4ms):
explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
27,28,29,30])[i] from generate_series(1,30) s(i));Hash Join (cost=17.50..162.93 rows=223 width=600) (actual
time=1.148..32.654 rows=138 loops=1)
Hash Cond: ("outer".id =('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,2
8,29,30}'::integer[])["inner".i])
-> Seq Scan on objects_hier (cost=0.00..134.80 rows=1680 width=600)
(actual time=0.235..12.271 rows=1680 loops=1)
-> Hash (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
rows=30 loops=1)
-> HashAggregate (cost=15.00..17.00 rows=200 width=4) (actual
time=0.161..0.830 rows=30 loops=1)
-> Function Scan on generate_series s (cost=0.00..12.50
rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)
2) Is it possible to make the planner plan that better, e.g. use Bitmap-OR
for <20 numbers and some other method for >20 numbers (the actual cutoff
depending on costs)? Maybe that should be added to TODO?
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Andrew -
Supernews
Sent: Wednesday, October 12, 2005 1:41 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] slow IN() clause for many cases
On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:
When in clause becomes large enough (>20-30 cases),
It is much better to use "join" way of processing.
...or even a different way of writing the IN clause.
This one is one I've used after considerable research:
select * from table
where field in (select (some_array_of_N_items)[i]
from generate_series(1,N) as s(i));
This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the
inner path.
It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.
As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:
test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms
compare:
test=# prepare tstplan2 as select * from test where id in
(select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms
(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)
The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.
What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
--
Respectfully,
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/
On 2005-10-12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
As the number of items in the IN clause increases, the planning time grows
rather radically.
I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".
Well, the behaviour is definitely bad.
For comparison, on 8.0 the IN (list of 1000 items) version plans only about
four times slower than IN (select array...), and again the execution times
are comparable. But for the case of a real-world app that uses IN () a lot
(dspam), I've had reports that the performance improvements from switching
to the array method are even more substantial than my raw timings suggest.
The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.
Yeah, this isn't all that surprising because of the larger number of
plan nodes involved in the bitmap plan.
I think you have this backwards - the instrumentation overhead is _lower_
for the bitmap-OR plan than for the nestloop. This was also true in 8.0;
the instrumentation overhead of an index OR scan plan is lower than the
nestloop plan, though not by nearly as much.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On 2005-10-12, Andrew - Supernews <andrew+nonews@supernews.com> wrote:
On 2005-10-12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
As the number of items in the IN clause increases, the planning time grows
rather radically.
I was looking at this yesterday. There is some O(N^2) behavior in
create_bitmap_subplan, stemming from trying to remove duplicated qual
conditions. That strikes me as a relatively useless activity, and I was
thinking the easiest answer might be to just delete that "optimization".
Well, the behaviour is definitely bad.
For comparison, on 8.0 the IN (list of 1000 items) version plans only about
four times slower than IN (select array...), and again the execution times
are comparable. But for the case of a real-world app that uses IN () a lot
(dspam), I've had reports that the performance improvements from switching
to the array method are even more substantial than my raw timings suggest.
Trying this again, on the same data, with the latest planner change shows
that the plan time for IN (list of 1000 items) is now much better, 47ms
rather than nearly five seconds, but that's still substantially longer than
the execution time in the case where the data is in cache. I'm not sure why,
but the execution time for that query has improved too; it's now about 20%
faster for the bitmap-OR than for the array method.
With everything in cache, selecting 1000 random rows from a 200k row table,
I get:
for IN (list): planning time 47ms, execution time 27ms
array/nestloop: planning time 8ms, execution time 34ms
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes:
With everything in cache, selecting 1000 random rows from a 200k row table,
I get:
for IN (list): planning time 47ms, execution time 27ms
array/nestloop: planning time 8ms, execution time 34ms
The reason for the slow planning time is that IN (list) is currently
converted (by the grammar) into an OR structure:
(x = val1) OR (x = val2) OR (x = val3) OR (x = val4) OR ...
which means that the planner has to consider each subclause separately:
discover that it can be matched to an index, build the indexscan plan,
etc. Even just looking up all the "=" operators during parsing takes a
fair amount of time :-( The large number of plan nodes also imply
relatively slow executor startup.
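For concreteness, the expansion Tom describes means these two forms are handled identically (a sketch, reusing the "test" table from earlier in the thread):

```sql
-- The grammar expands the IN list into an OR chain, so these two
-- queries produce the same plan tree and the same planning cost:
SELECT * FROM test WHERE id IN (1, 2, 3);
SELECT * FROM test WHERE id = 1 OR id = 2 OR id = 3;
```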
It strikes me that now that we have the bitmap indexscan mechanism,
it wouldn't be hard to do better. I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
and then we could treat both OpExpr and ScalarArrayOpExpr as matching
an index when the LHS is the index key and the operator is in the
index's opclass. This wouldn't fit comfortably into a plain indexscan,
but a bitmap indexscan has already got the mechanism for ORing together
the results of several primitive indexscans (one per array element).
You just do the scans and insert all the results into your output
bitmap.
This approach would reduce the planner and executor-startup costs of
even a long IN list to be pretty comparable to a single index key
comparison. The actual runtime of the plan wouldn't change much,
I think, but it's the overhead that's hurting us here.
This also solves the problem mentioned earlier in the thread that
you can't make a prepared plan for varying lengths of IN-lists:
instead of using IN, you do something like
x = ANY($1::int[])
and then ship your lookup keys as a single array parameter instead of
multiple scalars.
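A sketch of that prepared-plan usage (the statement name and key values are hypothetical; "test" is the table used earlier in the thread):

```sql
-- Prepare once with a single int[] parameter; the same plan is
-- reused no matter how many lookup keys each execution supplies.
PREPARE lookup(int[]) AS
  SELECT * FROM test WHERE id = ANY ($1::int[]);

EXECUTE lookup(ARRAY[1, 2, 3]);
EXECUTE lookup(ARRAY[42, 99, 12345]);
```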
The main objection I can see is that this technique requires the
elements of the IN list to be converted to a common datatype (ie, the
element datatype of the ARRAY construct). Historically our code has
made no such assumption. However, I believe that the SQL standard
expects that conversion to happen: "x IN (list)" is defined as
"x = ANY (VALUES list)" and <table value constructor> expects all its
rows to be converted to a common rowtype. So we could point to the
standard if anyone complains that the behavior changed.
Obviously this is not happening for 8.1, but it seems very do-able for
8.2.
Comments?
regards, tom lane
I wrote:
I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not got
support for arrays with null elements. So we'd have to fix that first.
regards, tom lane
On Fri, Oct 14, 2005 at 07:09:17PM -0400, Tom Lane wrote:
I wrote:
I'm thinking that IN should be converted to a ScalarArrayOpExpr,
ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not
got support for arrays with null elements. So we'd have to fix
that first.
+1 on fixing that. :)
Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778
Remember to vote!
On Fri, Oct 14, 2005 at 07:09:17PM -0400, Tom Lane wrote:
I wrote:
I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not got
support for arrays with null elements. So we'd have to fix that first.
Hey Tom.
Obviously your area of expertise, so please tell me where I'm wrong -
But doesn't "x IN (NULL)" already fail to ever match, similar to "x
= NULL"? (Assuming that compatibility switch isn't enabled)
So, I'd hope people weren't using such an expression? :-) Or is that
not good enough? What does NULL::text[] turn into? An empty string? Is
that the danger? It might match against an empty string?
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
mark@mark.mielke.cc writes:
But doesn't "x IN (NULL)" already fail to ever match, similar to "x
= NULL"? (Assuming that compatibility switch isn't enabled)
The case I'm worried about is "x IN (1,2,NULL)". This *can* match.
Also, it's supposed to return NULL (not FALSE) if it doesn't match.
So simply discarding NULLs before we build the array wouldn't give
the right answer.
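The semantics Tom describes can be checked directly; a quick sketch:

```sql
SELECT 2 IN (1, 2, NULL);  -- true: the match wins despite the NULL
SELECT 3 IN (1, 2, NULL);  -- NULL, not false: the comparison against
                           -- NULL leaves the overall result unknown
```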
Considering that allowing null array entries has been a big wishlist
item for many years, I don't think that anyone will object to fixing
that in 8.2, anyway ...
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
It strikes me that now that we have the bitmap indexscan mechanism,
it wouldn't be hard to do better. I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
and then we could treat both OpExpr and ScalarArrayOpExpr as matching
an index when the LHS is the index key and the operator is in the
index's opclass. This wouldn't fit comfortably into a plain indexscan,
but a bitmap indexscan has already got the mechanism for ORing together
the results of several primitive indexscans (one per array element).
You just do the scans and insert all the results into your output
bitmap.
Would this mean it would be impossible to get a fast-start plan for an IN
expression though?
I would fear queries like
SELECT * from tab WHERE x IN (1,2,3) LIMIT 1
Which ought to be instantaneous would become potentially slow.
(Actually I'm more interested in cases where instead of LIMIT 1 it's the
client that will stop as soon as it finds the record it's really looking for.)
--
greg
Greg Stark <gsstark@mit.edu> writes:
I would fear queries like
SELECT * from tab WHERE x IN (1,2,3) LIMIT 1
Which ought to be instantaneous would become potentially slow.
Why? They certainly wouldn't be any slower than they are now.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
I would fear queries like
SELECT * from tab WHERE x IN (1,2,3) LIMIT 1
Which ought to be instantaneous would become potentially slow.
Why? They certainly wouldn't be any slower than they are now.
Well currently they get rewritten to use OR which does a single index scan
which I assumed returned rows as soon as it finds them like it does for
regular range lookup index scans. Is that assumption wrong?
The bitmap scan has to traverse all the index entries for matching records
before it can fetch the first record. So it wouldn't be a fast-start plan. Not
as bad as performing a sort step or anything like that of course.
--
greg
Greg Stark <gsstark@mit.edu> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Why? They certainly wouldn't be any slower than they are now.
Well currently they get rewritten to use OR which does a single index scan
Not in 8.1 it doesn't; that code is long gone.
2005-04-24 21:30 tgl
Remove support for OR'd indexscans internal to a single IndexScan
plan node, as this behavior is now better done as a bitmap OR
indexscan. This allows considerable simplification in
nodeIndexscan.c itself as well as several planner modules concerned
with indexscan plan generation. Also we can improve the sharing of
code between regular and bitmap indexscans, since they are now
working with nigh-identical Plan nodes.
The bitmap scan has to traverse all the index entries for matching records
before it can fetch the first record. So it wouldn't be a fast-start
plan.
If the fraction of records matching the IN-list is so large as to make
that an issue, I'd think the planner would prefer a seqscan anyway.
Besides which, it's a bit silly to worry about whether a plan is
fast-start without taking into account the amount of planner time needed
to create it.
Another point here is that LIMIT without any ORDER BY isn't an amazingly
useful case. Neither the old implementation of OR indexscans nor the
new can produce ordered output, which means you're not going to get
fast-start behavior anyway for real queries.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
If the fraction of records matching the IN-list is so large as to make
that an issue, I'd think the planner would prefer a seqscan anyway.
Besides which, it's a bit silly to worry about whether a plan is
fast-start without taking into account the amount of planner time needed
to create it.
Well I'm thinking about the cases where there's a short IN-list but many
records match those clauses. Something like:
WHERE user_status IN ('active','pending')
Where there could be thousands of matching records.
Another point here is that LIMIT without any ORDER BY isn't an amazingly
useful case. Neither the old implementation of OR indexscans nor the
new can produce ordered output, which means you're not going to get
fast-start behavior anyway for real queries.
That's true. That's why I was wondering more about cases where the client end
was going to read all the records until it found the record it's looking for
or found enough records for its purposes.
There are other uses of fast-start as well. If the records are going to be put
through a slow process it can be more efficient to pipeline them through while
the later records are still coming from the database.
But I guess this is all moot if the option isn't there, at least not without a
lot more programming.
The example above raises another idea though. Would it be possible for the
optimizer to recognize when a clause is so expansive that it would be faster
to read the complement than the actual clause as written?
That could be useful with bitmap operations if it meant less time reading the
index to build the bitmap but the eventual result after the bitmap operations
is restrictive enough to justify an index scan.
eg a case where 90% of users are active, and 10% have pending set and there's
an index on each of these:
WHERE user_status = 'active' AND pending = true
If the optimizer tries to do a bitmap index scan now it has to read in a huge
index; it's probably better off just using the pending index alone. But if it
realizes that it can use the index to find the tuples that *aren't* active and
then take the complement of that bitmap it could build a bitmap reading in
only 10% of that index.
--
greg
On Sun, Oct 16, 2005 at 12:03:33PM -0400, Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Another point here is that LIMIT without any ORDER BY isn't an amazingly
useful case. Neither the old implementation of OR indexscans nor the
new can produce ordered output, which means you're not going to get
fast-start behavior anyway for real queries.
That's true. That's why I was wondering more about cases where the client end
was going to read all the records until it found the record it's looking for
or found enough records for its purposes.
I would argue that the client should simply ask for what it wants
rather than filtering on the client end. Then PostgreSQL has the info
to optimise appropriately.
There are other uses of fast-start as well. If the records are going to be put
through a slow process it can be more efficient to pipeline them through while
the later records are still coming from the database.
Well, IIRC if you create a cursor or use LIMIT, PostgreSQL will prefer
fast-start plans if that gets the requested result faster.
The example above raises another idea though. Would it be possible for the
optimizer to recognize when a clause is so expansive that it would be faster
to read the complement than the actual clause as written?
That could be useful with bitmap operations if it meant less time reading the
index to build the bitmap but the eventual result after the bitmap operations
is restrictive enough to justify an index scan.
The problem is, the bitmaps are lossy. If many matches point to
about the same area of the table, they are coalesced into a single
block. Hence, you cannot meaningfully ask for a complement.
However, if you knew from the beginning that you wanted the complement
you may be able to arrange something. OTOH, there's no way to tell from
the index whether it has referred to *all* the tuples in a page, so there
is no way to exclude a page from consideration.
eg a case where 90% of users are active, and 10% have pending set and there's
an index on each of these:
WHERE user_status = 'active' AND pending = true
If the optimizer tries to a bitmap index scan now it has to read in a huge
index; it's probably better off just using the pending index alone. But if it
realizes that it can use the index to find the tuples that *aren't* active and
then take the complement of that bitmap it could build a bitmap reading in
only 10% of that index.
As I pointed out above, I don't think you can take the complement of a
bitmap. Besides, I haven't looked at the costings for bitmap scans but
ISTM that if the stats indicate a 90% coverage for user_status and the
index is not highly correlated, it should exclude that index from
consideration altogether. And whether it uses a normal index scan or a
bitmap scan for pending also depends on the stats. It's the comparison
between scanning the whole index and then the pages in sequence versus
just grabbing the tuples from the index.
Besides, the case you provide is the perfect candidate for a partial
index. Attributes that only apply to a small fraction of the table are
better off as predicates if you're going to be searching for them a
lot.
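A sketch of that partial-index idea, using the example's columns (the table name is hypothetical):

```sql
-- Index only the ~10% of rows with pending = true; the planner can
-- use it whenever the query's WHERE clause implies the predicate.
CREATE INDEX users_pending_idx ON users (user_status)
  WHERE pending = true;

-- This query qualifies for the partial index:
SELECT * FROM users
 WHERE user_status = 'active' AND pending = true;
```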
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Greg Stark <gsstark@mit.edu> writes:
The example above raises another idea though. Would it be possible for the
optimizer to recognize when a clause is so expansive that it would be faster
to read the complement than the actual clause as written?
Being able to compute the complement, much less do so with an indexable
clause, is usually difficult in SQL (think about NULLs). In any case
I think this is the situation where you are happy to fall back to a
seqscan.
regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
On Sun, Oct 16, 2005 at 12:03:33PM -0400, Greg Stark wrote:
That's true. That's why I was wondering more about cases where the client end
was going to read all the records until it found the record it's looking for
or found enough records for its purposes.
I would argue that the client should simply ask for what it wants
rather than filtering on the client end. Then PostgreSQL has the info
to optimise appropriately.
Certainly, if you do not supply a LIMIT, there is no justification
at all for expecting the planner to prefer fast-start over
minimum-total-cost.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Martijn van Oosterhout <kleptog@svana.org> writes:
On Sun, Oct 16, 2005 at 12:03:33PM -0400, Greg Stark wrote:
That's true. That's why I was wondering more about cases where the client end
was going to read all the records until it found the record it's looking for
or found enough records for its purposes.
I would argue that the client should simply ask for what it wants
rather than filtering on the client end. Then PostgreSQL has the info
to optimise appropriately.
Certainly, if you do not supply a LIMIT, there is no justification
at all for expecting the planner to prefer fast-start over
minimum-total-cost.
Well figuring out when to prefer one or the other is a hard problem.
Fundamentally the server simply does not have the information it needs to
determine that.
(I think there really ought to be a bit in the protocol that the client sends
with the query to indicate which is needed. That would be cleaner than
Oracle's /*+ FIRST_ROW */ and /*+ ALL_ROWS */ hints.)
But having it as an option is a separate question. Even if the server needs
some cajoling to actually choose the right one it's always a good thing if
it's at least possible.
--
greg
On Sun, Oct 16, 2005 at 05:09:57PM -0400, Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Certainly, if you do not supply a LIMIT, there is no justification
at all for expecting the planner to prefer fast-start over
minimum-total-cost.
Well figuring out when to prefer one or the other is a hard problem.
Fundamentally the server simply does not have the information it needs to
determine that.
Umm, not really. Notice how EXPLAIN has two numbers: time to first row,
time to last row. If you add limit 1 it will favour plans that return
the first row quickly. If you don't it'll favour plans that have the
lowest total execution time, even if the first tuple takes longer.
(I think there really ought to be a bit in the protocol that the client sends
with the query to indicate which is needed. That would be cleaner than
Oracle's /*+ FIRST_ROW */ and /*+ ALL_ROWS */ hints.)
It's called LIMIT and has been supported for a long time.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes:
(I think there really ought to be a bit in the protocol that the client sends
with the query to indicate which is needed. That would be cleaner than
Oracle's /*+ FIRST_ROW */ and /*+ ALL_ROWS */ hints.)
It's called LIMIT and has been supported for a long time.
And if I *don't* want to limit the number of rows I retrieve?
--
greg
Greg Stark <gsstark@mit.edu> writes:
Martijn van Oosterhout <kleptog@svana.org> writes:
It's called LIMIT and has been supported for a long time.
And if I *don't* want to limit the number of rows I retriev?
You still need to do something proactive to inform the system that you
want a fast-start plan. What's more, you really really do not want to
give it an unlimited license to prefer zero-start-cost plans, else you
might still be waiting for the third row when hell freezes over.
In the current system structure, the only very reasonable way to set up
incremental client processing of a query result is to use a cursor and
FETCH a few rows at a time from it. The planner already has a hack to
give some preference to fast-start plans when planning DECLARE CURSOR.
My recollection is that it optimizes on the assumption that 10% of the
rows will be retrieved, which gives some balance between start speed and
not blowing out the total runtime unreasonably. We've already had some
discussion about exposing that 10% figure as a GUC variable, so that you
could put a finger on the scale depending on how much of the result you
expected to fetch. But nobody got excited enough to make it happen...
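The cursor-plus-FETCH pattern Tom mentions, as a sketch (table and column names hypothetical):

```sql
BEGIN;
-- DECLARE CURSOR gets the planner's fast-start preference
-- (it optimizes as if ~10% of the rows will be retrieved)
DECLARE c CURSOR FOR
  SELECT * FROM bigtable WHERE word = 'foo';
FETCH 10 FROM c;  -- process a batch, repeat as needed
CLOSE c;
COMMIT;
```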
regards, tom lane
On Fri, 2005-10-14 at 19:09 -0400, Tom Lane wrote:
I wrote:
I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not got
support for arrays with null elements. So we'd have to fix that first.
You'd also need to consider how this affects partial indexes and
constraint exclusion. Not much of an issue, but an extra case to handle
in the predicate proving code.
= = =
Just had a case where using an IN list was quicker than using a join
because it allowed an index lookup to occur. There is also some clear
mileage in transforming this type of query to a more plannable form:
select * from bigtable where word IN (
select word from customer_word where customer = 6)
i.e. where the values for the IN clause are evaluated at run time,
rather than at plan time.
Best Regards, Simon Riggs
On Tue, Oct 11, 2005 at 10:58:58AM +0100, Simon Riggs wrote:
On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:
We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list. Is there any sort of
estimate for how much programming work would be involved?
The main work here is actually performance testing, not programming. The
cost model is built around an understanding of the timings and costs
involved in the execution.
Once we have timings to cover a sufficiently large range of cases, we
can derive the cost model. Once derived, we can program it. Discussing
improvements to the cost model without test results is never likely to
convince people. Everybody knows the cost models can be improved; the
only question is: in what cases? and in what ways?
So deriving the cost model needs lots of trustworthy test results that
can be assessed and discussed, so we know how to improve things. [...and
I don't mean 5 minutes with pg_bench...]
Detailed analysis such as that is time consuming and also needs to be
done in a sufficiently reproducible manner that we can rely on it.
Your help would be greatly appreciated in that area. You and your team
clearly have an eye for the fine detail of these issues.
...IIRC there is a TODO item relating to that.
Perhaps we should put a more general call out on the TODO list towards
detailed, complete, accurate and reproducible performance test results?
I touched on some of this in
http://archives.postgresql.org/pgsql-performance/2005-05/msg00336.php:
"In terms of a testing system, here's what I'm thinking of. For each cost
estimate, there will be a number of input variables we want to vary, and
then check to see how changes in them effect run time. Using index scan
as a simple example, 1st order variables will likely be index and table
size (especially in relation to cache size), and correlation. So, we
need some kind of a test harness that can vary these variables
(preferably one at a time), and run a test case after each change. It
would then need to store the timing info, possibly along with other
information (such as explain output). If I'm the one to write this it'll
end up in perl, since that's the only language I know that would be able
to accomplish this. DBT seems to be a reasonable test database to do
this testing with, especially since it would provide a ready means to
provide external load."
Of course, this work can be done by hand, but it's a very slow, tedious
process. It's also rather error-prone.
There's been some discussion on the osdl-dbt mailing lists about
providing a front-end that would allow for scheduling tests and storing
results in a database where you could data-mine more easily than you
currently can (currently everything's just stored as files on a disk
somewhere). ISTM that having that would make this kind of testing much
easier to do. But I've also been working with dbt quite a bit recently,
so it's my hammer that makes everything performance related look like a
nail...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Mon, 2005-10-17 at 14:55 -0500, Jim C. Nasby wrote:
On Tue, Oct 11, 2005 at 10:58:58AM +0100, Simon Riggs wrote:
On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:
We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list. Is there any sort of
estimate for how much programming work would be involved?
The main work here is actually performance testing, not programming. The
cost model is built around an understanding of the timings and costs
involved in the execution.
Once we have timings to cover a sufficiently large range of cases, we
can derive the cost model. Once derived, we can program it. Discussing
improvements to the cost model without test results is never likely to
convince people. Everybody knows the cost models can be improved; the
only question is: in what cases? and in what ways?
So deriving the cost model needs lots of trustworthy test results that
can be assessed and discussed, so we know how to improve things. [...and
I don't mean 5 minutes with pg_bench...]
...
DBT seems to be a reasonable test database
I was discussing finding the cost equations to use within the optimizer
based upon a series of exploratory tests using varying data. That is
different to using the same database with varying parameters. Both sound
interesting, but it is the former that, IMHO, would be the more
important.
Best Regards, Simon Riggs
On Mon, Oct 17, 2005 at 09:30:24PM +0100, Simon Riggs wrote:
On Mon, 2005-10-17 at 14:55 -0500, Jim C. Nasby wrote:
On Tue, Oct 11, 2005 at 10:58:58AM +0100, Simon Riggs wrote:
On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:
We are looking at doing much more with PostgreSQL over the
next two years, and it seems likely that this issue will come up
again where it is more of a problem. It sounded like there was
some agreement on HOW this was to be fixed, yet I don't see
any mention of doing it in the TODO list. Is there any sort of
estimate for how much programming work would be involved?

The main work here is actually performance testing, not programming. The
cost model is built around an understanding of the timings and costs
involved in the execution.

Once we have timings to cover a sufficiently large range of cases, we
can derive the cost model. Once derived, we can program it. Discussing
improvements to the cost model without test results is never likely to
convince people. Everybody knows the cost models can be improved, the
only question is in what cases? and in what ways?

So deriving the cost model needs lots of trustworthy test results that
can be assessed and discussed, so we know how to improve things. [...and
I don't mean 5 minutes with pg_bench...]

...
DBT seems to be a reasonable test database
I was discussing finding the cost equations to use within the optimizer
based upon a series of exploratory tests using varying data. That is
different to using the same database with varying parameters. Both sound
interesting, but it is the former that, IMHO, would be the more
important.
True, although that doesn't necessarily mean you can't use the same data
generation. For the testing I was doing before I was just varying
correlation using cluster (or selecting from different fields with
different correlations).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Mon, 2005-10-17 at 12:49 +0100, Simon Riggs wrote:
On Fri, 2005-10-14 at 19:09 -0400, Tom Lane wrote:
I wrote:
I'm thinking that IN should be
converted to a ScalarArrayOpExpr, i.e.
x = ANY (ARRAY[val1,val2,val3,val4,...])
Actually, there is one little thing in the way of doing this: it'll
fail if any of the IN-list elements are NULL, because we have not got
support for arrays with null elements. So we'd have to fix that first.

You'd also need to consider how this affects partial indexes and
constraint exclusion. Not much of an issue, but an extra case to handle
in the predicate proving code.

= = =
Just had a case where using an IN list was quicker than using a join
because it allowed an index lookup to occur. There is also some clear
mileage in transforming this type of query to a more plannable form:

select * from bigtable where word IN (
    select word from customer_word where customer = 6)

i.e. where the values for the IN clause are evaluated at run time,
rather than at plan time.
Do you think we'll be able to generate a single ScalarArrayOpExpr from a
small subselect and pass it through as an indexable expression?
I'm guessing its not lost on you that this would give a Star join like
capability, when joining multiple dimension tables to a large Fact
table.
e.g.
Select * From Sales where month IN (
select month from time_dimension where FinYear = 2005 and Quarter = 3)
...
Having taught predtest.c about ScalarArrayOpExpr means that would allow
this to work with constraint exclusion.
So that solves the how-to-join-AND-partition problem I've been
struggling with: don't join, transform. Very cool.
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
Do you think we'll be able to generate a single ScalarArrayOpExpr from a
small subselect and pass it through as an indexable expression?
If you don't mind spelling it with the ARRAY(sub-select) syntax, which
I think is a Postgres-ism (though it's possible Joe got it from
SQL2003).
regression=# explain select * from tenk1 where unique1 = any (array(select f1 from int4_tbl));
QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=3.09..37.86 rows=10 width=244)
Recheck Cond: (unique1 = ANY ($0))
InitPlan
-> Seq Scan on int4_tbl (cost=0.00..1.05 rows=5 width=4)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..2.04 rows=10 width=0)
Index Cond: (unique1 = ANY ($0))
(6 rows)
Of course the planner is just guessing about how many rows this will
produce.
e.g.
Select * From Sales where month IN (
select month from time_dimension where FinYear = 2005 and Quarter = 3)
Having taught predtest.c about ScalarArrayOpExpr means that would allow
this to work with constraint exclusion.
Not hardly, unless you want to play fast and loose with semantics by
evaluating subselects at plan time instead of run time. You could
persuade that to happen by wrapping the ARRAY(sub-select) into a
function mis-declared as IMMUTABLE, but I'd be pretty resistant to
having the planner assume any such thing by default.
regards, tom lane
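For concreteness, the mis-declared-IMMUTABLE trick Tom describes could be sketched as below; the wrapper function is entirely hypothetical (name and signature invented), built around the time_dimension example from earlier in the thread:

```sql
-- Hypothetical wrapper, deliberately (and unsafely) declared IMMUTABLE
-- so the planner folds the ARRAY(sub-select) to a constant at plan time.
-- If time_dimension later changes, any cached plan carries a stale array.
CREATE FUNCTION quarter_months(finyear int, quarter int)
RETURNS int[] AS $$
    SELECT ARRAY(SELECT month FROM time_dimension
                 WHERE FinYear = $1 AND Quarter = $2);
$$ LANGUAGE sql IMMUTABLE;

SELECT * FROM Sales
WHERE month = ANY (quarter_months(2005, 3));
```

This is exactly the semantic hazard Tom objects to: correctness now depends on time_dimension never changing between plan time and run time.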
Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Do you think we'll be able to generate a single ScalarArrayOpExpr from a
small subselect and pass it through as an indexable expression?

If you don't mind spelling it with the ARRAY(sub-select) syntax, which
I think is a Postgres-ism (though it's possible Joe got it from
SQL2003).
It's in SQL 2003. See excerpt below.
Joe
6.36 <array value constructor>
Function
Specify construction of an array.
Format
<array value constructor> ::=
<array value constructor by enumeration> |
<array value constructor by query>
<array value constructor by enumeration> ::=
ARRAY <left bracket or trigraph>
<array element list>
<right bracket or trigraph>
<array value constructor by query> ::=
ARRAY <left paren>
<query expression> [ <order by clause> ]
<right paren>
On Tue, 2005-11-29 at 17:21 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Do you think we'll be able to generate a single ScalarArrayOpExpr from a
small subselect and pass it through as an indexable expression?

If you don't mind spelling it with the ARRAY(sub-select) syntax, which
I think is a Postgres-ism (though it's possible Joe got it from
SQL2003).

regression=# explain select * from tenk1 where unique1 = any (array(select f1 from int4_tbl));
QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=3.09..37.86 rows=10 width=244)
Recheck Cond: (unique1 = ANY ($0))
InitPlan
-> Seq Scan on int4_tbl (cost=0.00..1.05 rows=5 width=4)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..2.04 rows=10 width=0)
Index Cond: (unique1 = ANY ($0))
(6 rows)

Of course the planner is just guessing about how many rows this will
produce.
So we could teach the planner to transform:
IN (subselect)
into
= ANY(array(subselect))
if we had the planner think the subselect had say < 1000 rows?
e.g.
Select * From Sales where month IN (
select month from time_dimension where FinYear = 2005 and Quarter = 3)

Having taught predtest.c about ScalarArrayOpExpr means that would allow
this to work with constraint exclusion.

Not hardly, unless you want to play fast and loose with semantics by
evaluating subselects at plan time instead of run time. You could
persuade that to happen by wrapping the ARRAY(sub-select) into a
function mis-declared as IMMUTABLE, but I'd be pretty resistant to
having the planner assume any such thing by default.
Man, that's a horrible thought. I must be dragging you down :-)

IMHO the only way to do joins that access partitions is to do the
constraint exclusion at run time, but I can see that's a longer
conversation than I can start right now.
Best Regards, Simon Riggs
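Spelled out on Simon's Sales example, the proposed transform would rewrite the query as follows (a sketch of the rewrite only, not an implemented planner feature; the < 1000-row cutoff is just the threshold suggested in the mail):

```sql
-- Original form:
SELECT * FROM Sales WHERE month IN (
    SELECT month FROM time_dimension
    WHERE FinYear = 2005 AND Quarter = 3);

-- Form the planner would substitute when it estimates the subselect
-- to be small, making the values available as an indexable array:
SELECT * FROM Sales WHERE month = ANY (ARRAY(
    SELECT month FROM time_dimension
    WHERE FinYear = 2005 AND Quarter = 3));
```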
On Tue, Nov 29, 2005 at 10:53:38PM +0000, Simon Riggs wrote:
On Tue, 2005-11-29 at 17:21 -0500, Tom Lane wrote:
regression=# explain select * from tenk1 where unique1 = any (array(select f1 from int4_tbl));
<snip>
So we could teach the planner to transform:
IN (subselect)
into
= ANY(array(subselect))
if we had the planner think the subselect had say < 1000 rows?
Do these constructs have the same semantics w.r.t. NULL? Currently
arrays can't have nulls but that is changing. Also, won't they behave
differently in the case where the subselect returns duplicate values?
And finally, why can't:
Select * From Sales where month IN (
select month from time_dimension where FinYear = 2005 and Quarter = 3)
Be written as:
Select sales.* From Sales, time_dimension
where Sales.month = time_dimension.month
and time_dimension.FinYear = 2005 and time_dimension.Quarter = 3;
As long as there are no NULLs it returns the same as the IN() version
and PostgreSQL can optimise it just fine.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
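One equivalence caveat for the join rewrite above: besides NULLs, duplicates matter. If month is not unique in time_dimension for the chosen quarter (it typically is not, with one row per day), the plain join emits each matching Sales row once per duplicate, where IN() does not. A DISTINCT on the inner side (i.e. a semi-join) restores the match:

```sql
-- IN() form: each qualifying Sales row appears exactly once.
SELECT * FROM Sales WHERE month IN (
    SELECT month FROM time_dimension
    WHERE FinYear = 2005 AND Quarter = 3);

-- Join form: DISTINCT is needed on the inner side when month is not
-- unique there, otherwise Sales rows are duplicated in the output.
SELECT Sales.* FROM Sales
JOIN (SELECT DISTINCT month FROM time_dimension
      WHERE FinYear = 2005 AND Quarter = 3) q
  ON Sales.month = q.month;
```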
Simon Riggs <simon@2ndquadrant.com> writes:
IMHO the only way to do joins that access partitions is to do the
constraint exclusion at run time, but I can see that's a longer
conversation than I can start right now.
My experience in Oracle was that you can see three different types of
partition plans
1) Plans that the planner can statically deduce will access specific
partitions. This is basically what Postgres has now.
2) Plans that the planner can statically deduce will only access a limited
number of partitions but which partitions will be determined at run-time.
This is sort of like how Postgres handles index scans in subqueries where
the value being looked up is a parameter.
3) Plans that expect to span all the partitions because it can't find any
constraint in the query from which it can deduce which partitions to look
at.
Option 2 happens both for joins, and also for simple lookups where the value
being looked up is a parameter in a prepared query. My personal inclination is
that prepared queries are the way of the future so I think this is an
important case.
--
greg
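Greg's second category maps onto the prepared-statement case; a minimal sketch (table and statement names invented), assuming Sales is partitioned by month through CHECK constraints on inheritance children, the then-current scheme:

```sql
-- The partition key is a parameter, unknown at PREPARE time.
PREPARE sales_by_month (int) AS
    SELECT * FROM Sales WHERE month = $1;

-- At EXECUTE time the target partition is fully determined, but
-- plan-time constraint exclusion cannot use $1, so the generic plan
-- spans every partition -- the case option 2 would fix.
EXECUTE sales_by_month (7);
```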
Martijn van Oosterhout <kleptog@svana.org> writes:
Do these constructs have the same semantics w.r.t. NULL?
I think so, though it'd be good to read the spec closely.
Currently arrays can't have nulls
That's so last week ;-)
regards, tom lane
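The three-valued logic can be checked by hand once arrays may contain NULL elements (i.e. in versions with the null-array support just mentioned); both forms yield NULL rather than false when no element matches and a NULL is present:

```sql
SELECT 1 IN (2, NULL);            -- NULL (unknown), not false
SELECT 1 = ANY (ARRAY[2, NULL]);  -- NULL as well
SELECT 2 IN (2, NULL);            -- true: a real match wins over the NULL
SELECT 2 = ANY (ARRAY[2, NULL]);  -- true
```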
On Wed, 2005-11-30 at 07:18 +0100, Martijn van Oosterhout wrote:
And finally, why can't:
Select * From Sales where month IN (
select month from time_dimension where FinYear = 2005 and Quarter = 3)

Be written as:

Select sales.* From Sales, time_dimension
where Sales.month = time_dimension.month
and time_dimension.FinYear = 2005 and time_dimension.Quarter = 3;

As long as there are no NULLs it returns the same as the IN() version
and PostgreSQL can optimise it just fine.
It can, of course, but there must be value in that optimization.
If you consider how IN () would be transformed into
=ANY(ARRAY(subselect)) you'll see that the subselect values would be
treated as constants that could result in a bitmap index lookup.
Transforming IN () into straight joins would not take the same approach
when more than one join (i.e. 3 or more tables) was requested.
Best Regards, Simon Riggs
On Fri, Dec 02, 2005 at 08:18:44AM +0000, Simon Riggs wrote:
It can, of course, but there must be value in that optimization.
If you consider how IN () would be transformed into
=ANY(ARRAY(subselect)) you'll see that the subselect values would be
treated as constants that could result in a bitmap index lookup.

Transforming IN () into straight joins would not take the same approach
when more than one join (i.e. 3 or more tables) was requested.
Are you sure? If you have one table joined to many others, that is the
single most obvious case for bitmap indexes. And joins are converted to
bitmap index scans all the time so I'm unsure why this case would be
any different. If the results are the same, the optimiser should
optimise both the same, no?
Anyway, maybe I'm just old fashioned in thinking that joins are by far
the easiest to optimise because they are closest to relational algebra.
IN() can also be converted to a join, except for the NULL effect.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.