Recheck condition

Started by Josh Harrisonover 18 years ago16 messagesgeneral

joshques@gmail.com

over 18 years ago

Hi,
Sorry if my question is silly.
When I use explain analyze command I come across 'recheck condition' in some
places.
I googled for this but didn't get any solid answers.

What is recheck condition and when does the query planner choose this?
Thanks
josh

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: Josh Harrison (#1)

Re: Recheck condition

On Wed, Nov 28, 2007 at 01:18:56PM -0500, Josh Harrison wrote:

Hi,
Sorry if my question is silly.
When I use explain analyze command I come across 'recheck condition' in some
places.
I googled for this but didn't get any solid answers.

Some indexes are inexact, i.e. they may sometimes return tuples that
don't actually match the index condition. This also happens with bitmap
scans, because it'll return anything in the bitmap which will probably
be more than what you asked for. The recheck just means that the
planner retests the index condition on the result to make sure you only
get the rows you wanted.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Martijn van Oosterhout (#2)

Re: Recheck condition

Some indexes are inexact, i.e. they may sometimes return tuples that
don't actually match the index condition.

What causes an index to be inexact. When you create an index and vacuum it
regularly, it is suppose to be correct....right??

This also happens with bitmap
scans, because it'll return anything in the bitmap which will probably
be more than what you asked for. The recheck just means that the
planner retests the index condition on the result to make sure you only
get the rows you wanted

So does recheck condition affect the performance of the queries since it
basically rechecks the condition?
Also does it goes to the heap to retest ?

For example for this query
explain analyze select count(*) from foo where foo_id=1 I get the following
plan

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=1949.84..1949.85 rows=1 width=0) (actual
time=7.996..7.996rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=277.45..1924.94 rows=9959 width=0)
(actual time=1.903..5.270 rows=10020 loops=1)
Recheck Cond: (foo_id =
1::numeric)

-> Bitmap Index Scan on foo_pk (cost=0.00..274.96 rows=9959
width=0) (actual time=1.864..1.864 rows=10020 loops=1)
Index Cond: (foo_id =
1::numeric)

Total runtime: 8.062 ms

Can you please explain to me with respect to this example what is happening
here? This is a small table but for big tables the performance is not very
good. Does recheck condition brings down the query performance?

Thanks
josh

Import Notes

Resolved by subject fallback

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: Josh Harrison (#3)

Re: Recheck condition

On Wed, Nov 28, 2007 at 02:20:11PM -0500, Josh Harrison wrote:

Some indexes are inexact, i.e. they may sometimes return tuples that
don't actually match the index condition.

What causes an index to be inexact. When you create an index and vacuum it
regularly, it is suppose to be correct....right??

The nature of the beast. For example, if you create an index on large
integer arrays it doesn't store the actual array in the index, but a
hashed version thereof. When we scan the index because of this hashing
it might match other arrays that shouldn't be. Hence the recheck.

Similarly for geometry indexes. The index only stores bounding boxes
and an intersection test might hit the bounding box but not match the
actual query.

So does recheck condition affect the performance of the queries since it
basically rechecks the condition?
Also does it goes to the heap to retest ?

At the time of the recheck the data is already in memory. So no, it
doesn't go back to the heap.

For example for this query
explain analyze select count(*) from foo where foo_id=1 I get the following
plan

It isn't the recheck that's costing it, it's probably just that you're
matching a lot of rows. A bitmap scan classically needs a recheck
because if a lot of rows need to be stored it might remember only
blocks 2044-2060. It then needs to recheck each row as it comes through
to make sure it really matches the conditions.

This query is 8ms, I imagine when it takes a long time it's matching
lots of rows?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: Josh Harrison (#3)

Re: Recheck condition

Please always CC the list so other people can respond.

On Wed, Nov 28, 2007 at 10:21:39PM -0500, Josh Harrison wrote:

It isn't the recheck that's costing it, it's probably just that you're
matching a lot of rows. A bitmap scan classically needs a recheck
because if a lot of rows need to be stored it might remember only
blocks 2044-2060. It then needs to recheck each row as it comes through
to make sure it really matches the conditions.

What is this number 2044-2060? Is this a fixed number in postgres?

Ofcourse not. Have you read the documentation on explain yet?
http://www.postgresql.org/docs/8.2/static/using-explain.html

The point is that the bitmap may have an inexact representation of the
tuples that match. If your scan indicates you'll match 10 million
entries and you only want to use 16KB for your bitmap, obviously you
can't store all the locations exactly.

For example if I have a table Person with 3 fields (name,city_id,age). And
the table contains 1000 rows. The table has 2 indexes city_id and age
If I have a query :
SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

The answer is "it depends". Postgres has a cost based planner, it will
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.

Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.

Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.

Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).

In other words, Will this query cause 1000 random heap access or 10 random
heap access ?

I don't know, run it and see.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

Import Notes

Reply to msg id not found: 8d89ea1d0711281921t3bf9ece0p82b73f787c8fe18f@mail.gmail.com

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Martijn van Oosterhout (#5)

Re: Recheck condition

For example if I have a table Person with 3 fields (name,city_id,age).

And

the table contains 1000 rows. The table has 2 indexes city_id and age
If I have a query :
SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

The answer is "it depends". Postgres has a cost based planner, it will
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.

Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.

Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.

Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).

Okay....So If I have a query like the above and the query plan shows a
'recheck condition' and bitmap scan, then does that mean it scans the
indexes first to get the intermediate results and goto the heap only for the
final data?

Thanks

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Josh Harrison (#6)

Re: Recheck condition

Josh Harrison escribiï¿½:

For example if I have a table Person with 3 fields (name,city_id,age).

And

the table contains 1000 rows. The table has 2 indexes city_id and age
If I have a query :
SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

Okay....So If I have a query like the above and the query plan shows a
'recheck condition' and bitmap scan, then does that mean it scans the
indexes first to get the intermediate results and goto the heap only for the
final data?

Yes.

If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).

Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.

--
Alvaro Herrera Valdivia, Chile ICBM: S 39ï¿½ 49' 18.1", W 73ï¿½ 13' 56.4"
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Alvaro Herrera (#7)

Re: Recheck condition

On Nov 29, 2007 8:15 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Josh Harrison escribió:

For example if I have a table Person with 3 fields

(name,city_id,age).

And

the table contains 1000 rows. The table has 2 indexes city_id and

age

If I have a query :
SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

Okay....So If I have a query like the above and the query plan shows a
'recheck condition' and bitmap scan, then does that mean it scans the
indexes first to get the intermediate results and goto the heap only for

the

final data?

Yes.

If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).

Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.

Thanks...
I have 1 more question in the same line...

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

There is an index created as person_idx(column1,column2)

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

SetOp Intersect (cost=1750719.48..1769378.35 rows=373177 width=4) (actual
time=42913.626..47247.650 rows=6352
loops=1)
-> Sort (cost=1750719.48..1760048.92 rows=3731774 width=4) (actual
time=42913.537..45838.472 rows=3756726
loops=1)
Sort Key: "*SELECT*
1".patient_id

*Sort method: external merge Disk:73432kB *

-> Append (cost=17886.42..1209431.67 rows=3731774 width=4)
(actual time=1474.995..32215.493 rows=3756726
loops=1)
-> Subquery Scan "*SELECT* 1"
(cost=17886.42..496377.90rows=381993 width=4) (actual time=
1474.993..4936.240 rows=327498 loops=1)
-> Bitmap Heap Scan on person (cost=
17886.42..492557.97 rows=381993 width=4) (actual
time=1474.990..4735.972rows=327498 loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text
= '62'::text))
-> Bitmap Index Scan on person_idx (cost=
0.00..17790.92 rows=381993 width=0) (actual
time=1469.508..1469.508rows=327498 loops=1)
Index Cond: ((column1 = 1) AND
((column2)::text =
'62'::text))
-> Subquery Scan "*SELECT* 2"
(cost=156754.24..713053.77rows=3349781 width=4) (actual time=
4142.577..25518.305 rows=3429228 loops=1)
-> Bitmap Heap Scan on person (cost=
156754.24..679555.96 rows=3349781 width=4) (actual
time=4142.573..23493.596rows=3429228 loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text
= '189'::text))
-> Bitmap Index Scan on person_idx (cost=
0.00..155916.80 rows=3349781 width=0) (actual
time=4136.948..4136.948rows=3429228 loops=1)
Index Cond: ((column1 = 1) AND
((column2)::text =
'189'::text))
Total runtime: 47250.501 ms

**
Question:
In this query Intersection is used. How does postgres handle this? The steps
in the above query are
1.find all tuples that match column1=1 AND column2='62'
2. find all tuples that match column1=1 AND column2='189'
3. Find the intersection of the above 2
Does it go to the heap even to get the intermediate results (1 & 2) ?
or
Does it do the first 2 steps using index and go to the heap for the final
data?

Also what does *Sort method: external merge Disk:73432kB *mean? Should I
have to modify this to make this query run faster? Postgres takes 4 times
slower than Oracle to return this query. Is there a way to make this
faster?

Thanks
jo

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Josh Harrison (#8)

Re: Recheck condition

Josh Harrison escribiï¿½:

Thanks...
I have 1 more question in the same line...

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

#10

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Alvaro Herrera (#9)

Re: Recheck condition

On Nov 30, 2007 7:55 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Josh Harrison escribió:

Thanks...
I have 1 more question in the same line...

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.

I get the same plan(see below) with 'sort' for 'intersect all' operation
too. Why is intersect not an effecient way? Is there any other way this
query/index can be written/created so that I can get the intersect results
in an efficient way?
Thanks
jo

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

SetOp Intersect All (cost=1750719.48..1769378.35 rows=373177 width=4)
(actual time=41065.459..45469.038 rows=128562
loops=1)
-> Sort (cost=1750719.48..1760048.92 rows=3731774 width=4) (actual
time=41065.375..44027.342 rows=3756726
loops=1)
Sort Key: "*SELECT*
1".patient_id

Sort Method: external merge Disk:
73432kB

-> Append (cost=17886.42..1209431.67 rows=3731774 width=4)
(actual time=1445.675..30171.066 rows=3756726
loops=1)
-> Subquery Scan "*SELECT* 1"
(cost=17886.42..496377.90rows=381993 width=4) (actual time=
1445.674..8223.061 rows=327498 loops=1)
-> Bitmap Heap Scan on person (cost=
17886.42..492557.97 rows=381993 width=4) (actual
time=1445.670..8021.006rows=327498 loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text
= '62'::text))
-> Bitmap Index Scan on person_idx (cost=
0.00..17790.92 rows=381993 width=0) (actual
time=1440.189..1440.189rows=327498 loops=1)
Index Cond: ((column1 = 1) AND
((column2)::text =
'62'::text))
-> Subquery Scan "*SELECT* 2"
(cost=156754.24..713053.77rows=3349781 width=4) (actual time=
4183.977..20195.276 rows=3429228 loops=1)
-> Bitmap Heap Scan on person (cost=
156754.24..679555.96 rows=3349781 width=4) (actual
time=4183.973..18191.919rows=3429228 loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text
= '189'::text))
-> Bitmap Index Scan on person_idx (cost=
0.00..155916.80 rows=3349781 width=0) (actual
time=4178.644..4178.644rows=3429228 loops=1)
Index Cond: ((column1 = 1) AND
((column2)::text =
'189'::text))
Total runtime: 45504.425 ms

#11

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: Josh Harrison (#10)

Re: Recheck condition

On Fri, Nov 30, 2007 at 08:21:18AM -0500, Josh Harrison wrote:

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

I get the same plan(see below) with 'sort' for 'intersect all' operation
too. Why is intersect not an effecient way? Is there any other way this
query/index can be written/created so that I can get the intersect results
in an efficient way?

Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:

SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;

or perhaps:

SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62'
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;

Which will probably generate a merge join...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#12

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Martijn van Oosterhout (#11)

Re: Recheck condition

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

I get the same plan(see below) with 'sort' for 'intersect all'

operation

too. Why is intersect not an effecient way? Is there any other way this
query/index can be written/created so that I can get the intersect

results

in an efficient way?

Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:

SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62')
a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;

or perhaps:

SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62'
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;

Which will probably generate a merge join...

Thanks. But this query seems to be more expensive than using intersect
operator.
This is the explain analyse plan for this query. It took 5 1/2 minutes to
generate this. I also tried to disable the mergejoin and in that case it
uses hash join and still takes more than 3 minutes (intersect took only 40
sec)

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Merge Join (cost=1610648.92..10280119.99 rows=577856095 width=4) (actual
time=30562.630..264534.677 rows=225145385
loops=1)
Merge Cond: (a.patient_id = b.patient_id)

-> Sort (cost=527974.81..528929.79 rows=381993 width=4) (actual time=
3755.361..3845.134 rows=213435
loops=1)
Sort Key: a.patient_id

Sort Method: quicksort Memory:
15868kB

-> Bitmap Heap Scan on clinical_variable2 a (cost=
17886.42..492557.97 rows=381993 width=4) (actual
time=315.753..3410.366rows=327498 loops=1)
Recheck Cond: ((top_parent_service_sys_id = 1) AND
((top_parent_service_code)::text =
'62'::text))
-> Bitmap Index Scan on clinical_variable_idx_topserv
(cost=0.00..17790.92 rows=381993 width=0) (actual
time=277.185..277.185rows=327498 loops=1)
Index Cond: ((top_parent_service_sys_id = 1) AND
((top_parent_service_code)::text =
'62'::text))
-> Materialize (cost=1082674.11..1124546.38 rows=3349781 width=4)
(actual time=26807.248..99885.620 rows=225148250
loops=1)
-> Sort (cost=1082674.11..1091048.57 rows=3349781 width=4)
(actual time=26807.238..30343.870 rows=3429228
loops=1)
Sort Key: b.patient_id

Sort Method: external merge Disk:
53552kB

-> Bitmap Heap Scan on clinical_variable2 b (cost=
156754.24..679555.96 rows=3349781 width=4) (actual
time=2744.126..20106.160rows=3429228 loops=1)
Recheck Cond: ((top_parent_service_sys_id = 1) AND
((top_parent_service_code)::text =
'189'::text))
-> Bitmap Index Scan on clinical_variable_idx_topserv
(cost=0.00..155916.80 rows=3349781 width=0) (actual
time=2686.456..2686.456rows=3429228 loops=1)
Index Cond: ((top_parent_service_sys_id = 1) AND
((top_parent_service_code)::text =
'189'::text))
Total runtime: 324646.035ms

18 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms] a: 0/ms]

Is there any other way you can think of to solve this problem. May be
creating the indexes in a different way or something?

Thanks
jo

#13

Bruce Momjian

bruce@momjian.us

over 18 years ago

In reply to: Martijn van Oosterhout (#11)

Re: Recheck condition

"Martijn van Oosterhout" <kleptog@svana.org> writes:

On Fri, Nov 30, 2007 at 08:21:18AM -0500, Josh Harrison wrote:

*Query1*
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

I get the same plan(see below) with 'sort' for 'intersect all' operation
too. Why is intersect not an effecient way? Is there any other way this
query/index can be written/created so that I can get the intersect results
in an efficient way?

Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare.

I think all the set operations are implemented this way. It's actually a
pretty clever plan if you're processing two large lists without indexes but,
it would be nice to support a fuller set of plans like we do for other kinds
of queries. For INTERSECT star-schema joins might actually be best.

A query like you write would be better expressed as a join, something like:

SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;

or perhaps:

SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62'
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;

Or using an IN or EXISTS query:

SELECT person_id
FROM person
WHERE column1=1
AND column2='62'
AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='189'
)

SELECT person_id
FROM person AS parent
WHERE column1=1
AND column2='62'
AND EXISTS (
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#14

Josh Harrison

joshques@gmail.com

over 18 years ago

In reply to: Bruce Momjian (#13)

Re: Recheck condition

Or using an IN or EXISTS query:

SELECT person_id
FROM person
WHERE column1=1
AND column2='62'
AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='189'
)

or

SELECT person_id
FROM person AS parent
WHERE column1=1
AND column2='62'
AND EXISTS (
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)

Thanks for your reply
The query with IN gave this plan and took 1m19sec to give the result which
is slightly more than the intersect query(40 sec). The other query with
exists takes way long time for results. All these queries does a heap scan
for intermediate results...right? Is there a way to get them not to use the
heap for intermediate result and go to heap only for final data? This will
drastically improve the performance but Im not sure if postgres can do that?
Will creating the index in a different way and/or rewriting the query in a
different way achieve this result?

Thanks
jo

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=705823.44..1182434.64 rows=43631 width=4) (actual time=
26443.675..52055.698 rows=140464
loops=1)
Hash Cond: (public.person.patient_id = public.person.patient_id)

-> Bitmap Heap Scan on person (cost=17886.42..492557.97 rows=381993
width=4) (actual time=442.934..25779.601 rows=327498
loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text =
'62'::text))

-> Bitmap Index Scan on person_idx
(cost=0.00..17790.92rows=381993 width=0) (actual time=
403.869..403.869 rows=327498 loops=1)
Index Cond: ((column1 = 1) AND ((column2)::text =
'62'::text))
-> Hash (cost=687933.35..687933.35 rows=294 width=4) (actual time=
26000.635..26000.635 rows=6568
loops=1)
-> HashAggregate (cost=687930.41..687933.35 rows=294 width=4)
(actual time=25992.971..25996.471 rows=6568
loops=1)
-> Bitmap Heap Scan on person
(cost=156754.24..679555.96rows=3349781 width=4) (actual time=
3202.251..23974.389 rows=3429228 loops=1)
Recheck Cond: ((column1 = 1) AND ((column2)::text =
'189'::text))
-> Bitmap Index Scan on person_idx (cost=
0.00..155916.80 rows=3349781 width=0) (actual
time=3145.912..3145.912rows=3429228 loops=1)
Index Cond: ((column1 = 1) AND ((column2)::text =
'189'::text))
Total runtime: 52094.598 ms

#15

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: Josh Harrison (#14)

Re: Recheck condition

On Fri, Nov 30, 2007 at 11:27:24AM -0500, Josh Harrison wrote:

Thanks for your reply
Is there a way to get them not to use the
heap for intermediate result and go to heap only for final data? This will
drastically improve the performance but Im not sure if postgres can do that?
Will creating the index in a different way and/or rewriting the query in a
different way achieve this result?

I'm trying to imagine what it would take to avoid the heap access after
the index scan I don't think it's possible. It would require that the
bitmaps generated by the bitmap scan have the person_id attached and
then have the bitmap AND operation only happen if the person IDs match.
No such machinary currently exists.

You do have somewhat of a worst case scenario, intersecting 300k
records with 3million records. I'm not sure if there is a good way to
handle that. The only things I can think of is to add the person_id to
the index also so that you can avoid the sort (not sure if first or
last is more helpful). Or perhaps clustering on that index to reduce
disk access...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#16

Bruce Momjian

bruce@momjian.us

over 18 years ago

In reply to: Martijn van Oosterhout (#15)

Re: Recheck condition

"Martijn van Oosterhout" <kleptog@svana.org> writes:

On Fri, Nov 30, 2007 at 11:27:24AM -0500, Josh Harrison wrote:

Thanks for your reply
Is there a way to get them not to use the
heap for intermediate result and go to heap only for final data? This will
drastically improve the performance but Im not sure if postgres can do that?
Will creating the index in a different way and/or rewriting the query in a
different way achieve this result?

I'm trying to imagine what it would take to avoid the heap access after
the index scan I don't think it's possible. It would require that the
bitmaps generated by the bitmap scan have the person_id attached and
then have the bitmap AND operation only happen if the person IDs match.
No such machinary currently exists.

I think you're describing a star schema join. This is a common checklist item
for data warehousing databases.

The classic data warehouse has a table like "person" which has the info you're
looking for, and dozens of tables with person_id and possibly some associated
data. In some cases those tables don't even have any other data, the mere
existence of the person_id in that table is enough.

So a typical query could look like something like:

select *
from person
where person_id in (select person_id from people_who_used_service_in_the_past)
and person_id in (select person_id from people_with_big_balances)
and person_id in (select person_id from people_...)
and person_id not in (select person_id from people_who_unsubscribed)
and person_id not in (select person_id from people_who_we_mailed_last_week)

The best plan for this is to gather up the person_ids in a kind of bitmap scan
with a bitmap of ids. And once the bitmap is done scan an index on person for
just the matching records. Postgres doesn't support anything like this (yet:).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!