hash-join forgets tuples

Started by Sebastian Freundtover 20 years ago3 messagesgeneral
Jump to latest
#1Sebastian Freundt
hroptatyr@gna.org

Hello,

using a highly surjective left (or inner) join to a table reveals data
loss if the hash join method is used.
Here, highly surjective means I have a table with about 1.4 million tuples
which map to a table with about 40000 tuples.

Now here's the explanation:

qaos=# explain select anfs.anf_id from anfs left join groups USING (group_id) where
anfs.degree=3 and not groups.abelian_p;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Hash Join (cost=8111.80..103961.04 rows=299036 width=521)
Hash Cond: ("outer".group_id = "inner".group_id)
-> Bitmap Heap Scan on anfs (cost=2048.58..71382.50 rows=325594 width=216)
Recheck Cond: (degree = 3)
-> Bitmap Index Scan on anfs_degree (cost=0.00..2048.58 rows=325594 width=0)
Index Cond: (degree = 3)
-> Hash (cost=4275.79..4275.79 rows=40172 width=313)
-> Bitmap Heap Scan on groups (cost=346.60..4275.79 rows=40172 width=313)
Filter: (NOT abelian_p)
-> Bitmap Index Scan on groups_not_properties (cost=0.00..346.60 rows=40119 width=0)
Index Cond: (abelian_p = false)
(11 rows)

The query result is:

qaos=# select anfs.anf_id from anfs left join groups USING (group_id)
where anfs.degree=3 and not groups.abelian_p;
anf_id
--------
(0 rows)

Now whenever other join-methods (like merge join) are used, there are
definitely tuples in the result:

qaos=# select anfs.anf_id from anfs left join groups USING (group_id)
where anfs.degree=3 and not groups.abelian_p limit 20;
anf_id
--------
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
(20 rows)

qaos=# explain select anfs.anf_id from anfs left join groups USING
(group_id) where anfs.degree=3 and not groups.abelian_p limit 20;
QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=0.00..138.09 rows=20 width=8)
-> Nested Loop (cost=0.00..2064738.73 rows=299036 width=8)
-> Index Scan using anfs_degree on anfs (cost=0.00..1078774.74 rows=325594 width=16)
Index Cond: (degree = 3)
-> Index Scan using groups_group_id on groups (cost=0.00..3.02 rows=1 width=8)
Index Cond: ("outer".group_id = groups.group_id)
Filter: (NOT abelian_p)
(7 rows)

The canonical step in this misery, SET enable_hashjoin TO off; makes
queries work again, but the performance really suffers from it.

Greetings,
Sebastian

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sebastian Freundt (#1)
Re: hash-join forgets tuples

Sebastian Freundt <hroptatyr@gna.org> writes:

using a highly surjective left (or inner) join to a table reveals data
loss if the hash join method is used.

In which PG version? Given that you appear to be playing with 8.1devel
code, you might be needing this bug fix:

2005-07-23 22:25 tgl

* src/backend/nodes/tidbitmap.c: Fix logic error in tbm_intersect:
the intersection of a normal page and a lossy page has to be lossy,
because we don't know exactly which tuples on the page should
remain part of the bitmap. Per Jie Zhang.

If not, please send along a self-contained test case.

regards, tom lane

#3Sebastian Freundt
hroptatyr@gna.org
In reply to: Tom Lane (#2)
Re: hash-join forgets tuples

Tom Lane <tgl@sss.pgh.pa.us> writes:

Sebastian Freundt <hroptatyr@gna.org> writes:

using a highly surjective left (or inner) join to a table reveals data
loss if the hash join method is used.

In which PG version? Given that you appear to be playing with 8.1devel
code, you might be needing this bug fix:

2005-07-23 22:25 tgl

* src/backend/nodes/tidbitmap.c: Fix logic error in tbm_intersect:
the intersection of a normal page and a lossy page has to be lossy,
because we don't know exactly which tuples on the page should
remain part of the bitmap. Per Jie Zhang.

Ah, good hint. Indeed it is 8.1-devel. Let me examine your proposal.
Thanks Tom!

Regards,
Sebastian