problem with huge joins

Started by Kolus Maximilianoover 22 years ago4 messagesgeneral
Jump to latest
#1Kolus Maximiliano
Kolus.maximiliano@bcr.com.ar

Hello,

I don't know if this is *entirely* a postgresql question or if it's
my screwup with the database design.

First, a little introduction to the problem: I maintain a database
of open proxies (they're used for tagging and/or rejecting email) and I use
postgres as it's backend. The list itself is contained in three tables:

tHost (458K rows)
Hostid integer (PK + HASH INDEX)
Ip inet
Rdns varchar

tList (403K rows)
Listid integer (PK)
Input integer
Output integer (HASH INDEX)
Date timestamp

tPort (986K rows)
Portid integer (PK)
Listid integer (HASH INDEX)
Protocol integer
Port integer (BTREE INDEX)

Every hour, I must build a couple of files with the list itself, in
order to do this I need to retrieve output ip address (list+host) and for
each ip address port/protocol combinations, ie:

192.168.1.1 1080 3 (for socks 4)

In order to get this, I run this query:

SELECT ip, TO_CHAR(date, 'YYYY-MM-DD'), protocol, port
FROM tProxyPort, tProxyList, tProxyHost
WHERE tProxyPort.listId=tProxyList.listId
AND tProxyList.output=tProxyHost.hostId
ORDER BY ip, port

Whose query plan is:

Sort (cost=311874.07..311874.07 rows=986130 width=44) (actual
time=300086.42..302580.25 rows=986130 loops=1)
-> Hash Join (cost=39735.96..96907.83 rows=986130 width=44) (actual
time=86226.28..223195.50 rows=986130 loops=1)
-> Seq Scan on tport (cost=0.00..18629.30 rows=986130 width=12)
(actual time=0.15..25910.56 rows=986130 loops=1)
-> Hash (cost=35972.38..35972.38 rows=403034 width=32) (actual
time=86194.99..86194.99 rows=0 loops=1)
-> Hash Join (cost=9787.92..35972.38 rows=403034 width=32)
(actual time=12180.64..84316.65 rows=403927 loops=1)
-> Seq Scan on thost (cost=0.00..7850.41 rows=457341
width=16) (actual time=619.09..10032.85 rows=458787 loops=1)
-> Hash (cost=6812.34..6812.34 rows=403034 width=16)
(actual time=6656.36..6656.36 rows=0 loops=1)
-> Seq Scan on tlist (cost=0.00..6812.34
rows=403034 width=16) (actual time=6.90..5030.22 rows=403927 loops=1)
Total runtime: 317046.69 msec

As you can see, it takes eons to complete and I couldn't find a way
to make it faster (the hash indexes are one of the attempts and yes, I did a
vacuum). I don't think that a table with nearly one million entries is "that
much" for postgres and I'm convinced I'm doing something wrong. Any ideas on
how to speed this thing up will be appreciated.

Version:
PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
(shared buffers: 128)

CPU:
Pentium III (Katmai) (600Mhz)

/proc/meminfo:

total: used: free: shared: buffers: cached:
Mem: 263475200 218435584 45039616 0 17240064 161374208
Swap: 511959040 7389184 504569856
MemTotal: 257300 kB
MemFree: 43984 kB
MemShared: 0 kB
Buffers: 16836 kB
Cached: 155348 kB
SwapCached: 2244 kB
Active: 131204 kB
Inactive: 68760 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 257300 kB
LowFree: 43984 kB
SwapTotal: 499960 kB
SwapFree: 492744 kB

Thanks in advance.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kolus Maximiliano (#1)
Re: problem with huge joins

Kolus Maximiliano <Kolus.maximiliano@bcr.com.ar> writes:

In order to get this, I run this query:

SELECT ip, TO_CHAR(date, 'YYYY-MM-DD'), protocol, port
FROM tProxyPort, tProxyList, tProxyHost
WHERE tProxyPort.listId=tProxyList.listId
AND tProxyList.output=tProxyHost.hostId
ORDER BY ip, port

Whose query plan is:

Sort (cost=311874.07..311874.07 rows=986130 width=44) (actual
time=300086.42..302580.25 rows=986130 loops=1)
-> Hash Join (cost=39735.96..96907.83 rows=986130 width=44) (actual
time=86226.28..223195.50 rows=986130 loops=1)
-> Seq Scan on tport (cost=0.00..18629.30 rows=986130 width=12)
(actual time=0.15..25910.56 rows=986130 loops=1)
-> Hash (cost=35972.38..35972.38 rows=403034 width=32) (actual
time=86194.99..86194.99 rows=0 loops=1)
-> Hash Join (cost=9787.92..35972.38 rows=403034 width=32)
(actual time=12180.64..84316.65 rows=403927 loops=1)
-> Seq Scan on thost (cost=0.00..7850.41 rows=457341
width=16) (actual time=619.09..10032.85 rows=458787 loops=1)
-> Hash (cost=6812.34..6812.34 rows=403034 width=16)
(actual time=6656.36..6656.36 rows=0 loops=1)
-> Seq Scan on tlist (cost=0.00..6812.34
rows=403034 width=16) (actual time=6.90..5030.22 rows=403927 loops=1)
Total runtime: 317046.69 msec

The joins and sort steps seem to take rather a long time. What do you
have sort_mem set to? You probably want it on the order of 10Mb so that
these joins are done in memory rather than spilling to disk.

The hash indexes are a waste of time for this :-(

regards, tom lane

#3Kolus Maximiliano
Kolus.maximiliano@bcr.com.ar
In reply to: Tom Lane (#2)
Re: problem with huge joins

Total runtime: 317046.69 msec

Total runtime: 216001.94 msec

A lot better! Thanks!

The hash indexes are a waste of time for this :-(

Which kind should I use?

#4Martijn van Oosterhout
kleptog@svana.org
In reply to: Kolus Maximiliano (#3)
Re: problem with huge joins

On Fri, Oct 31, 2003 at 01:17:45PM -0300, Kolus Maximiliano wrote:

Total runtime: 317046.69 msec

Total runtime: 216001.94 msec

A lot better! Thanks!

The hash indexes are a waste of time for this :-(

Which kind should I use?

My guess that for large joins it can be more useful to use indexes to
presort a table and use highly efficient merge joins instead. However, hash
indexes cannot be used to sort a table, only for lookups.

btrees are more highly optimised and can produce sorted output.

Give it a shot.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

"All that is needed for the forces of evil to triumph is for enough good
men to do nothing." - Edmond Burke
"The penalty good people pay for not being interested in politics is to be
governed by people worse than themselves." - Plato