Join with an array
Hi,
I'm trying to speed up a query with a lookup table. This lookup table
gets very big and should still fit into memory. It does not change very
often. Given these facts I decided to use an array, as follows:
CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);
I know this is not considered good database design, but it saves a lot
of overhead for tuple visibility compared to a 1:1 table.
To fetch an item via the lookup_table I tried to use the following
query:
SELECT i.id, i.title FROM item i
JOIN lookup_table lut ON i.id = ANY(lut.items)
WHERE lut.id = $LOOKUP_ID;
Unfortunately that one seems to always use a sequential scan over items.
As the items array in the lookup table often has only 3 - 10 entries
(compared to about 1 mio rows in the item table) this is a very
expensive operation.
I tried to circumvent the problem with generate_series:
SELECT i.id, i.title FROM generate_series(0, $MAX) s
JOIN lookup_table lut ON s = ANY(lut.items)
JOIN item i ON s = i.id
WHERE lut.id = $LOOKUP_ID;
That query uses the index to lookup the item, but as soon as $MAX gets
bigger than 10000 generate_series takes too long and too many
comparisons s = ANY(lut.items) need to be done.
I think it would be possible to write a function generate_series(INT[])
which returns all the elements of the array. The query would then look
something like:
SELECT i.id, i.title
FROM generate_series(SELECT lut.items FROM lookup_table lut WHERE
lut.id = $LOOKUP_ID) s
JOIN item i ON s = i.id;
Do you see any problem in implementing such function? Does something
similar already existt?
Why does the first query use a seqscan instead of the index on items? Do
I miss anything? What problems do I face if I want to teach the planner
to use the index in the first query [1]generally in most cases like "JOIN .. ON x IN ANY($ARRAY)" where $ARRAY is reasonably small.?
Regards
Markus
[1]: generally in most cases like "JOIN .. ON x IN ANY($ARRAY)" where $ARRAY is reasonably small.
$ARRAY is reasonably small.
On Thu, Feb 23, 2006 at 12:36:35PM +0100, Markus Schiltknecht wrote:
Hi,
I'm trying to speed up a query with a lookup table. This lookup table
gets very big and should still fit into memory. It does not change very
often. Given these facts I decided to use an array, as follows:CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);
<snip>
SELECT i.id, i.title FROM item i
JOIN lookup_table lut ON i.id = ANY(lut.items)
WHERE lut.id = $LOOKUP_ID;
At the very least you're going to have to tell us which version you are
running plus the output of EXPLAIN ANALYZE for that query. Anything
less and we're guessing. Have you got the appropriate indexes?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Markus,
have you seen contrib/intarray ?
Oleg
On Thu, 23 Feb 2006, Markus Schiltknecht wrote:
Hi,
I'm trying to speed up a query with a lookup table. This lookup table
gets very big and should still fit into memory. It does not change very
often. Given these facts I decided to use an array, as follows:CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);
I know this is not considered good database design, but it saves a lot
of overhead for tuple visibility compared to a 1:1 table.To fetch an item via the lookup_table I tried to use the following
query:SELECT i.id, i.title FROM item i
JOIN lookup_table lut ON i.id = ANY(lut.items)
WHERE lut.id = $LOOKUP_ID;Unfortunately that one seems to always use a sequential scan over items.
As the items array in the lookup table often has only 3 - 10 entries
(compared to about 1 mio rows in the item table) this is a very
expensive operation.I tried to circumvent the problem with generate_series:
SELECT i.id, i.title FROM generate_series(0, $MAX) s
JOIN lookup_table lut ON s = ANY(lut.items)
JOIN item i ON s = i.id
WHERE lut.id = $LOOKUP_ID;That query uses the index to lookup the item, but as soon as $MAX gets
bigger than 10000 generate_series takes too long and too many
comparisons s = ANY(lut.items) need to be done.I think it would be possible to write a function generate_series(INT[])
which returns all the elements of the array. The query would then look
something like:SELECT i.id, i.title
FROM generate_series(SELECT lut.items FROM lookup_table lut WHERE
lut.id = $LOOKUP_ID) s
JOIN item i ON s = i.id;Do you see any problem in implementing such function? Does something
similar already existt?Why does the first query use a seqscan instead of the index on items? Do
I miss anything? What problems do I face if I want to teach the planner
to use the index in the first query [1]?Regards
Markus
[1]: generally in most cases like "JOIN .. ON x IN ANY($ARRAY)" where
$ARRAY is reasonably small.---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hello Martijn,
On Thu, 2006-02-23 at 12:44 +0100, Martijn van Oosterhout wrote:
SELECT i.id, i.title FROM item i
JOIN lookup_table lut ON i.id = ANY(lut.items)
WHERE lut.id = $LOOKUP_ID;At the very least you're going to have to tell us which version you are
running plus the output of EXPLAIN ANALYZE for that query. Anything
less and we're guessing. Have you got the appropriate indexes?
Sorry, I knew I'd forget something ;-)
I'm on PostgreSQL 8.1.3. The 'PRIMARY KEY' constraint automatically
creates an index on the lookup_table. The items table as well has an
index on item(id).
Because the other, similar queries use the indices I concluded that in
this first query PostgreSQL _never_ uses an index scan. It also should
not always use it, because the array might be large and a seqscan could
be cheaper in such cases. How should the planer know? In my case,
thought, I assume it would always be cheaper to use an index scan.
If this functionality already exists I was very sorry for the noise and
I beg you to tell me what knobs to fiddle with to make the planner use
the index.
Regards
Markus
Hello Oleg,
On Thu, 2006-02-23 at 15:02 +0300, Oleg Bartunov wrote:
have you seen contrib/intarray ?
Yes, but in the documentation I did not find anything like
'generate_series' or thelike. Maybe I'm looking at the wrong place,
please give me a hint.
Regards
Markus
Markus Schiltknecht <markus@bluegap.ch> writes:
I'm trying to speed up a query with a lookup table. This lookup table
gets very big and should still fit into memory. It does not change very
often. Given these facts I decided to use an array, as follows:
CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);
I know this is not considered good database design, but it saves a lot
of overhead for tuple visibility compared to a 1:1 table.
To fetch an item via the lookup_table I tried to use the following
query:
SELECT i.id, i.title FROM item i
JOIN lookup_table lut ON i.id = ANY(lut.items)
WHERE lut.id = $LOOKUP_ID;
Unfortunately that one seems to always use a sequential scan over items.
FWIW, "indexcol = ANY(array)" searches are indexable in CVS tip.
There's no hope in any existing release though :-(
I tried to circumvent the problem with generate_series:
SELECT i.id, i.title FROM generate_series(0, $MAX) s
JOIN lookup_table lut ON s = ANY(lut.items)
JOIN item i ON s = i.id
WHERE lut.id = $LOOKUP_ID;
Seems like the hard way --- why aren't you searching over array subscripts?
SELECT i.id, i.title FROM generate_series(1, $MAX) s
JOIN lookup_table lut ON s <= array_upper(lut.items)
JOIN item i ON i.id = lut.items[s]
WHERE lut.id = $LOOKUP_ID;
$MAX need only be as large as the widest array in lookup_table.
regards, tom lane