where in (select array)
Hi List,
I have the might_like table that contains products a user might like if
he likes the present one (item).
CREATE TABLE might_like
(
item INTEGER NOT NULL
,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child INTEGER NOT NULL
)
WITHOUT OIDS;
CREATE INDEX might_like_x1 ON might_like(item);
Since there are (will be) hundreds of thousands of items, and 20+ might-
like items each, I thought it would be nice to reduce the set to 1/20th
by using a vector.
CREATE TABLE might_like_vector
(
item INTEGER NOT NULL
,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child_arr INTEGER[]
)
WITHOUT OIDS;
CREATE INDEX might_like_vector_x1 ON might_like_vector(item);
But then this doesn't work:
select
...
from
item pic
where
pic.objectid in (
select mlv.child_arr
from might_like_vector mlv
where mlv.item = 125 AND
mlv.created_at > now() - interval '1 week'
)
limit 16
This seems to work but is ugly:
select
...
from
item pic
where
pic.objectid in (
select mlv.child_arr[s.a]
from might_like_vector mlv
,generate_series(1,20) as s(a)
where mlv.item = 125 AND
mlv.created_at > now() - interval '1 week'
)
limit 16
Is there a better way to do it?
Thanks,
Marcus
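[The first query fails because IN compares pic.objectid, an integer, against whole integer[] rows. One workaround that avoids generate_series is to force the sub-select into array context with a cast and use = ANY; a sketch against the tables above (the select list and the LIMIT 1 are assumptions - the cast only works if the sub-select returns at most one row):]

```sql
-- = ANY (array) form: the cast makes PostgreSQL treat the
-- parenthesized sub-select as an array expression, not a row set.
SELECT pic.objectid
FROM item pic
WHERE pic.objectid = ANY (
    (SELECT mlv.child_arr
     FROM might_like_vector mlv
     WHERE mlv.item = 125
       AND mlv.created_at > now() - interval '1 week'
     LIMIT 1)::integer[]
)
LIMIT 16;
```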
Just as a little advice: I would actually use joins, for performance
reasons. WHERE ... IN seems to be rather slow, especially if you use it
on large sets of data.
Marcus Engene wrote:
Hi List,
I have the might_like table that contains products a user might like if
he likes the present one (item).
CREATE TABLE might_like
(
item INTEGER NOT NULL
,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child INTEGER NOT NULL
)
WITHOUT OIDS;
CREATE INDEX might_like_x1 ON might_like(item);
Since there are (will be) hundreds of thousands of items, and 20+ might-
like items each, I thought it would be nice to reduce the set to 1/20th
by using a vector.
CREATE TABLE might_like_vector
(
item INTEGER NOT NULL
,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child_arr INTEGER[]
)
WITHOUT OIDS;
You haven't reduced the set at all, you've just turned part of it
sideways. You might gain something on your search, but I'm guessing
you've not tested it.
Hmm - the attached script generates 100,000 items and 10 liked ones for
each (well, for the first 99,990 it says you like the next 10 items).
They're all given different timestamps at day intervals, which means
you'll end up with six or seven matches for your sample query.
But then this doesn't work:
select
...
from
item pic
where
pic.objectid in (
select mlv.child_arr
from might_like_vector mlv
where mlv.item = 125 AND
mlv.created_at > now() - interval '1 week'
)
limit 16
Without messing around with arrays you get this query (which seems
readable enough to me)
SELECT
objectid, objname
FROM
items i
JOIN might_like m ON (i.objectid = m.child)
WHERE
m.created_at > (now() - '1 week'::interval)
AND m.item = 125
ORDER BY
objectid
LIMIT
16
;
I'm getting times less than a millisecond for this - are you sure it's
worth fiddling with arrays?
--
Richard Huxton
Archonet Ltd
Attachment: item_might_like.sql
Richard Huxton wrote:
Marcus Engene wrote:
[schema and queries snipped - see above]
You haven't reduced the set at all, you've just turned part of it
sideways. You might gain something on your search, but I'm guessing
you've not tested it. Hmm - the attached script generates 100,000 items
and 10 liked ones for each (well, for the first 99,990 it says you like
the next 10 items). They're all given different timestamps at day
intervals, which means you'll end up with six or seven matches for your
sample query.
Sorry, I was a bit unclear. This runs on a web server. The might-like
rows are generated if they are not available for an item when the
corresponding page is generated. The one-row-per-might-like-item version
has been online since yesterday and, as I write this, has 1/2 M rows in
it. The caching is primarily initiated by a Google indexer agent.
This might-like generation is expensive, so caching the results in the
db is a must; the web server's CPU usage halved when this caching went
live and the most commonly used items were cached.
When the might-like data is read from the database, the generated
presentation HTML is put in memcached with a timeout of 1h (presently).
Memcached here is probably way overkill, but using it in situations like
this makes the site more scalable when adding web servers and postpones
the problem of clustering pg.
So with memcached I care less about saving a few ms of select latency
and more about postponing other approaching problems, like keeping the
db dump manageable. Right now it's a 100MB gzipped dump, which is very
manageable, so where it's possible I'd like to keep the data compact. I
imagine it's cheaper disk- and dump-wise to do the array thing in this
and some other similar usages, and therefore it would be nice to have a
non-ugly usage pattern. Also, we're going to collect usage statistics to
further optimize the behavior of the site, and I'm really worried about
these millions of rows.
If this is a bad approach to the problem, I'm very interested to hear
it. Regardless of the best approach, a "proper" solution to the IN
subquery would be much appreciated too :)
Thanks for your answer!
Best regards,
Marcus
Marcus Engene wrote:
So with memcached I care less about saving a few ms of select latency
and more about postponing other approaching problems, like keeping the
db dump manageable. Right now it's a 100MB gzipped dump, which is very
manageable, so where it's possible I'd like to keep the data compact.
If you remove the duplication you reduce the amount it will compress by.
Oh - and you're not gzipping the dump yourself are you? Use the "-Fc"
format to dump with and it'll already be compressed.
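For example (hypothetical database names; -Fc writes pg_dump's custom format, which is compressed by default and is restored with pg_restore rather than psql):

```shell
# Dump in compressed custom format instead of gzipping a plain SQL dump
pg_dump -Fc mydb > mydb.dump

# Restore into another database; custom format also allows selective
# restore of individual tables with pg_restore's -t option
pg_restore -d mydb_copy mydb.dump
```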
I imagine it's cheaper disk & dump wise to do the array thing in this
and some other similar usages, and therefore it would be nice to have a
non-ugly usage pattern.
Don't imagine, test. And then factor in the cost of fiddling around with
arrays when you need to access individual values. And the cost of the
time you spent working on all this.
Also, we're going to collect usage statistics to
further optimize the behavior of the site, and I'm really worried about
these millions of rows.
Either:
1. Collect usage stats, *then* optimise. Then collect more stats and
make sure what you did was useful.
2. Stop worrying, and knock off early. Go have a couple of drinks with
work colleagues and unwind - it's the weekend!
3. Flip a coin and decide how to proceed based on that.
Now, #1 is the most effective but also the most effort and cost. Numbers
2 and 3 are about equal, but both beat spending time optimising
something you haven't measured yet.
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
I imagine it's cheaper disk & dump wise to do the array thing in this and
some other similar usages, and therefore it would be nice to have a
non-ugly usage pattern.
Don't imagine, test. And then factor in the cost of fiddling around with
arrays when you need to access individual values. And the cost of the
time you spent working on all this.
On my dev 8.2.4, using real values from a db dump with
931873 might-like rows
46539 might-like-vector rows
I get:
Might like (row version):
10s to dump the second time, 38MB text, 4MB gzip
Might like vector:
2s to dump the second time, 7.6MB text, 2MB gzip
Might like (row version)
explain cost, my in () version: ~200
explain cost, join on: ~670
explain cost, virtual table *): ~670
*)
select
...
from
(select ...) as a.b
Might like vector:
explain cost, my in (): 1669
If there were a "generate_series"-like function for arrays, the choice
would have been easy, I think.
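[Such a function does exist in later releases: unnest(), added in PostgreSQL 8.4, expands an array into rows, so the set-returning rewrite needs neither generate_series nor a hard-coded array length. A sketch, assuming 8.4 or later (the 8.2.4 install above predates it):]

```sql
SELECT pic.objectid
FROM item pic
WHERE pic.objectid IN (
    SELECT unnest(mlv.child_arr)    -- one row per array element
    FROM might_like_vector mlv
    WHERE mlv.item = 125
      AND mlv.created_at > now() - interval '1 week'
)
LIMIT 16;
```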
Best regards,
Marcus