Efficient sorting the results of a join, without denormalization

Started by Glen M. Witherington · May 30, 2015 · 9 messages · pgsql-general

#1 Glen M. Witherington
glen@fea.st

Sorry about the horrendous subject, let me explain by example:

Let's take this schema:

```
CREATE TABLE a (
id bigserial PRIMARY KEY,
created_at timestamp with time zone NOT NULL DEFAULT NOW()
);

CREATE TABLE b(
id bigserial PRIMARY KEY,
a_id bigint NOT NULL REFERENCES a(id),
created_at timestamp with time zone NOT NULL DEFAULT NOW()
);

CREATE TABLE c(
id bigserial PRIMARY KEY,
b_id bigint NOT NULL REFERENCES b(id),
created_at timestamp with time zone NOT NULL DEFAULT NOW()
);
```

And let's fill it up with some dummy data, that roughly matches the
distribution of mine:

```
INSERT INTO a SELECT FROM generate_series(1, 5);
INSERT INTO b(a_id) SELECT (i % 5) + 1 FROM generate_series(1, 100) i;
INSERT INTO c(b_id) SELECT (trunc(random() * 100)+1) FROM
generate_series(1, 1000000);
```

And here's the query I want to do, efficiently:

```
SELECT * FROM c
JOIN b ON b.id = c.b_id
JOIN a ON a.id = b.a_id
WHERE a.id = 3
ORDER BY b.created_at DESC
LIMIT 10
```

There seems to be simply no index I can put on the data that allows me
to run this query efficiently. After four hours of playing with this, the
only solution I can see is denormalizing table `c` by adding a field for
b's `a_id` and then creating an index on `(bs_a_id, created_at)`.

But surely there's a better solution?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 David G. Johnston
david.g.johnston@gmail.com
In reply to: Glen M. Witherington (#1)
Re: Efficient sorting the results of a join, without denormalization

On Saturday, May 30, 2015, Glen M. Witherington <glen@fea.st> wrote:

Sorry about the horrendous subject, let me explain by example:

[ schema, test data, and query trimmed; see #1 ]

There seems to be simply no index I can put on the data that allows me
to run this query efficiently. After four hours of playing with this, the
only solution I can see is denormalizing table `c` by adding a field for
b's `a_id` and then creating an index on `(bs_a_id, created_at)`.

But surely there's a better solution?

This is one problem with using made up surrogate keys...

The PK of A is a component of both the PK of B and the PK of C, but you
throw that information away by using serial fields for PKs instead. You
should have unique indexes on B and C that incorporate the ID from A, and
then you will indeed end up with a join sequence that can be executed
efficiently.

All that said you really should put indexes on the foreign keys...
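A minimal sketch of the kind of schema this describes (hypothetical DDL; David does not spell it out, so column ordering and names are guesses):

```sql
-- a stays as-is; b and c carry a's id as part of their keys.
CREATE TABLE b (
  a_id       bigint NOT NULL REFERENCES a(id),
  id         bigserial,
  created_at timestamp with time zone NOT NULL DEFAULT now(),
  PRIMARY KEY (a_id, id)
);

CREATE TABLE c (
  a_id       bigint NOT NULL,
  b_id       bigint NOT NULL,
  id         bigserial,
  created_at timestamp with time zone NOT NULL DEFAULT now(),
  PRIMARY KEY (a_id, b_id, id),
  FOREIGN KEY (a_id, b_id) REFERENCES b (a_id, id)
);

-- With a's id present on c, a single composite index can drive a
-- "WHERE a_id = 3 ORDER BY created_at DESC LIMIT 10" scan directly:
CREATE INDEX ON c (a_id, created_at);
```

The foreign key `(a_id, b_id) REFERENCES b (a_id, id)` is what enforces that a row of c can only point at a b row belonging to the same a.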

I haven't run the code to see the actual plan....

David J.

#3 Glen M. Witherington
glen@fea.st
In reply to: David G. Johnston (#2)
Re: Efficient sorting the results of a join, without denormalization

On Sat, May 30, 2015, at 11:33 PM, David G. Johnston wrote:

This is one problem with using made up surrogate keys...

The PK of A is a component of both the PK of B and the PK of C but you throw that information away by using serial fields for PKs instead.  You should have unique indexes on B and C that incorporate the ID from A

That is quite a strange schema though, isn't it? If you imagine it as
emails:

C = Emails
B = Folder
A = User

Now you're suggesting that even though an email belongs to a folder,
which belongs to a user... each email should also contain a
reference to the user? I guess that's fine, but it seems less than ideal
from a redundancy perspective.

All that said you really should put indexes on the foreign keys...

Yeah, of course. I purposely left that out, as I was asking which
indexes would need to be created to support that query.

Thanks for your help!


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Glen M. Witherington (#1)
Re: Efficient sorting the results of a join, without denormalization

"Glen M. Witherington" <glen@fea.st> writes:

And here's the query I want to do, efficiently:

SELECT * FROM c
JOIN b ON b.id = c.b_id
JOIN a ON a.id = b.a_id
WHERE a.id = 3
ORDER BY b.created_at DESC
LIMIT 10

At least for that dummy data, this seems sufficient:

```
regression=# create index on b (a_id, created_at);
CREATE INDEX
regression=# explain analyze SELECT * FROM c
JOIN b ON b.id = c.b_id
JOIN a ON a.id = b.a_id
WHERE a.id = 3
ORDER BY b.created_at DESC
LIMIT 10;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.14..21.95 rows=10 width=64) (actual time=0.064..1.176 rows=10 loops=1)
   ->  Nested Loop  (cost=0.14..436079.81 rows=200000 width=64) (actual time=0.063..1.173 rows=10 loops=1)
         Join Filter: (b.id = c.b_id)
         Rows Removed by Join Filter: 1218
         ->  Nested Loop  (cost=0.14..9.81 rows=20 width=40) (actual time=0.035..0.035 rows=1 loops=1)
               ->  Index Scan Backward using b_a_id_created_at_idx on b  (cost=0.14..8.49 rows=20 width=24) (actual time=0.019..0.019 rows=1 loops=1)
                     Index Cond: (a_id = 3)
               ->  Materialize  (cost=0.00..1.07 rows=1 width=16) (actual time=0.013..0.013 rows=1 loops=1)
                     ->  Seq Scan on a  (cost=0.00..1.06 rows=1 width=16) (actual time=0.009..0.009 rows=1 loops=1)
                           Filter: (id = 3)
                           Rows Removed by Filter: 2
         ->  Materialize  (cost=0.00..27230.00 rows=1000000 width=24) (actual time=0.008..0.811 rows=1228 loops=1)
               ->  Seq Scan on c  (cost=0.00..16370.00 rows=1000000 width=24) (actual time=0.007..0.310 rows=1228 loops=1)
 Planning time: 0.796 ms
 Execution time: 1.390 ms
(15 rows)
```

regards, tom lane


#5 Glen M. Witherington
glen@fea.st
In reply to: Tom Lane (#4)
Re: Efficient sorting the results of a join, without denormalization

On Sun, May 31, 2015, at 12:53 AM, Tom Lane wrote:

"Glen M. Witherington" <glen@fea.st> writes:

And here's the query I want to do, efficiently:

SELECT * FROM c
JOIN b ON b.id = c.b_id
JOIN a ON a.id = b.a_id
WHERE a.id = 3
ORDER BY b.created_at DESC
LIMIT 10

At least for that dummy data, this seems sufficient:

regression=# create index on b (a_id, created_at);
CREATE INDEX

[ EXPLAIN ANALYZE output trimmed; see #4 ]

regards, tom lane

Wow, sorry I screwed up the query. It should be:

ORDER BY c.created_at DESC

Not b; otherwise, as you noted, it's trivial to index. Sorry!


#6 Francisco Olarte
folarte@peoplecall.com
In reply to: Glen M. Witherington (#3)
Re: Efficient sorting the results of a join, without denormalization

Hi Glen:

On Sun, May 31, 2015 at 6:43 AM, Glen M. Witherington <glen@fea.st> wrote:

On Sat, May 30, 2015, at 11:33 PM, David G. Johnston wrote:

This is one problem with using made up surrogate keys...
The PK of A is a component of both the PK of B and the PK of C but you throw that information away by using serial fields for PKs instead. You should have unique indexes on B and C that incorporate the ID from A

That is quite a strange schema though, isn't it? If you imagine it as
emails:

C = Emails
B = Folder
A = User

Now you're suggesting that even though an email belongs to a folder,
which belongs to a user... each email should also contain a
reference to the user? I guess that's fine, but it seems less than ideal
from a redundancy perspective.

It may seem, and be, less than ideal from a redundancy perspective, but the
keys are more natural. It means you have user (Glen), folder (Glen, PGlist)
and message (Glen, PGlist, 27), different from (Glen, Inbox, 27), (Glen,
PGlist, 28) or (Francisco, PGlist, 27) (where the 'tuples' I've printed
are the PK values). This has a lot of advantages, which you pay for
in other ways, like redundancy, but having composite primary keys
sometimes works in your favor, as you can express restrictions with the
relationships and build composite indexes for ad hoc queries. In this
case (an email database), a serial could be used (instead of the
name) for the user and folder PKs, and you'd still have very fast, simple
queries from a MUA for things like 'select * from messages where
user_id = <prefetched_id> and not read order by timestamp desc limit
100'. It will also help catch things like mismatched folder ids, or
using the user id as a folder id, which are easy mistakes to make when all
the keys are synthetic, meaningless numbers.

As an example, I have a currency table with its serial key
currency_id; a seller table, which sells just one currency and whose
PK is (currency_id, seller_id); a rate table with PK
(currency_id, rate_id); and an allowed-rates table (to record which
rates a seller can use) with primary key (currency_id, seller_id,
rate_id) and foreign keys (currency_id, seller_id) and (currency_id,
rate_id). It is more or less a classical example. The composite keys
guarantee I can only allow a seller to sell rates in her currency.

I can also, if needed, build unique indexes on any single id (they
are all serials, as I have no other candidate keys), but given my
access patterns I normally have all of them, and things like
populating a drop-down to allow new rates for a seller are very
easy.
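Francisco's currency example, sketched as DDL (table and column names are guesses reconstructed from his description):

```sql
CREATE TABLE currency (
  currency_id serial PRIMARY KEY
);

CREATE TABLE seller (
  currency_id int NOT NULL REFERENCES currency,
  seller_id   serial,
  PRIMARY KEY (currency_id, seller_id)
);

CREATE TABLE rate (
  currency_id int NOT NULL REFERENCES currency,
  rate_id     serial,
  PRIMARY KEY (currency_id, rate_id)
);

-- The composite foreign keys force both halves of the relationship
-- to share one currency_id: a seller can only be allowed rates
-- denominated in her own currency.
CREATE TABLE allowed_rate (
  currency_id int NOT NULL,
  seller_id   int NOT NULL,
  rate_id     int NOT NULL,
  PRIMARY KEY (currency_id, seller_id, rate_id),
  FOREIGN KEY (currency_id, seller_id) REFERENCES seller,
  FOREIGN KEY (currency_id, rate_id)   REFERENCES rate
);
```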

Francisco Olarte.


#7 Bill Moran
wmoran@potentialtech.com
In reply to: Glen M. Witherington (#5)
Re: Efficient sorting the results of a join, without denormalization

On Sun, 31 May 2015 04:50:00 -0500
"Glen M. Witherington" <glen@fea.st> wrote:

On Sun, May 31, 2015, at 12:53 AM, Tom Lane wrote:

"Glen M. Witherington" <glen@fea.st> writes:

And here's the query I want to do, efficiently:

SELECT * FROM c
JOIN b ON b.id = c.b_id
JOIN a ON a.id = b.a_id
WHERE a.id = 3
ORDER BY b.created_at DESC
LIMIT 10

At least for that dummy data, this seems sufficient:

regression=# create index on b (a_id, created_at);
CREATE INDEX

[ EXPLAIN ANALYZE output trimmed; see #4 ]

regards, tom lane

Wow, sorry I screwed up the query. It should be:

ORDER BY c.created_at DESC

Not b; otherwise, as you noted, it's trivial to index. Sorry!

Creating an index on c.created_at sped things up by a factor of over
1000, which caused the case you defined to run in ~0.5ms for me.
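Bill doesn't show the exact index, but presumably it is simply an index on the sort column:

```sql
CREATE INDEX ON c (created_at);

-- The planner can walk this index backwards (newest first) and, for
-- each row, nested-loop into b and a, stopping as soon as 10 rows
-- survive the a.id = 3 filter.
```

Note that this is fast here because rows belonging to a.id = 3 are common in the dummy data; for a rare a_id, the backward scan could read a large fraction of the index before collecting ten matches.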

--
Bill Moran <wmoran@potentialtech.com>


#8 Glen M. Witherington
glen@fea.st
In reply to: Francisco Olarte (#6)
Re: Efficient sorting the results of a join, without denormalization

On Sun, May 31, 2015, at 01:16 PM, Francisco Olarte wrote:

[ Francisco's message quoted in full; trimmed, see #6 ]

Thanks Francisco, that makes sense. I've started moving my code over to
that approach, and it eliminates all the performance issues I had.

I guess I was really hoping there would exist some sort of "dereference"
option when indexing, so I could dereference a foreign key and then
index on an attribute of that row, e.g. so I could have created an index
such as:

deref(deref(mail.folder_id).user_id, created_at)
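No such dereferencing index exists in PostgreSQL, but its effect can be approximated with the denormalized column from #1 kept consistent by a trigger (a sketch under that assumption, not something shown in the thread; names are made up):

```sql
-- Denormalize b's a_id onto c, maintained automatically.
ALTER TABLE c ADD COLUMN a_id bigint;

CREATE FUNCTION c_fill_a_id() RETURNS trigger AS $$
BEGIN
  -- Look up the a_id of the referenced b row and stamp it on c.
  SELECT b.a_id INTO NEW.a_id FROM b WHERE b.id = NEW.b_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER c_fill_a_id
  BEFORE INSERT OR UPDATE OF b_id ON c
  FOR EACH ROW EXECUTE PROCEDURE c_fill_a_id();

-- The query can then be "WHERE c.a_id = 3 ORDER BY c.created_at
-- DESC LIMIT 10", driven entirely by this index:
CREATE INDEX ON c (a_id, created_at);
```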


#9 Francisco Olarte
folarte@peoplecall.com
In reply to: Glen M. Witherington (#8)
Re: Efficient sorting the results of a join, without denormalization

Hi Glen:

On Mon, Jun 1, 2015 at 1:16 AM, Glen M. Witherington <glen@fea.st> wrote:

Thanks Francisco, that makes sense. I've started moving my code to that,
and it eliminates all the performance issues I had.

Happy to hear it. It seems you have a kind of speed-size trade-off; if
you can solve it while preserving integrity, great for you.

I guess I was really hoping there would exist some sort of "dereference"
option when indexing, so I could dereference a foreign key, and then
index on a attribute of that row. E.g. So I could have created an index
such as:
deref(deref(mail.folder_id).user_id, created_at)

That is difficult, as it would need a way to force mail.folder.user_id
to be constant, or a lot of rules/triggers (manual, automatic,
or just plain magic) to keep your indexes updated. In these cases,
composite keys let you use composite indexes to accelerate your queries
while preserving normalization; if you can afford them, they are nice.
The big problem often comes when you try to mix them with ORMs and
similar tools.

Francisco Olarte.
