Confusion about composite indexes

Started by Bill Mitchellalmost 14 years ago6 messagesgeneral
Jump to latest
#1Bill Mitchell
bill@publicrelay.com

I am searching for some logic behind the selection of an index in
postgres -- it seems that if I have a composite index based on both
columns in a join table, it's only referenced if I query on the first
term in the composite index. I've read
http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over
and over and think that this is the same scenario as what I face.

As an example:
OUTLET: has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a
million rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

Table "public.outlet_media"
Column | Type | Modifiers
-----------+--------+-----------
outlet_id | bigint | not null
media_id | bigint | not null
Indexes:
"outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
"fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
"fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the
outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select
outlet_id from outlet order by random() limit 50);
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
time=841.115..884.669 rows=50 loops=1)
-> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
time=841.048..841.090 rows=50 loops=1)
-> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
time=840.980..841.011 rows=50 loops=1)
-> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
(actual time=840.978..840.991 rows=50 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on outlet (cost=0.00..31511.88
rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
-> Index Scan using outlet_media_pkey on outlet_media
(cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
Index Cond: (outlet_id = outlet.outlet_id)
Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select
media_id from media where media_name='Online News');
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
time=0.034..0.034 rows=0 loops=1)
Hash Cond: (outlet_media.media_id = media.media_id)
-> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376
width=16) (actual time=0.012..0.012 rows=1 loops=1)
-> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
time=0.013..0.013 rows=0 loops=1)
-> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
(actual time=0.012..0.012 rows=0 loops=1)
Filter: ((media_name)::text = 'Online News'::text)
Total runtime: 0.084 ms
(9 rows)

Thanks in advance for whatever light can be shed. If it's safer for me
to just create individual indexes on each of the two columns ("
Multicolumn indexes should be used sparingly. In most situations, an
index on a single column is sufficient and saves space and time")

regards
Bill

#2Chris Curvey
chris@chriscurvey.com
In reply to: Bill Mitchell (#1)
Re: Confusion about composite indexes

On Mon, May 21, 2012 at 3:34 PM, Bill Mitchell <bill@publicrelay.com> wrote:

I am searching for some logic behind the selection of an index in
postgres -- it seems that if I have a composite index based on both columns
in a join table, it's only referenced if I query on the first term in the
composite index. I've read
http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over
and over and think that this is the same scenario as what I face.

As an example:
OUTLET: has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a million
rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

Table "public.outlet_media"
Column | Type | Modifiers
-----------+--------+-----------
outlet_id | bigint | not null
media_id | bigint | not null
Indexes:
"outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
"fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
"fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the
outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select
outlet_id from outlet order by random() limit 50);
QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
time=841.115..884.669 rows=50 loops=1)
-> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
time=841.048..841.090 rows=50 loops=1)
-> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
time=840.980..841.011 rows=50 loops=1)
-> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
(actual time=840.978..840.991 rows=50 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on outlet (cost=0.00..31511.88
rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
-> Index Scan using outlet_media_pkey on outlet_media
(cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
Index Cond: (outlet_id = outlet.outlet_id)
Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select
media_id from media where media_name='Online News');
QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
time=0.034..0.034 rows=0 loops=1)
Hash Cond: (outlet_media.media_id = media.media_id)
-> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376
width=16) (actual time=0.012..0.012 rows=1 loops=1)
-> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
time=0.013..0.013 rows=0 loops=1)
-> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
(actual time=0.012..0.012 rows=0 loops=1)
Filter: ((media_name)::text = 'Online News'::text)
Total runtime: 0.084 ms
(9 rows)

Thanks in advance for whatever light can be shed. If it's safer for me to
just create individual indexes on each of the two columns (" Multicolumn
indexes should be used sparingly. In most situations, an index on a single
column is sufficient and saves space and time")

regards
Bill

You are absolutely correct. It may be easier to understand if you think
about the structure of the index keys. For clarity, I'm going to pretend
that the "outlet" id is a character instead of an integer.

The index is sorted in the order of the columns in the index. So our
sorted values might look something like this (my apologies if the spacing
does not come through properly)

outlet media
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3

So if I want to search for all the entries where outlet = 'a', then I find
the first 'A' record, search until I find something that is not 'A', and I
know that I'm done.

But to find all the entries where media = 1, I would have to search every
key in the index. And because there is an unknown amount of work involved
in searching the index, the optimizer assumes that using the index will be
more work than just scanning the table.

So what should you do? I would keep the 2-key index, because you want that
for uniqueness. And the index as you have structured it is useful for
searching if you are specifying the outlet. But you may want to add a
non-unique index on the media_id column, so that you will have a useful
index for searching.

The Standard Index Tradeoff Disclaimer (TM) applies: indexes can speed up
searches, but they will slow down inserts/updates/deletes because the index
must be maintained.

--
e-Mail is the equivalent of a postcard written in pencil. This message may
not have been sent by me, or intended for you. It may have been read or
even modified while in transit. e-Mail disclaimers have the same force in
law as a note passed in study hall. If your corporate attorney says that
you need an disclaimer in your signature, you need a new corporate attorney.

#3Merlin Moncure
mmoncure@gmail.com
In reply to: Bill Mitchell (#1)
Re: Confusion about composite indexes

On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill@publicrelay.com> wrote:

I am searching for some logic behind the selection of an index in postgres
-- it seems that if I have a composite index based on both columns in a join
table, it's only referenced if I query on the first term in the composite
index.  I've read
http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and
over and think that this is the same scenario as what I face.

As an example:
OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a million
rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

  Table "public.outlet_media"
  Column   |  Type  | Modifiers
-----------+--------+-----------
 outlet_id | bigint | not null
 media_id  | bigint | not null
Indexes:
    "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
    "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
    "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the
outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select
outlet_id from outlet order by random() limit 50);
                                                                QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual
time=841.115..884.669 rows=50 loops=1)
   ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual
time=841.048..841.090 rows=50 loops=1)
         ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual
time=840.980..841.011 rows=50 loops=1)
               ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8)
(actual time=840.978..840.991 rows=50 loops=1)
                     Sort Key: (random())
                     Sort Method: top-N heapsort  Memory: 27kB
                     ->  Seq Scan on outlet  (cost=0.00..31511.88
rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
   ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43
rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
         Index Cond: (outlet_id = outlet.outlet_id)
 Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select
media_id from media where media_name='Online News');
                                                      QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual
time=0.034..0.034 rows=0 loops=1)
   Hash Cond: (outlet_media.media_id = media.media_id)
   ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16)
(actual time=0.012..0.012 rows=1 loops=1)
   ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
rows=0 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 0kB
         ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual
time=0.013..0.013 rows=0 loops=1)
               ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8)
(actual time=0.012..0.012 rows=0 loops=1)
                     Filter: ((media_name)::text = 'Online News'::text)
 Total runtime: 0.084 ms
(9 rows)

Thanks in advance for whatever light can be shed.  If it's safer for me to
just create individual indexes on each of the two columns  (" Multicolumn
indexes should be used sparingly. In most situations, an index on a single
column is sufficient and saves space and time")

There are a couple of things going on here. First of all, 'order by
random() limit..' guarantees a full seq scan and a short of whatever
you're ordering, so it's never going to be very efficient. When you
wen the other way, you supplied a hard search term which can be fed to
the index and optimized through.

Secondly, if you have a mapping table that has to support searches in
both directions, it's pretty typical to do something like this:

create table map(a int references ..., b int references..., primary key(a,b));
create index on map(b);

So you can get fully index lookups on all of a, b, ab, and ba. the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order. They can
still help somewhat, but to a lesser degree.

merlin

#4Dmitriy Igrishin
dmitigr@gmail.com
In reply to: Merlin Moncure (#3)
Re: Confusion about composite indexes

2012/5/22 Merlin Moncure <mmoncure@gmail.com>

On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill@publicrelay.com>
wrote:

I am searching for some logic behind the selection of an index in

postgres

-- it seems that if I have a composite index based on both columns in a

join

table, it's only referenced if I query on the first term in the composite
index. I've read
http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over

and

over and think that this is the same scenario as what I face.

As an example:
OUTLET: has OUTLET_ID as a primary key, consisting of about a million

rows

MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a

million

rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

Table "public.outlet_media"
Column | Type | Modifiers
-----------+--------+-----------
outlet_id | bigint | not null
media_id | bigint | not null
Indexes:
"outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
"fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES

media(media_id)

"fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the
outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select
outlet_id from outlet order by random() limit 50);
QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
time=841.115..884.669 rows=50 loops=1)
-> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
time=841.048..841.090 rows=50 loops=1)
-> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
time=840.980..841.011 rows=50 loops=1)
-> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
(actual time=840.978..840.991 rows=50 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on outlet (cost=0.00..31511.88
rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
-> Index Scan using outlet_media_pkey on outlet_media

(cost=0.00..8.43

rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
Index Cond: (outlet_id = outlet.outlet_id)
Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select
media_id from media where media_name='Online News');
QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------

Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
time=0.034..0.034 rows=0 loops=1)
Hash Cond: (outlet_media.media_id = media.media_id)
-> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376

width=16)

(actual time=0.012..0.012 rows=1 loops=1)
-> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
time=0.013..0.013 rows=0 loops=1)
-> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
(actual time=0.012..0.012 rows=0 loops=1)
Filter: ((media_name)::text = 'Online News'::text)
Total runtime: 0.084 ms
(9 rows)

Thanks in advance for whatever light can be shed. If it's safer for me

to

just create individual indexes on each of the two columns (" Multicolumn
indexes should be used sparingly. In most situations, an index on a

single

column is sufficient and saves space and time")

There are a couple of things going on here. First of all, 'order by
random() limit..' guarantees a full seq scan and a short of whatever
you're ordering, so it's never going to be very efficient. When you
wen the other way, you supplied a hard search term which can be fed to
the index and optimized through.

Secondly, if you have a mapping table that has to support searches in
both directions, it's pretty typical to do something like this:

create table map(a int references ..., b int references..., primary
key(a,b));
create index on map(b);

So you can get fully index lookups on all of a, b, ab, and ba. the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order. They can
still help somewhat, but to a lesser degree.

BTW, I would like to know is it worth it to create 3rd index on map(a)
to reduce the size of the index which will be used by the planer
to save some server's RAM (obviously, at the cost of extra disk space) ?

--
// Dmitriy.

#5Merlin Moncure
mmoncure@gmail.com
In reply to: Dmitriy Igrishin (#4)
Re: Confusion about composite indexes

On Mon, May 21, 2012 at 3:36 PM, Dmitriy Igrishin <dmitigr@gmail.com> wrote:

So you can get fully index lookups on all of a, b, ab, and ba.  the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order.  They can
still help somewhat, but to a lesser degree.

BTW, I would like to know is it worth it to create 3rd index on map(a)
to reduce the size of the index which will be used by the planer
to save some server's RAM (obviously, at the cost of extra disk space) ?

What Dmitriy is talking about here is that even though an index on
(a,b) can efficiently (in terms of searching through the tree) match
terms on just 'a', you still pay a price because the entries on the
index have to store the data for b as well, So even though it's
algorithmically efficient you have to browse more data to do it which
pressures RAM. In other words, an index on just 'a' is ideal for
searches on just 'a', although a,b is much better than (b,a) or no
index at all.

I personally think that generally it's better not to do that in most
cases especially if you're indexing integer keys since you're not
making *that* much difference on the overall index size. Also,
primary key indexes are much more likely to have to stay 'hot' in the
cache anyways since they will be serving fkey reference lookups and
stuff like that so in the end you might be consuming *more* ram, not
less.

An exception might be if your key on a,b has a very small 'a' and a
very large 'b'. But that's pretty rare in practice and it's usually a
good idea to avoid indexing large fields if you can help it. It
really depends on the workload.

merlin

#6Bill Mitchell
bill@publicrelay.com
In reply to: Merlin Moncure (#5)
Re: Confusion about composite indexes

Thanks to everybody's input -- as a first-time poster to this listserv,
I wasn't sure how long it would take to get a response. ;)

I was frankly astonished to see that the composite index on (a,b) was
used when I searched for (a), but Chris' response makes total sense.

In this case, I don't want to go with a MAP due to the fact that I'm
actually using Java Hibernate to generate this schema and access it.

My sample query of using RANDOM() to select a random subset of the
overall outlets was actually to try and defeat any prior caching of
results, and give a more reasonable measurement -- but I didn't realize
the implications. I had thought that coupled with a MAX clause at the
end it would simply randomize and then bail out early instead of a full
table scan - so thanks to Merlin for pointing that out.

I'll go with a 2nd index on MEDIA_ID and do some measurements of speed
increase, but it makes a lot more sense now.

thank you Postgres gurus! :D

regards
Bill

Show quoted text

On 5/21/12 5:11 PM, Merlin Moncure wrote:

On Mon, May 21, 2012 at 3:36 PM, Dmitriy Igrishin <dmitigr@gmail.com> wrote:

So you can get fully index lookups on all of a, b, ab, and ba. the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order. They can
still help somewhat, but to a lesser degree.

BTW, I would like to know is it worth it to create 3rd index on map(a)
to reduce the size of the index which will be used by the planer
to save some server's RAM (obviously, at the cost of extra disk space) ?

What Dmitriy is talking about here is that even though an index on
(a,b) can efficiently (in terms of searching through the tree) match
terms on just 'a', you still pay a price because the entries on the
index have to store the data for b as well, So even though it's
algorithmically efficient you have to browse more data to do it which
pressures RAM. In other words, an index on just 'a' is ideal for
searches on just 'a', although a,b is much better than (b,a) or no
index at all.

I personally think that generally it's better not to do that in most
cases especially if you're indexing integer keys since you're not
making *that* much difference on the overall index size. Also,
primary key indexes are much more likely to have to stay 'hot' in the
cache anyways since they will be serving fkey reference lookups and
stuff like that so in the end you might be consuming *more* ram, not
less.

An exception might be if your key on a,b has a very small 'a' and a
very large 'b'. But that's pretty rare in practice and it's usually a
good idea to avoid indexing large fields if you can help it. It
really depends on the workload.

merlin