surprising query optimisation
Hi All,
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries
from users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with
an index on it. Despite being a string, this column is only used to
store one of three values: 'NEW', 'ACK', or 'RSV'.
One of our most common query clauses is "state!='RSV'" and we've found
that by substituting this clause with "state='ACK' or state='NEW'"
wherever it was used, we've dropped the postgres server's load average
from 20 down to 4 and the CPU usage from 60% in user space down to <5%.
This seems counter-intuitive to me, so thought I'd ask here. Why would
this be likely to make such a difference? We're currently on 9.4, is
this something that's likely to be different (better? worse?) if we got
all the way up to 10 or 11?
cheers,
Chris
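For concreteness, the rewrite described above amounts to the following (a sketch using the alerts_alert table and state column described later in the thread; the real application queries are of course more complex):

```sql
-- Original form: exclude the one value we don't want.
SELECT * FROM alerts_alert WHERE state != 'RSV';

-- Rewritten form: logically equivalent, because 'state' only ever
-- holds 'NEW', 'ACK', or 'RSV', but expressed as equalities the
-- planner can match against per-value statistics.
SELECT * FROM alerts_alert WHERE state = 'ACK' OR state = 'NEW';
```

An equivalent spelling of the second form is `state IN ('ACK', 'NEW')`.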
On 11/28/18 2:26 PM, Chris Withers wrote:
Hi All,
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries
from users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with
an index on it. Despite being a string, this column is only used to
store one of three values: 'NEW', 'ACK', or 'RSV'.
One of our most common query clauses is "state!='RSV'" and we've found
that by substituting this clause with "state='ACK' or state='NEW'"
wherever it was used, we've dropped the postgres server's load average
from 20 down to 4 and the CPU usage from 60% in user space down to <5%.
This seems counter-intuitive to me, so thought I'd ask here. Why would
this be likely to make such a difference? We're currently on 9.4, is
this something that's likely to be different (better? worse?) if we got
all the way up to 10 or 11?
cheers,
Chris
The way I see it, state = 'something' is a bounded question, while state !=
'something' is potentially unbounded.
Does EXPLAIN ANALYZE shed any light?
--
Adrian Klaver
adrian.klaver@aklaver.com
On 29/11/2018 11:26, Chris Withers wrote:
Hi All,
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries
from users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with
an index on it. Despite being a string, this column is only used to
store one of three values: 'NEW', 'ACK', or 'RSV'.
One of our most common query clauses is "state!='RSV'" and we've
found that by substituting this clause with "state='ACK' or
state='NEW'" wherever it was used, we've dropped the postgres server's
load average from 20 down to 4 and the CPU usage from 60% in user
space down to <5%.
This seems counter-intuitive to me, so thought I'd ask here. Why would
this be likely to make such a difference? We're currently on 9.4, is
this something that's likely to be different (better? worse?) if we
got all the way up to 10 or 11?
cheers,
Chris
At a guess...
"state!='RSV'" ==> pg only has to check one value
and
"state='ACK' or state='NEW'" ==> pg has to check two values
so I would expect the '!=' to be faster.
Cheers,
Gavin
Greetings,
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with an
index on it. Despite being a string, this column is only used to store one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
One of our most common query clauses is "state!='RSV'" and we've found
that by substituting this clause with "state='ACK' or state='NEW'" wherever
it was used, we've dropped the postgres server's load average from 20 down
to 4 and the CPU usage from 60% in user space down to <5%.
You've changed the question you're asking the database. PG doesn't
*know* that there's only those three values, but it probably has a
pretty good guess about how many records are ACK and how many are NEW
thanks to those being in the MCV list.
This seems counter-intuitive to me, so thought I'd ask here. Why would this
be likely to make such a difference? We're currently on 9.4, is this
something that's likely to be different (better? worse?) if we got all the
way up to 10 or 11?
When you change what you're asking, PG is going to change how it gives
you the answer and sometimes that'll be faster and other times it won't
be.
Really though, if you want something more than wild speculation, posting
the 'explain analyze' of each query along with the actual table
definitions and sizes and such would be the best way to get it.
I'd suggest you check out the wiki article written about this kind of
question:
https://wiki.postgresql.org/wiki/Slow_Query_Questions
Thanks!
Stephen
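The MCV list Stephen mentions can be inspected directly; a sketch against the table described later in the thread:

```sql
-- pg_stats exposes, per column, the most common values and their
-- estimated frequencies, which the planner uses to cost predicates.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'alerts_alert'
  AND attname = 'state';
```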
On 28/11/2018 22:49, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with an
index on it. Despite being a string, this column is only used to store one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
That's counter-intuitive for me. What leads you to say this and what
would you do/recommend instead?
Really though, if you want something more than wild speculation, posting
the 'explain analyze' of each query along with the actual table
definitions and sizes and such would be the best way to get it.
table definition:
# \d alerts_alert
Table "public.alerts_alert"
Column | Type | Modifiers
-----------------+--------------------------+-----------
tags | jsonb | not null
id | character varying(86) | not null
earliest_seen | timestamp with time zone | not null
latest_seen | timestamp with time zone | not null
created | timestamp with time zone | not null
modified | timestamp with time zone | not null
type | character varying(300) | not null
state | character varying(3) | not null
until | timestamp with time zone |
latest_note | text | not null
created_by_id | integer | not null
modified_by_id | integer | not null
owner_id | integer |
owning_group_id | integer | not null
latest_new | timestamp with time zone | not null
Indexes:
"alerts_alert_pkey" PRIMARY KEY, btree (id)
"alert_tags_index" gin (tags)
"alerts_alert_1efacf1d" btree (latest_seen)
"alerts_alert_3103a7d8" btree (until)
"alerts_alert_599dcce2" btree (type)
"alerts_alert_5e7b1936" btree (owner_id)
"alerts_alert_9ae73c65" btree (modified)
"alerts_alert_9ed39e2e" btree (state)
"alerts_alert_b3da0983" btree (modified_by_id)
"alerts_alert_c5151f5a" btree (earliest_seen)
"alerts_alert_e2fa5388" btree (created)
"alerts_alert_e93cb7eb" btree (created_by_id)
"alerts_alert_efea2d76" btree (owning_group_id)
"alerts_alert_id_13155e16_like" btree (id varchar_pattern_ops)
"alerts_alert_latest_new_e8d1fbde_uniq" btree (latest_new)
"alerts_alert_state_90ab480b_like" btree (state varchar_pattern_ops)
"alerts_alert_type_3021f46f_like" btree (type varchar_pattern_ops)
Foreign-key constraints:
"alerts_alert_created_by_id_520608c0_fk_alerts_user_id" FOREIGN KEY
(created_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
"alerts_alert_modified_by_id_6db4b04b_fk_alerts_user_id" FOREIGN
KEY (modified_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY
DEFERRED
"alerts_alert_owner_id_0c00548a_fk_alerts_user_id" FOREIGN KEY
(owner_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
"alerts_alert_owning_group_id_a4869b66_fk_auth_group_id" FOREIGN
KEY (owning_group_id) REFERENCES auth_group(id) DEFERRABLE INITIALLY
DEFERRED
Referenced by:
TABLE "alerts_alertevent" CONSTRAINT
"alerts_alertevent_alert_id_edd734b8_fk_alerts_alert_id" FOREIGN KEY
(alert_id) REFERENCES alerts_alert(id) DEFERRABLE INITIALLY DEFERRED
Row counts by state:
# select state, count(*) from alerts_alert group by 1 order by 1;
state | count
-------+---------
ACK | 1053
NEW | 1958
RSV | 1528623
(3 rows)
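From those counts, the predicate in question is actually highly selective; a quick sketch of the arithmetic (the FILTER aggregate form requires 9.4 or later):

```sql
-- state != 'RSV' matches 1053 + 1958 = 3011 rows out of
-- 1053 + 1958 + 1528623 = 1531634, i.e. roughly 0.2% of the table.
SELECT count(*) FILTER (WHERE state != 'RSV') AS non_rsv,
       count(*) AS total
FROM alerts_alert;
```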
here's an example of the "bad" query plan:
https://explain.depesz.com/s/cDkp
here's an example with all the "state!='RSV'" clauses rewritten as I
described:
https://explain.depesz.com/s/B9Xi
I'd suggest you check out the wiki article written about this kind of
question:
Thanks!
Chris
Greetings,
On Fri, Nov 30, 2018 at 07:52 Chris Withers <chris@withers.org> wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with
an
index on it. Despite being a string, this column is only used to store
one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
That's counter-intuitive for me. What leads you to say this and what
would you do/recommend instead?
Really though, if you want something more than wild speculation, posting
the 'explain analyze' of each query along with the actual table
definitions and sizes and such would be the best way to get it.
table definition:
[quoted table definition, row counts, and plan links snipped]
I'd suggest you check out the wiki article written about this kind of
question:
Have you tried a partial index on state != 'RSV'?
Thanks,
Stephen
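A minimal sketch of the suggested partial index (the index name is illustrative):

```sql
-- Index only the rare non-'RSV' rows: roughly 3,000 entries instead
-- of 1.5 million, and usable by queries whose WHERE clause implies
-- state != 'RSV'.
CREATE INDEX alerts_alert_state_active
    ON alerts_alert (state)
    WHERE state != 'RSV';
```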
On 30/11/2018 12:55, Stephen Frost wrote:
I'd suggest you check out the wiki article written about this kind of
question:
Have you tried a partial index on state != 'RSV'?
The solution I originally posted, that we do easily enough at our query
generation layer, is working perfectly, but this is good to know for
next time.
My post here is mainly to try and understand what's going on so I can
improve my general feel for how to use postgres at its best.
So, why was the query ending up being a big scan rather than some quick
lookups based on the index?
cheers,
Chris
Greetings,
On Fri, Nov 30, 2018 at 08:00 Chris Withers <chris@withers.org> wrote:
On 30/11/2018 12:55, Stephen Frost wrote:
I'd suggest you check out the wiki article written about this
kind of
question:
Have you tried a partial index on state != 'RSV'?
The solution I originally posted, that we do easily enough at our query
generation layer, is working perfectly, but this is good to know for
next time.
My post here is mainly to try and understand what's going on so I can
improve my general feel for how to use postgres at its best.
So, why was the query ending up being a big scan rather than some quick
lookups based on the index?
Thought that was mentioned already but at least part of the issue is that
PG can’t just search for the other values when it’s a != in the index
because it wouldn’t know what values to search for... PG doesn’t know,
with complete certainty, that there’s only 3 values.
The partial index is something you should want anyway- that index won’t
ever be used to search for RSV because it’s cheaper to just scan the table
if we are looking for those, and the index will be much, much smaller
without that common value being included.
Thanks!
Stephen
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with an
index on it. Despite being a string, this column is only used to store one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
That's counter-intuitive for me. What leads you to say this and what would
you do/recommend instead?
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Of course, for this to work you need to set up the partial index
correctly and make sure that your queries end up using it.
Thanks!
Stephen
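On "make sure that your queries end up using it": the planner will only use a partial index when it can prove a query's WHERE clause implies the index predicate. A sketch, assuming a partial index on alerts_alert created with WHERE state != 'RSV':

```sql
-- Usable: the query predicate matches the index predicate exactly.
EXPLAIN SELECT count(*) FROM alerts_alert WHERE state != 'RSV';

-- Also usable: the planner can prove state = 'ACK' implies
-- state != 'RSV'.
EXPLAIN SELECT * FROM alerts_alert WHERE state = 'ACK';

-- Not usable: nothing here implies the index predicate, so this
-- falls back to a seq scan (cheap anyway for the common value).
EXPLAIN SELECT * FROM alerts_alert WHERE state = 'RSV';
```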
On 01/12/2018 04:33, Stephen Frost wrote:
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with an
index on it. Despite being a string, this column is only used to store one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
That's counter-intuitive for me. What leads you to say this and what would
you do/recommend instead?
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Of course, for this to work you need to set up the partial index
correctly and make sure that your queries end up using it.
Thanks!
Stephen
An index simply tells pg which block to look at (assuming that the index
itself is not sufficient to satisfy the query), so if using an index
would still require that pg look at most blocks, then it's cheaper to just
scan the whole table rather than load the index and still look at all
blocks that contain the table data. I've oversimplified slightly.
An index is best used when using it results in fewer blocks being read
from disk.
Also, RAM is used more effectively when the size of the index is kept small.
For example, having an index on sex for nurses is a waste of time because
most nurses are female. However, a partial index (as suggested by
Stephen, for your query) that includes only males could be useful if you
have queries looking for male nurses (this assumes male nurses are a very
small fraction of nurses, such that most data blocks don't have rows for
male nurses, and the planner knows this).
I once optimised a very complex set of queries that made extensive use of
indexes. However, with the knowledge I have today, I would most
likely have had fewer and smaller indexes. I now realize that some of my
indexes were probably counterproductive, especially as I'd given no
thought to how much RAM would be required to host the data plus
indexes! Fortunately, while the objective was to run all those queries
within 24 hours, they actually only took a couple of hours.
Chris, I would strongly suggest you read up on the excellent
documentation pg has about indexes, but don't expect to understand it
all at one sitting...
Cheers,
Gavin
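Gavin's nurses example, as a concrete partial-index sketch (entirely hypothetical schema):

```sql
-- Hypothetical table where one value dominates.
CREATE TABLE nurses (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    sex  char(1) NOT NULL  -- mostly 'F'
);

-- A full index on sex would mostly contain 'F' duplicates; indexing
-- only the rare value keeps the index tiny and useful for queries
-- that look for male nurses.
CREATE INDEX nurses_male_idx ON nurses (sex) WHERE sex = 'M';
```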
On 30/11/2018 15:33, Stephen Frost wrote:
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries from
users and stumbled upon a very surprising optimisation.
So, we have a 'state' column which is a 3 character string column with an
index on it. Despite being a string, this column is only used to store one
of three values: 'NEW', 'ACK', or 'RSV'.
Sounds like a horrible field to have an index on.
That's counter-intuitive for me. What leads you to say this and what would
you do/recommend instead?
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Interesting! In my head, for some reason, I'd always assumed a btree
index would break down a char field based on the characters within it.
Does that never happen?
If I changed this to be an enum field, would != still perform poorly or
can the query optimiser spot that it's an enum and just look for the
other options?
cheers,
Chris
On 30/11/2018 22:10, Gavin Flower wrote:
I once optimised a very complex set of queries that made extensive use of
indexes. However, with the knowledge I have today, I would most
likely have had fewer and smaller indexes. I now realize that some of my
indexes were probably counterproductive, especially as I'd given no
thought to how much RAM would be required to host the data plus
indexes! Fortunately, while the objective was to run all those queries
within 24 hours, they actually only took a couple of hours.
So, interestingly, this box has 250GB memory in it, and even though I've
set effective_cache_size to 200GB, I only see 9G of memory being used.
How can I persuade postgres to keep more in memory?
cheers,
Chris
Stephen Frost schrieb am 30.11.2018 um 14:05:
PG doesn’t know, with complete certainty, that there’s only 3
values.
Would the optimizer consult a check constraint ensuring that?
Chris Withers schrieb am 05.12.2018 um 12:42:
So, interestingly, this box has 250GB memory in it, and even though
I've set effective_cache_size to 200GB, I only see 9G of memory being
used. How can I persuade postgres to keep more in memory?
effective_cache_size is a hint to the optimizer on how much data is to be expected to be cached (by the filesystem).
Only shared_buffers (and work_mem) will actually allocate memory from the operating system.
Thomas
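In postgresql.conf terms, the distinction looks roughly like this (values purely illustrative, not recommendations):

```
# Memory PostgreSQL itself allocates for its shared buffer cache:
shared_buffers = 8GB

# Per-operation sort/hash memory, allocated as needed:
work_mem = 64MB

# Not an allocation at all: a planner hint estimating how much data
# the OS filesystem cache is likely to hold on this machine.
effective_cache_size = 200GB
```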
Greetings,
* Thomas Kellerer (spam_eater@gmx.net) wrote:
Stephen Frost schrieb am 30.11.2018 um 14:05:
PG doesn’t know, with complete certainty, that there’s only 3
values.
Would the optimizer consult a check constraint ensuring that?
Not today, I don't believe (haven't looked at the code though, to be
fair), and it seems like it'd be an awful lot of work for a rare
use case that would be better served with just a partial index.
What we will do today is if you ask for a specific value and there's a
CHECK constraint which lists out specific values and that value isn't
among the set, then we'll just skip over the table (One-time filter:
false), thanks to constraint exclusion.
Thanks!
Stephen
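A sketch of that constraint-exclusion behaviour (constraint name illustrative; note that for scans of a plain table this requires constraint_exclusion = on rather than the default 'partition' setting):

```sql
ALTER TABLE alerts_alert
    ADD CONSTRAINT alerts_alert_state_check
    CHECK (state IN ('NEW', 'ACK', 'RSV'));

SET constraint_exclusion = on;

-- The CHECK constraint rules this value out, so the plan is just
-- a Result node with "One-time filter: false" and the table is
-- never read.
EXPLAIN SELECT * FROM alerts_alert WHERE state = 'XXX';
```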
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 30/11/2018 15:33, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Interesting! In my head, for some reason, I'd always assumed a btree index
would break down a char field based on the characters within it. Does that
never happen?
Not sure what you mean by 'break down a char field'.
If I changed this to be an enum field, would != still perform poorly or can
the query optimiser spot that it's an enum and just look for the other
options?
I don't believe we've got any kind of optimization like that today for
enums.
Thanks!
Stephen
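For reference, the enum variant being discussed would look like this; per Stephen's answer, it changes storage and constrains the value set, but not how != is planned:

```sql
-- A native enum: compact on disk, but state != 'RSV' is still
-- planned as an opaque inequality rather than being rewritten into
-- the other two values.
CREATE TYPE alert_state AS ENUM ('NEW', 'ACK', 'RSV');

ALTER TABLE alerts_alert
    ALTER COLUMN state TYPE alert_state
    USING state::alert_state;
```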
On 05/12/2018 14:38, Stephen Frost wrote:
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 30/11/2018 15:33, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Interesting! In my head, for some reason, I'd always assumed a btree index
would break down a char field based on the characters within it. Does that
never happen?
Not sure what you mean by 'break down a char field'.
Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
complicated hierarchy ('N', 'NE', 'A', 'AC', etc).
If I changed this to be an enum field, would != still perform poorly or can
the query optimiser spot that it's an enum and just look for the other
options?
I don't believe we've got any kind of optimization like that today for
enums.
Good to know, I see query optimisers as magic, and postgres often seems
to achieve magic results ;-)
Chris
On 12/05/2018 08:42 AM, Chris Withers wrote:
On 05/12/2018 14:38, Stephen Frost wrote:
Greetings,
* Chris Withers (chris@withers.org) wrote:
On 30/11/2018 15:33, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
On 28/11/2018 22:49, Stephen Frost wrote:
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values. Indexes are
particularly good when you have high-cardinality fields. Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.
Interesting! In my head, for some reason, I'd always assumed a btree index
would break down a char field based on the characters within it. Does that
never happen?
Not sure what you mean by 'break down a char field'.
Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
complicated hierarchy ('N', 'NE', 'A', 'AC', etc).
The b-tree indexes on the legacy RDBMS which I still occasionally fiddle with
perform key prefix compression in a manner similar to what you refer to,
but otherwise that's not how b-trees work.
--
Angular momentum makes the world go 'round.
Greetings,
* Ron (ronljohnsonjr@gmail.com) wrote:
On 12/05/2018 08:42 AM, Chris Withers wrote:
On 05/12/2018 14:38, Stephen Frost wrote:
* Chris Withers (chris@withers.org) wrote:
Interesting! In my head, for some reason, I'd always assumed a btree index
would break down a char field based on the characters within it. Does that
never happen?
Not sure what you mean by 'break down a char field'.
Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
complicated hierarchy ('N', 'NE', 'A', 'AC', etc).
The b-tree indexes on the legacy RDBMS which I still occasionally fiddle with
perform key prefix compression in a manner similar to what you refer to,
but otherwise that's not how b-trees work.
There's been some discussion of prefix compression in PostgreSQL. Even
with that, though, it hardly seems sensible to have an index which has
tons of duplicates comprising most of the index, and a != would still
have to search the index to make sure there aren't any entries which
need to be returned.
Now, maybe once we get skipping scans where we would be able to skip
over a large chunk of the index because it's just tons of duplicates
without having to visit everything along the way, then maybe having this
inefficient index would "just" take up disk space, but why waste that
space?
Thanks!
Stephen