Faster distinct query?

Started by Israel Brewsterover 4 years ago34 messagesgeneral
Jump to latest
#1Israel Brewster
ijbrewster@alaska.edu

I was wondering if there was any way to improve the performance of this query:

SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;

The explain execution plan can be found here:
https://explain.depesz.com/s/mtxB#html <https://explain.depesz.com/s/mtxB#html&gt;

and it looks pretty straight forward. It does an index_only scan, followed by an aggregate, to produce a result that is a list of stations along with a list of channels associated with each (there can be anywhere from 1 to 3 channels associated with each station). This query takes around 5 minutes to run.

To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).

This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Israel Brewster (#1)
Re: Faster distinct query?

On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster <ijbrewster@alaska.edu>
wrote:

To work around the issue, I created a materialized view that I can update
periodically, and of course I can query said view in no time flat. However,
I’m concerned that as the dataset grows, the time it takes to refresh the
view will also grow (correct me if I am wrong there).

I'd probably turn that index into a foreign key that just ensures that
every (station,channel) that appears in the data table also appears on the
lookup table. Grouping and array-ifying the lookup table would be
trivial. Either modify the application code or add a trigger to populate
the lookup table as needed.

The parentheses around channel in "array_agg(distinct(channel))" are
unnecessary - you are invoking composite-type syntax, which is ignored in
the single column case unless you write the optional ROW keyword, i.e.,
distinct ROW(channel)
David J.

#3Michael Lewis
mlewis@entrata.com
In reply to: Israel Brewster (#1)
Re: Faster distinct query?

In the future, please share the plan returned by explain analyze, and some
data about how many rows in the involved tables, what type of system you
are running it on, any changes from default configs, etc.

How does the below work? It should do a very simple index scan only, then
aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

If there is correlation between station & channel, then you might look at
creating a multivariate statistics object and analyzing the table so the
planner can make better choices, knowing that channel is dependent on
station perhaps. I wouldn't necessarily think that it would help this
query, but perhaps others. Also, you might try creating only dependencies,
only ndistinct type, or some combination other than all 3 types.

ref- https://www.postgresql.org/docs/current/sql-createstatistics.html

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Michael Lewis (#3)
Re: Faster distinct query?

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:

In the future, please share the plan returned by explain analyze, and some
data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is
what the OP did. Reading explain output in email has its own challenges,
and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only, then
aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be
identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look at
creating a multivariate statistics object and analyzing the table so the
planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that. The aggregation path
might vary though it seems like that shouldn't be the case here.

David J.

#5Israel Brewster
ijbrewster@alaska.edu
In reply to: David G. Johnston (#2)
Re: Faster distinct query?

On Sep 22, 2021, at 12:20 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:
To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).

I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table. Grouping and array-ifying the lookup table would be trivial. Either modify the application code or add a trigger to populate the lookup table as needed.

Makes sense. I was actually considering this approach (albeit without the foreign key - that’s a nice additional safety measure), but was concerned about the overhead that adding said trigger would have on inserts - thus my thought to try the materialized view. As a reference, this database is receiving 1Hz data from around 170 stations, with up to three channels of data per station. So something like 350-500 inserts per second, although the data is “grouped” into 10 minute batches. I’ll give it another look.

The parentheses around channel in "array_agg(distinct(channel))" are unnecessary - you are invoking composite-type syntax, which is ignored in the single column case unless you write the optional ROW keyword, i.e., distinct ROW(channel)

Good information, thanks!

David J.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#4)
Re: Faster distinct query?

"David G. Johnston" <david.g.johnston@gmail.com> writes:

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that.

The "index-only" scan is reported to do 86m heap fetches along the
way to returning 812m rows, so the data is apparently pretty dirty.
It's possible that a preliminary VACUUM to get page-all-visible hint
bits set would be a net win.

regards, tom lane

#7Ryan Booz
ryan@timescale.com
In reply to: David G. Johnston (#4)
Re: Faster distinct query?

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted
as trying to "sell" something), but feels like an opportunity to talk about
DISTINCT queries and opportunities. Because you have that index,
Timescale 2.3 added a "Skip Scan" query planner node that works on regular
BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
data at all). In this case, your distinct query would likely run in a few
milliseconds based on the counts you mention (170 stations, 3 channels per
station), and then the outer aggregation would do the GROUP BY. So, you
**could** add the TimescaleDB extension to your database (or a copy of) and
give it a try. You don't actually need to use any TimescaleDB features
otherwise.

- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
(for now) and what we did to overcome it:
https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally
support but hasn't made forward progress yet:
https://commitfest.postgresql.org/19/1741/

Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

Show quoted text

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:

In the future, please share the plan returned by explain analyze, and
some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is
what the OP did. Reading explain output in email has its own challenges,
and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only, then
aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be
identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look at
creating a multivariate statistics object and analyzing the table so the
planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that. The aggregation path
might vary though it seems like that shouldn't be the case here.

David J.

#8Israel Brewster
ijbrewster@alaska.edu
In reply to: Tom Lane (#6)
Re: Faster distinct query?

On Sep 22, 2021, at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"David G. Johnston" <david.g.johnston@gmail.com> writes:

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that.

The "index-only" scan is reported to do 86m heap fetches along the
way to returning 812m rows, so the data is apparently pretty dirty.
It's possible that a preliminary VACUUM to get page-all-visible hint
bits set would be a net win.

I do have autovaccum turned on, but perhaps I need to do a manual? The initial population of the database was accomplished via logical replication from a different database cluster (needed to move this database to more dedicated hardware), so perhaps that left the database in a state that autovaccum doesn’t address? Or perhaps my autovaccum settings aren’t kosher - I haven’t adjusted that portion of the config any.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

Show quoted text

regards, tom lane

#9Israel Brewster
ijbrewster@alaska.edu
In reply to: Ryan Booz (#7)
Re: Faster distinct query?

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.

I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

Show quoted text

A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/&gt;
Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/ <https://commitfest.postgresql.org/19/1741/&gt;
Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>> wrote:
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com <mailto:mlewis@entrata.com>> wrote:
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only, then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

David J.

#10Ryan Booz
ryan@timescale.com
In reply to: Israel Brewster (#9)
Re: Faster distinct query?

Cool. I'd be interested to see the explain on it if you ever try it again.
On that cardinality, I'd expect it to be really fast, so I'm interested to
see if the (SkipScan) nodes were actually used.

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster@alaska.edu>
wrote:

Show quoted text

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted
as trying to "sell" something), but feels like an opportunity to talk about
DISTINCT queries and opportunities. Because you have that index,
Timescale 2.3 added a "Skip Scan" query planner node that works on regular
BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
data at all). In this case, your distinct query would likely run in a few
milliseconds based on the counts you mention (170 stations, 3 channels per
station), and then the outer aggregation would do the GROUP BY. So, you
**could** add the TimescaleDB extension to your database (or a copy of) and
give it a try. You don't actually need to use any TimescaleDB features
otherwise.

I had actually already done that, as I was considering, in spite of past
negative experiences with timescaledb, experimenting with it on this DB to
see if it worked any better with this data. Out of curiosity, I tried
removing the timescaledb extension, whereupon the query in question took
roughly twice as long. So you are right that installing timescaledb speeds
things up, even when not using any timescaledb specific functions. So that
was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
(for now) and what we did to overcome it:
https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally
support but hasn't made forward progress yet:
https://commitfest.postgresql.org/19/1741/

Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:

In the future, please share the plan returned by explain analyze, and
some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which
is what the OP did. Reading explain output in email has its own
challenges, and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only,
then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be
identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look
at creating a multivariate statistics object and analyzing the table so the
planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that. The aggregation path
might vary though it seems like that shouldn't be the case here.

David J.

#11Israel Brewster
ijbrewster@alaska.edu
In reply to: Ryan Booz (#10)
Re: Faster distinct query?

On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com> wrote:

Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.

With timescaledb extension installed, the explain is what I posted in the original message (https://explain.depesz.com/s/mtxB#html <https://explain.depesz.com/s/mtxB#html&gt;). Without timescaledb installed, the explain looks the same, except it takes twice as long to run.

Unless I missed something in your message, i.e. some sort of tweak to the query to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

Show quoted text

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com <mailto:ryan@timescale.com>> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.

I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/&gt;
Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/ <https://commitfest.postgresql.org/19/1741/&gt;
Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>> wrote:
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com <mailto:mlewis@entrata.com>> wrote:
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only, then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

David J.

#12Ryan Booz
ryan@timescale.com
In reply to: Israel Brewster (#11)
Re: Faster distinct query?

Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as
one of the execution nodes. I also realize I was making a few assumptions
about your data, are channels shared among stations, or are all channels
unique (like an ID) per station? That would impact the index and approach.

Something like:

station | channel
----------|-----------
1 1
1 2
2 3
2 4

or:
station | channel
----------|-----------
1 1
1 2
2 1
2 2

On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrewster@alaska.edu>
wrote:

Show quoted text

On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com> wrote:

Cool. I'd be interested to see the explain on it if you ever try it again.
On that cardinality, I'd expect it to be really fast, so I'm interested to
see if the (SkipScan) nodes were actually used.

With timescaledb extension installed, the explain is what I posted in the
original message (https://explain.depesz.com/s/mtxB#html). Without
timescaledb installed, the explain looks the same, except it takes twice as
long to run.

Unless I missed something in your message, i.e. some sort of tweak to the
query to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster@alaska.edu>
wrote:

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be
interpreted as trying to "sell" something), but feels like an opportunity
to talk about DISTINCT queries and opportunities. Because you have that
index, Timescale 2.3 added a "Skip Scan" query planner node that works on
regular BTree indexes (it doesn't have to be time-series/TimescaleDB
Hypertable data at all). In this case, your distinct query would likely run
in a few milliseconds based on the counts you mention (170 stations, 3
channels per station), and then the outer aggregation would do the GROUP
BY. So, you **could** add the TimescaleDB extension to your database (or a
copy of) and give it a try. You don't actually need to use any TimescaleDB
features otherwise.

I had actually already done that, as I was considering, in spite of past
negative experiences with timescaledb, experimenting with it on this DB to
see if it worked any better with this data. Out of curiosity, I tried
removing the timescaledb extension, whereupon the query in question took
roughly twice as long. So you are right that installing timescaledb speeds
things up, even when not using any timescaledb specific functions. So that
was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
(for now) and what we did to overcome it:
https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally
support but hasn't made forward progress yet:
https://commitfest.postgresql.org/19/1741/

Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com>
wrote:

In the future, please share the plan returned by explain analyze, and
some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which
is what the OP did. Reading explain output in email has its own
challenges, and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only,
then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be
identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look
at creating a multivariate statistics object and analyzing the table so the
planner can make better choices

There is no where clause so I'm doubtful there is much to be gained
going down this path. The Index-Only scan seems like an optimal way to
obtain this data and the existing query already does that. The aggregation
path might vary though it seems like that shouldn't be the case here.

David J.

#13Israel Brewster
ijbrewster@alaska.edu
In reply to: Ryan Booz (#12)
Re: Faster distinct query?

On Sep 22, 2021, at 2:05 PM, Ryan Booz <ryan@timescale.com> wrote:

Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.

Ok, that may be a good point: “channel” is currently a varchar column, containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of possible channels that I am currently aware of, which are shared among stations - most stations have a ‘BHZ’ channel, for example. That would be fairly simple to normalize out if that would help.

Show quoted text

Something like:

station | channel
----------|-----------
1 1
1 2
2 3
2 4

or:
station | channel
----------|-----------
1 1
1 2
2 1
2 2

On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:

On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com <mailto:ryan@timescale.com>> wrote:

Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.

With timescaledb extension installed, the explain is what I posted in the original message (https://explain.depesz.com/s/mtxB#html <https://explain.depesz.com/s/mtxB#html&gt;). Without timescaledb installed, the explain looks the same, except it takes twice as long to run.

Unless I missed something in your message, i.e. some sort of tweak to the query to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com <mailto:ryan@timescale.com>> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.

I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/&gt;
Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/ <https://commitfest.postgresql.org/19/1741/&gt;
Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>> wrote:
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com <mailto:mlewis@entrata.com>> wrote:
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.

How does the below work? It should do a very simple index scan only, then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.

If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

David J.

#14Michael Lewis
mlewis@entrata.com
In reply to: Tom Lane (#6)
Re: Faster distinct query?

On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"David G. Johnston" <david.g.johnston@gmail.com> writes:

There is no where clause so I'm doubtful there is much to be gained going
down this path. The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that.

The "index-only" scan is reported to do 86m heap fetches along the
way to returning 812m rows, so the data is apparently pretty dirty.

Do you say that because you would expect many more than 10 tuples per page?

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Lewis (#14)
Re: Faster distinct query?

Michael Lewis <mlewis@entrata.com> writes:

On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The "index-only" scan is reported to do 86m heap fetches along the
way to returning 812m rows, so the data is apparently pretty dirty.

Do you say that because you would expect many more than 10 tuples per page?

No, I say that because if the table were entirely all-visible, there
would have been *zero* heap fetches. As it stands, it's reasonable
to suspect that a pretty sizable fraction of the index-only scan's
runtime went into random-access heap fetches made to verify
visibility of individual rows.

(You will, of course, never get to exactly zero heap fetches in an
IOS unless the table data is quite static. But one dirty page
out of every ten seems like there were a lot of recent changes.
A VACUUM to clean that up might be well worthwhile.)

regards, tom lane

#16David Rowley
dgrowleyml@gmail.com
In reply to: Michael Lewis (#3)
Re: Faster distinct query?

On Thu, 23 Sept 2021 at 08:21, Michael Lewis <mlewis@entrata.com> wrote:

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Since the subquery is grouping by station, channel, then there's no
need for the DISTINCT in the aggregate function. Removing that should
remove some tuplestore overhead from the aggregate node.

David

#17David Rowley
dgrowleyml@gmail.com
In reply to: David G. Johnston (#4)
Re: Faster distinct query?

On Thu, 23 Sept 2021 at 08:27, David G. Johnston
<david.g.johnston@gmail.com> wrote:

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:

If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

ndistinct extended statistics would be used to estimate the number of
groups in the GROUP BY clause for the version of the query that
performs GROUP BY station, channel. We've not seen the EXPLAIN
ANALYZE for that query, so don't know if there's any use for extended
statistics there. However, if the planner was to think there were
more groups than there actually are, then it would be less inclined to
do parallelise the GROUP BY. I think writing the query in such a way
that allows it to be parallelised is likely going to result in some
quite good performance improvements. i.e:

select station, array_agg(channel) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

which is just the same as Michael's version but without DISTINCT.
Also, Tom's point about lots of heap fetches is going to count for
quite a bit too, especially so if I/O plays a large part in the total
query time.

David

#18Israel Brewster
ijbrewster@alaska.edu
In reply to: David Rowley (#17)
Re: Faster distinct query?

On Sep 22, 2021, at 5:10 PM, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 23 Sept 2021 at 08:27, David G. Johnston
<david.g.johnston@gmail.com> wrote:

On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:

If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

ndistinct extended statistics would be used to estimate the number of
groups in the GROUP BY clause for the version of the query that
performs GROUP BY station, channel. We've not seen the EXPLAIN
ANALYZE for that query, so don't know if there's any use for extended
statistics there. However, if the planner was to think there were
more groups than there actually are, then it would be less inclined to
do parallelise the GROUP BY. I think writing the query in such a way
that allows it to be parallelised is likely going to result in some
quite good performance improvements. i.e:

select station, array_agg(channel) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf <https://explain.depesz.com/s/L5Bf&gt; It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!

Still not quite fast enough for real-time queries, but certainly fast enough to keep a materialized view updated.

And this is why I love postgresql and this community - when something isn’t working as well as I would like, there is usually a way to improve it drastically :-)

Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

Show quoted text

which is just the same as Michael's version but without DISTINCT.
Also, Tom's point about lots of heap fetches is going to count for
quite a bit too, especially so if I/O plays a large part in the total
query time.

David

#19David G. Johnston
david.g.johnston@gmail.com
In reply to: David Rowley (#17)
Re: Faster distinct query?

On Wednesday, September 22, 2021, David Rowley <dgrowleyml@gmail.com> wrote:

I think writing the query in such a way
that allows it to be parallelised is likely going to result in some
quite good performance improvements. i.e:

Agreed, though if the query author needs to do that here we’ve violated the
spirit of the declarative SQL language. At first blush nothing about the
original query seems like it should be preventing parallelism. Each worker
builds its own distinct array then the final concatenation is made distinct.

David J.

#20David Rowley
dgrowleyml@gmail.com
In reply to: David G. Johnston (#19)
Re: Faster distinct query?

On Thu, 23 Sept 2021 at 13:28, David G. Johnston
<david.g.johnston@gmail.com> wrote:

Agreed, though if the query author needs to do that here we’ve violated the spirit of the declarative SQL language. At first blush nothing about the original query seems like it should be preventing parallelism. Each worker builds its own distinct array then the final concatenation is made distinct.

We don't parallelise DISTINCT / ORDER BY aggregates.

David

#21David Rowley
dgrowleyml@gmail.com
In reply to: Israel Brewster (#18)
#22Ryan Booz
ryan@timescale.com
In reply to: Israel Brewster (#13)
#23Mladen Gogala
gogala.mladen@gmail.com
In reply to: David G. Johnston (#2)
In reply to: Israel Brewster (#1)
#25Ryan Booz
ryan@timescale.com
In reply to: hubert depesz lubaczewski (#24)
#26Israel Brewster
ijbrewster@alaska.edu
In reply to: hubert depesz lubaczewski (#24)
#27Israel Brewster
ijbrewster@alaska.edu
In reply to: Ryan Booz (#25)
#28Rob Sargent
robjsargent@gmail.com
In reply to: Israel Brewster (#27)
#29Israel Brewster
ijbrewster@alaska.edu
In reply to: Rob Sargent (#28)
#30Rob Sargent
robjsargent@gmail.com
In reply to: Israel Brewster (#29)
#31Geoff Winkless
pgsqladmin@geoff.dj
In reply to: Israel Brewster (#1)
#32Israel Brewster
ijbrewster@alaska.edu
In reply to: Geoff Winkless (#31)
#33Michael Lewis
mlewis@entrata.com
In reply to: Israel Brewster (#32)
#34Israel Brewster
ijbrewster@alaska.edu
In reply to: Michael Lewis (#33)