A way to optimize SQL for the last temporally valid row
Hello everyone,
Sorry to bother you but I have a query that is driving me crazy.
I need to retrieve the last temporally valid record for each value of a
specific parameter.
First some data:
Linux Rocky 8.10 environment, minimal installation (on a KVM VM under
Fedora 40).
PostgreSQL 16.3, installed following the official PostgreSQL guide.
effective_cache_size = '1000MB';
shared_buffers = '500MB';
work_mem = '16MB';
The changes are deliberately minimal so that anyone can reproduce the
problem.
Table script:
CREATE TABLE test_table
(
    pk_id int NOT NULL,
    integer_field_1 int,
    integer_field_2 int,
    datetime_field_1 timestamp,
    primary key (pk_id)
);
-- insert 4M records
insert into test_table(pk_id) select generate_series(1,4000000,1);
-- now set some random data, distributed between specific ranges (as in
my production table)
update test_table set
datetime_field_1 = timestamp '2000-01-01 00:00:00' + random() *
(timestamp '2024-05-31 23:59:59' - timestamp '2000-01-01 00:00:00'),
integer_field_1 = floor(random() * (6-1+1) + 1)::int,
integer_field_2 = floor(random() * (200000-1+1) + 1)::int;
-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table(integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table(datetime_field_1
desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table(integer_field_2);
--vacuum
vacuum full test_table;
Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized(
select xx from (
select LAST_VALUE(pk_id) over (partition by integer_field_2 order by
datetime_field_1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING) xx
from test_table
where integer_field_1 = 1
and datetime_field_1 <= CURRENT_TIMESTAMP
) ww group by ww.xx
),
last_row_per_ids as (
select tt.* from last_table_ids lt
inner join test_table tt on (tt.pk_id = lt.xx)
)
select * /* or count(*) */ from last_row_per_ids;
This query, on my PC, takes 46 seconds!!!
I was expecting about 2-3 seconds (in line with my other queries on this
table), but it seems that the xtest_table_datetime_field_1 index is not
being used.
Do you think there is a way to optimize the query?
Thanks so much for the support,
Agharta
On Thu, Jun 27, 2024 at 11:20 AM agharta82@gmail.com <agharta82@gmail.com>
wrote:
[snip]
Off-topic: save some resources by vacuuming before creating indices.
Hi,
You are right. A too-hasty copy-paste on my part :-)
I take this opportunity to add a NOT insignificant detail.
Before executing the select query I clear the cache:
systemctl stop postgresql-16 && sync && echo 3 >
/proc/sys/vm/drop_caches && systemctl start postgresql-16
I need to get a performance result even if data is not in cache.
My best regards,
Agharta
On 27/06/24 5:27 PM, Ron Johnson wrote:
[snip]
On Thursday, June 27, 2024, agharta82@gmail.com <agharta82@gmail.com> wrote:
[snip]
Do you think there is a way to optimize the query?
Write a lateral subquery that picks the first row of a descending-ordered
query. Using GROUP BY to select ranked rows is both semantically wrong and
potentially optimization-blocking.
I’m going by the general query form and the “last row” aspect of the
question. I haven’t gone and confirmed your specific query can benefit
from this approach. The window expression does give me pause.
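As a minimal sketch (untested, and assuming the test_table schema from the
original post), the lateral approach could look like this:

```sql
-- For each distinct integer_field_2, fetch its single most recent row
-- via a correlated LATERAL subquery with ORDER BY ... LIMIT 1
SELECT latest.*
FROM (SELECT DISTINCT integer_field_2
      FROM test_table
      WHERE integer_field_1 = 1) ids
CROSS JOIN LATERAL (
    SELECT tt.*
    FROM test_table tt
    WHERE tt.integer_field_2 = ids.integer_field_2
      AND tt.integer_field_1 = 1
      AND tt.datetime_field_1 <= CURRENT_TIMESTAMP
    ORDER BY tt.datetime_field_1 DESC
    LIMIT 1
) latest;
```

Whether this wins depends on how many distinct integer_field_2 values
there are and on the available indexes; check with EXPLAIN ANALYZE.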
David J.
On Thu, Jun 27, 2024 at 11:33 AM agharta82@gmail.com <agharta82@gmail.com>
wrote:
[snip]
In addition to David's suggestion, consider the pg_prewarm extension:
https://www.postgresql.org/docs/current/pgprewarm.html
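For example, one possible usage (pg_prewarm loads a relation's blocks into
shared buffers, so a cold-cache restart can be followed by an explicit warm-up):

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the table and the timestamp index into shared buffers;
-- each call returns the number of blocks read
SELECT pg_prewarm('test_table');
SELECT pg_prewarm('xtest_table_datetime_field_1');
```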
Hi,
Thanks for your reply.
About the syntax you're right, but I couldn't come up with anything
better :(((
I'm here for that too, to solve the problem in a fancy way, with your
great support.
In practice, I need to get back a dataset with the last association (the
most recent record by datetime) for every distinct value of
integer_field_2, filtered on integer_field_1 = 1.
As said in another reply, the query needs to be performant even if data
is not in cache (systemctl stop postgresql-16 && sync && echo 3 >
/proc/sys/vm/drop_caches && systemctl start postgresql-16).
Many thanks for your support.
Agharta
On 27/06/24 5:33 PM, David G. Johnston wrote:
[snip]
On Fri, 28 Jun 2024, 3:20 am agharta82@gmail.com, <agharta82@gmail.com>
wrote:
[snip]
This query, on my PC, takes 46 seconds!!!
(Away from laptop and using my phone)
Something like:
select distinct on (integer_field_2) * from test_table where
integer_field_1 = 1 and datetime_field_1 <= CURRENT_TIMESTAMP order by
integer_field_2,datetime_field_1 desc;
Might run a bit faster. However if it's slow due to I/O then maybe not
much faster. Your version took about 5 seconds on my phone and my version
ran in 1.5 seconds.
It's difficult for me to check the results match with each query from my
phone. A quick scan of the first 10 or so records looked good.
If the updated query is still too slow on cold cache then faster disks
might be needed.
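As a further (hypothetical, untested) tweak, a composite index whose
leading column matches the equality filter and whose remaining columns
match the DISTINCT ON / ORDER BY clause could let the planner read the
rows pre-sorted; idx_test_table_if1_if2_dt is an illustrative name:

```sql
-- Leading column = equality filter; trailing columns = DISTINCT ON key
-- and sort order, so an index scan can skip the explicit sort
CREATE INDEX idx_test_table_if1_if2_dt
    ON test_table (integer_field_1, integer_field_2, datetime_field_1 DESC);
```

Whether the planner actually uses it depends on selectivity; worth
verifying with EXPLAIN (ANALYZE, BUFFERS).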
David
HOO-HA! This is HUGE!
Only 2.2 seconds on my data!!!! Amazing!
DISTINCT ON (field) followed by "*" is a hidden gem!
Thank you so much and thanks to everyone who helped me! Thank you very
much!!
Cheers,
Agharta
On 27/06/24 6:16 PM, David Rowley wrote:
[snip]
not really in direct response to this conversation, but is there any reason
on the face of the planet why read receipts need to be sent to every single
recipient of the mailing list?
just saying,
richard
---- On Fri, 28 Jun 2024 03:20:26 -0400 <agharta82@gmail.com> wrote ---
[snip]
Sorry, wrong settings in my PC mail client.
Sorry again.
Agharta
On Fri 28 Jun 2024, 19:51, Richard Welty <rwelty@salesium.com> wrote:
[snip]