Very newbie question

Started by Олег Самойловover 2 years ago12 messagesgeneral

splarv@ya.ru

over 2 years ago

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Toomas Kristin

toomas.kristin@gmail.com

over 2 years ago

In reply to: Олег Самойлов (#1)

Re: Very newbie question

There is no reason to use index. The query has neither WHERE nor ORDER BY clause.

Toomas

Show quoted text

On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Francisco Olarte

folarte@peoplecall.com

over 2 years ago

In reply to: Олег Самойлов (#1)

Re: Very newbie question

On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

You should send an explain of your query, and your table and index definition.

Unless you are tied to do this in one query, and assuming you have an
index by "created_at", I normally do these kind of things by:
1.- Get list of partitions, sort oldest first.
2.- do "select created_at from $partition order by created at desc
limit 1", which normally is just an index lookup, and compare
client-side.
You can do the date math in the database too. Also, rhs of the
comparison seems to be date, if created_at is timestamp you may be
blocking the optimizer for some things.

Francisco Olarte.

Олег Самойлов

splarv@ya.ru

over 2 years ago

In reply to: Francisco Olarte (#3)

Re: Very newbie question

Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here more elegant way? Any rewriting the query, any creating an index are permitted.

Show quoted text

23 окт. 2023 г., в 18:25, Francisco Olarte <folarte@peoplecall.com> написал(а):

On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

You should send an explain of your query, and your table and index definition.

Unless you are tied to do this in one query, and assuming you have an
index by "created_at", I normally do these kind of things by:
1.- Get list of partitions, sort oldest first.
2.- do "select created_at from $partition order by created at desc
limit 1", which normally is just an index lookup, and compare
client-side.
You can do the date math in the database too. Also, rhs of the
comparison seems to be date, if created_at is timestamp you may be
blocking the optimizer for some things.

Francisco Olarte.

Олег Самойлов

splarv@ya.ru

over 2 years ago

In reply to: Toomas Kristin (#2)

Re: Very newbie question

This is not correct. An index can accelerate, for instance, max(). Here is also not WHERE or ORDER BY, but index is useful:
select max(created_at) from delivery;

Show quoted text

23 окт. 2023 г., в 18:23, Toomas <toomas.kristin@gmail.com> написал(а):

There is no reason to use index. The query has neither WHERE nor ORDER BY clause.

Toomas

On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Ron

ronljohnsonjr@gmail.com

over 2 years ago

In reply to: Олег Самойлов (#1)

Re: Very newbie question

On 10/23/23 10:13, Олег Самойлов wrote:

Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?

Maybe:

SELECT DISTINCT id/10000000 as partition
FROM delivery
WHERE max(created_at) < CURRENT_DATE - '3 month'::interval;

I haven't tried it, though.

--
Born in Arizona, moved to Babylonia.

Francisco Olarte

folarte@peoplecall.com

over 2 years ago

In reply to: Олег Самойлов (#4)

Re: Very newbie question

On Mon, 23 Oct 2023 at 17:42, Олег Самойлов <splarv@ya.ru> wrote:

Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here more elegant way? Any rewriting the query, any creating an index are permitted.

1.- You do not scan all partitions. Had you not top-posted it coudl
easily be noted:

1.- Get list of partitions, sort oldest first.

This means you get list of partitions, which is just a query, sorting
them, even if you have to do it client side should be trivial unless
you use really weird schemes, and you could sort them by your ranges
in the query.

Also, getting them oldest first means you evaluate the age-query
before archiving, once for each archivable partition plus one extra,
an overhead which should be dwarfed by any non-trivial archival, even
a rename or drop index would probably be longer.

And last. We have different concepts for elegance. IMO by saying a
solution is "too simple" not having stated "I want a complex tricky
solution" disqualifies you a bit. For real problems, no solution is
too simple ( it may be a language problem, or you may have hidden
constraints, but that needs to be specified ).

feel free to exec &> /dev/null.

Francisco Olarte.

Олег Самойлов

splarv@ya.ru

over 2 years ago

In reply to: Олег Самойлов (#1)

Re: Very newbie question

Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is solution.

Original query was:

23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

And I was not able to accelerate it by any index, works 5 minutes. Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very quick, less then second.

Olivier Gautherot

ogautherot@gautherot.net

over 2 years ago

In reply to: Олег Самойлов (#8)

Re: Very newbie question

Hi,

El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:

Okey, I see no one was be able to solve this problem. But I could. May be
for someone this will be useful too. There is solution.

Original query was:

23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):

SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;

And I was not able to accelerate it by any index, works 5 minutes. Now
query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated by two
btree index: on id and created_at. Works very quick, less then second.

If you happen to rework your design, consider partitioning on (created_at),
as it may simplify your maintenance.

The reason why you couldn't improve the performance with an index is due to
the calls of min() and max() that force to evaluate every single row. You
may consider using a computed index in this case.

Your fast solution will work as long as you don't have missing sequences
(like deleted rows).

Regards
Olivier

Show quoted text

#10

Peter J. Holzer

hjp-pgsql@hjp.at

over 2 years ago

In reply to: Olivier Gautherot (#9)

Re: Very newbie question

On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:

El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could. May be
for someone this will be useful too. There is solution.

[...]

Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated by two
btree index: on id and created_at. Works very quick, less then second.

[...]

Your fast solution will work as long as you don't have missing sequences (like
deleted rows).

Why do you think this would break with missing sequence numbers?

#11

Olivier Gautherot

ogautherot@gautherot.net

over 2 years ago

In reply to: Peter J. Holzer (#10)

Re: Very newbie question

Hi,

El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:

On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:

El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could.

May be

for someone this will be useful too. There is solution.

[...]

Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where

n*10000000 <=id

and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

Return the same (number of partition need to archive), accelerated

by two

btree index: on id and created_at. Works very quick, less then

second.
[...]

Your fast solution will work as long as you don't have missing sequences

(like

deleted rows).

Why do you think this would break with missing sequence numbers?

hp

In the suggested query, the return value contains a list of sequential
numbers from a min to a max - they seem to be markers of the partitions.
Let's assume that a complete partition is deleted in the middle: its index
will still be returned by the query, although it doesn't exist any more in
the table. It can be an issue if the list of indexes is actually used and
partitions are not deleted sequentially.

My cent worth to ensure data integrity.

Regards
Olivier Gautherot

Show quoted text

#12

Peter J. Holzer

hjp-pgsql@hjp.at

over 2 years ago

In reply to: Olivier Gautherot (#11)

Re: Very newbie question

On 2023-10-26 11:56:56 +0200, Olivier Gautherot wrote:

El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:

El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could.
May be
for someone this will be useful too. There is solution.

[...]

Now query is:

SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000

<=id

and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;

I just realized that this query is mangled. I'm going to assume that it
should have been something like

with part_numbers as (
SELECT generate_series(min(id)/100, max(id)/100) as n
from delivery
)
select * from part_numbers
WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval;

[...]

Your fast solution will work as long as you don't have missing sequences

(like

deleted rows).

Why do you think this would break with missing sequence numbers?

In the suggested query, the return value contains a list of sequential numbers
from a min to a max - they seem to be markers of the partitions. Let's assume
that a complete partition is deleted in the middle: its index will still be
returned by the query, although it doesn't exist any more in the table.

I don't think it will. While the generate_series() will produce the
partition number, the where clause will not find any matching rows, so
the query will not return it.

E.g. (this table isn't partitioned, but that shouldn't affect the
result, also I'll reduce the "partition size" to 100 to make it more
readable):

create table delivery (id int, created_at date);
insert into delivery(200, '2000-01-01');
insert into delivery values(200, '2000-01-01');
insert into delivery values(299, '2000-12-01');
insert into delivery values(412, '2002-02-01');
insert into delivery values(439, '2002-03-01');
insert into delivery values(501, '2023-01-01');
insert into delivery values(555, now());

Note that there are no records in "partition" 3, and "partition" 5
contains current data, so we should get only "partition numbers" 2 and
4:

╔═══╗
║ n ║
╟───╢
║ 2 ║
║ 4 ║
╚═══╝
(2 rows)

Looks ok to me.