Very newbie question
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
There is no reason to use index. The query has neither WHERE nor ORDER BY clause.
Toomas
Show quoted text
On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
You should send an explain of your query, and your table and index definition.
Unless you are tied to do this in one query, and assuming you have an
index by "created_at", I normally do these kind of things by:
1.- Get list of partitions, sort oldest first.
2.- do "select created_at from $partition order by created at desc
limit 1", which normally is just an index lookup, and compare
client-side.
You can do the date math in the database too. Also, rhs of the
comparison seems to be date, if created_at is timestamp you may be
blocking the optimizer for some things.
Francisco Olarte.
Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here more elegant way? Any rewriting the query, any creating an index are permitted.
Show quoted text
23 окт. 2023 г., в 18:25, Francisco Olarte <folarte@peoplecall.com> написал(а):
On Mon, 23 Oct 2023 at 17:14, Олег Самойлов <splarv@ya.ru> wrote:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?You should send an explain of your query, and your table and index definition.
Unless you are tied to do this in one query, and assuming you have an
index by "created_at", I normally do these kind of things by:
1.- Get list of partitions, sort oldest first.
2.- do "select created_at from $partition order by created at desc
limit 1", which normally is just an index lookup, and compare
client-side.
You can do the date math in the database too. Also, rhs of the
comparison seems to be date, if created_at is timestamp you may be
blocking the optimizer for some things.Francisco Olarte.
This is not correct. An index can accelerate, for instance, max(). Here is also not WHERE or ORDER BY, but index is useful:
select max(created_at) from delivery;
Show quoted text
23 окт. 2023 г., в 18:23, Toomas <toomas.kristin@gmail.com> написал(а):
There is no reason to use index. The query has neither WHERE nor ORDER BY clause.
Toomas
On 23. Oct 2023, at 18:13, Олег Самойлов <splarv@ya.ru> wrote:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
On 10/23/23 10:13, Олег Самойлов wrote:
Back pardon, but I have a very newbie question. I have a partitioned table, partitioned by primary bigint key, size of partition 10000000. I need to get the number of partition which need to archive, which has all rows are olden then 3 month. Here is query:
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;The 'id/10000000 as partition' is a number of the partition, it later will be used inside the partition name.
The query runs long by sequence scan. Has anyone any ideas how to rewrite query so it will use any index?
Maybe:
SELECT DISTINCT id/10000000 as partition
FROM delivery
WHERE max(created_at) < CURRENT_DATE - '3 month'::interval;
I haven't tried it, though.
--
Born in Arizona, moved to Babylonia.
On Mon, 23 Oct 2023 at 17:42, Олег Самойлов <splarv@ya.ru> wrote:
Well, get list of partitions and later scan one by one all 100 partitions is too simple. :) I am interesting is here more elegant way? Any rewriting the query, any creating an index are permitted.
1.- You do not scan all partitions. Had you not top-posted it coudl
easily be noted:
1.- Get list of partitions, sort oldest first.
This means you get list of partitions, which is just a query, sorting
them, even if you have to do it client side should be trivial unless
you use really weird schemes, and you could sort them by your ranges
in the query.
Also, getting them oldest first means you evaluate the age-query
before archiving, once for each archivable partition plus one extra,
an overhead which should be dwarfed by any non-trivial archival, even
a rename or drop index would probably be longer.
And last. We have different concepts for elegance. IMO by saying a
solution is "too simple" not having stated "I want a complex tricky
solution" disqualifies you a bit. For real problems, no solution is
too simple ( it may be a language problem, or you may have hidden
constraints, but that needs to be specified ).
feel free to exec &> /dev/null.
Francisco Olarte.
Okey, I see no one was be able to solve this problem. But I could. May be for someone this will be useful too. There is solution.
Original query was:
23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;
And I was not able to accelerate it by any index, works 5 minutes. Now query is:
SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;
Return the same (number of partition need to archive), accelerated by two btree index: on id and created_at. Works very quick, less then second.
Hi,
El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could. May be
for someone this will be useful too. There is solution.Original query was:
23 окт. 2023 г., в 18:13, Олег Самойлов <splarv@ya.ru> написал(а):
SELECT id/10000000 as partition
FROM delivery
GROUP BY partition
HAVING max(created_at) < CURRENT_DATE - '3 month'::interval;And I was not able to accelerate it by any index, works 5 minutes. Now
query is:SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;Return the same (number of partition need to archive), accelerated by two
btree index: on id and created_at. Works very quick, less then second.
If you happen to rework your design, consider partitioning on (created_at),
as it may simplify your maintenance.
The reason why you couldn't improve the performance with an index is due to
the calls of min() and max() that force to evaluate every single row. You
may consider using a computed index in this case.
Your fast solution will work as long as you don't have missing sequences
(like deleted rows).
Regards
Olivier
Show quoted text
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:
El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could. May be
for someone this will be useful too. There is solution.
[...]
Now query is:
SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000 <=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;Return the same (number of partition need to archive), accelerated by two
btree index: on id and created_at. Works very quick, less then second.
[...]
Your fast solution will work as long as you don't have missing sequences (like
deleted rows).
Why do you think this would break with missing sequence numbers?
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Hi,
El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:
El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could.May be
for someone this will be useful too. There is solution.
[...]
Now query is:
SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery wheren*10000000 <=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;Return the same (number of partition need to archive), accelerated
by two
btree index: on id and created_at. Works very quick, less then
second.
[...]Your fast solution will work as long as you don't have missing sequences
(like
deleted rows).
Why do you think this would break with missing sequence numbers?
hp
In the suggested query, the return value contains a list of sequential
numbers from a min to a max - they seem to be markers of the partitions.
Let's assume that a complete partition is deleted in the middle: its index
will still be returned by the query, although it doesn't exist any more in
the table. It can be an issue if the list of indexes is actually used and
partitions are not deleted sequentially.
My cent worth to ensure data integrity.
Regards
Olivier Gautherot
Show quoted text
On 2023-10-26 11:56:56 +0200, Olivier Gautherot wrote:
El jue, 26 oct 2023 11:15, Peter J. Holzer <hjp-pgsql@hjp.at> escribió:
On 2023-10-25 17:48:46 +0200, Olivier Gautherot wrote:El mié, 25 oct 2023 16:58, Олег Самойлов <splarv@ya.ru> escribió:
Okey, I see no one was be able to solve this problem. But I could.
May be
for someone this will be useful too. There is solution.[...]
Now query is:
SELECT generate_series(min(id)/10000000, max(id)/10000000) AS n FROM
delivery) as part_numbers
WHERE (SELECT max(created_at) from delivery where n*10000000<=id
and id < (n+1)*10000000)
< CURRENT_DATE-'3 month'::interval;
I just realized that this query is mangled. I'm going to assume that it
should have been something like
with part_numbers as (
SELECT generate_series(min(id)/100, max(id)/100) as n
from delivery
)
select * from part_numbers
WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval;
[...]
Your fast solution will work as long as you don't have missing sequences
(like
deleted rows).
Why do you think this would break with missing sequence numbers?
In the suggested query, the return value contains a list of sequential numbers
from a min to a max - they seem to be markers of the partitions. Let's assume
that a complete partition is deleted in the middle: its index will still be
returned by the query, although it doesn't exist any more in the table.
I don't think it will. While the generate_series() will produce the
partition number, the where clause will not find any matching rows, so
the query will not return it.
E.g. (this table isn't partitioned, but that shouldn't affect the
result, also I'll reduce the "partition size" to 100 to make it more
readable):
create table delivery (id int, created_at date);
insert into delivery(200, '2000-01-01');
insert into delivery values(200, '2000-01-01');
insert into delivery values(299, '2000-12-01');
insert into delivery values(412, '2002-02-01');
insert into delivery values(439, '2002-03-01');
insert into delivery values(501, '2023-01-01');
insert into delivery values(555, now());
Note that there are no records in "partition" 3, and "partition" 5
contains current data, so we should get only "partition numbers" 2 and
4:
with part_numbers as (
SELECT generate_series(min(id)/100, max(id)/100) as n
from delivery
)
select * from part_numbers
WHERE (SELECT max(created_at) from delivery where n*100 <= id and id < (n+1)*100) < CURRENT_DATE-'3 month'::interval;
╔═══╗
║ n ║
╟───╢
║ 2 ║
║ 4 ║
╚═══╝
(2 rows)
Looks ok to me.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"