ANALYZE ONLY
Hello Postgres Hackers
An application that I am developing uses Postgresql, and includes a fairly large
number of partitioned tables which are used to store time series data.
The tables are partitioned by time, and typically there is only one partition
at a time - the current partition - that is actually being updated.
Older partitions
are available for query and eventually dropped.
As per the documentation, partitioned tables are not analyzed by the autovacuum
workers, although their partitions are. Statistics are needed on the partitioned
table level for at least some query planning activities.
The problem is that giving an ANALYZE command targeting a partitioned table
causes it to update statistics for the partitioned table AND all the individual
partitions. There is currently no option to prevent it from including the
partitions.
This is wasteful for our application: for one thing the autovacuum
has already analyzed the individual partitions; for another most of
the partitions
will have had no changes, so they don't need to be analyzed repeatedly.
I took some measurements when running ANALYZE on one of our tables. It
took approx
4 minutes to analyze the partitioned table, then 29 minutes to analyze the
partitions. We have hundreds of these tables, so the cost is very significant.
For my use case at least it would be fantastic if we could add an ONLY option
to ANALYZE, which would cause it to analyze the named table only and not
descend into the partitions.
I took a look at the source and it looks doable, but before launching into it
I thought I would ask a few questions here.
1. Would such a feature be welcomed? Are there any traps I might not
have thought of?
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name?
An alternative would be:
ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
Any feedback or advice would be great.
Regards
Mike.
On Tue, 20 Aug 2024 at 07:52, Michael Harris <harmic@gmail.com> wrote:
1. Would such a feature be welcomed? Are there any traps I might not
have thought of?
I think this sounds like a reasonable feature.
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name?
An alternative would be:ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
Any feedback or advice would be great.
Personally I'd prefer a new option to be added. But I agree ONLY isn't
a good name then. Maybe something like SKIP_PARTITIONS.
On 20.8.24 10:42, Jelte Fennema-Nio wrote:
On Tue, 20 Aug 2024 at 07:52, Michael Harris<harmic@gmail.com> wrote:
1. Would such a feature be welcomed? Are there any traps I might not
have thought of?I think this sounds like a reasonable feature.
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name?
An alternative would be:ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
Any feedback or advice would be great.
Personally I'd prefer a new option to be added. But I agree ONLY isn't
a good name then. Maybe something like SKIP_PARTITIONS.
Hi everyone,
Your proposal is indeed interesting, but I have a question: can't your
issue be resolved by properly configuring |autovacuum| instead of
developing a new feature for |ANALYZE|?
From my perspective, |ANALYZE| is intended to forcibly gather
statistics from all partitioned tables. If the goal is to ensure that
statistics are updated at the right moment, adjusting the
|autovacuum_analyze_threshold| and |autovacuum_analyze_scale_factor|
parameters might be the solution.
--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.
On Tue, 20 Aug 2024 at 23:25, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
Your proposal is indeed interesting, but I have a question: can't your issue be resolved by properly configuring autovacuum instead of developing a new feature for ANALYZE?
Basically, no. There's a "tip" in [1]https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-STATISTICS which provides information on
the limitation, namely:
"The autovacuum daemon does not issue ANALYZE commands for partitioned
tables. Inheritance parents will only be analyzed if the parent itself
is changed - changes to child tables do not trigger autoanalyze on the
parent table. If your queries require statistics on parent tables for
proper planning, it is necessary to periodically run a manual ANALYZE
on those tables to keep the statistics up to date."
There is also some discussion about removing the limitation in [2]/messages/by-id/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com.
While I agree that it would be nice to have autovacuum handle this,
it's not clear how exactly it would work. Additionally, if we had
that, it would still be useful if the ANALYZE command could be
instructed to just gather statistics for the partitioned table only.
David
[1]: https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-STATISTICS
[2]: /messages/by-id/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
Hi Michael,
Thanks for starting this thread. I've also spent a bit time on this after
reading your first thread on this issue [1]
Michael Harris <harmic@gmail.com>, 20 Ağu 2024 Sal, 08:52 tarihinde şunu
yazdı:
The problem is that giving an ANALYZE command targeting a partitioned table
causes it to update statistics for the partitioned table AND all the
individual
partitions. There is currently no option to prevent it from including the
partitions.This is wasteful for our application: for one thing the autovacuum
has already analyzed the individual partitions; for another most of
the partitions
will have had no changes, so they don't need to be analyzed repeatedly.
I agree that it's a waste to analyze partitions when they're already
analyzed by autovacuum. It would be nice to have a way to run analyze only
on a partitioned table without its partitions.
I took some measurements when running ANALYZE on one of our tables. It
took approx
4 minutes to analyze the partitioned table, then 29 minutes to analyze the
partitions. We have hundreds of these tables, so the cost is very
significant.
I quickly tweaked the code a bit to exclude partitions when a partitioned
table is being analyzed. I can confirm that there is a significant gain
even on a simple case like a partitioned table with 10 partitions and 1M
rows in each partition.
1. Would such a feature be welcomed? Are there any traps I might not
have thought of?
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name
An alternative would be:ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
I feel closer to adding this as an option instead of a new keyword in
ANALYZE grammar. To me, it would be easier to have this option and then
give the names of partitioned tables as opposed to typing ONLY before each
partition table.
But we should think of another name as ONLY is used differently (attached
to the table name as you mentioned) in other contexts.
I've been also thinking about how this new option should affect inheritance
tables. Should it have just no impact on them or only analyze the parent
table without taking child tables into account? There are two records for
an inheritance parent table in pg_statistic, one row for only the parent
table and a second row including children. We might only analyze the parent
table if this new "ONLY" option is specified. I'm not sure if that would be
something users would need or not, but I think this option should behave
similarly for both partitioned tables and inheritance tables.
If we decide to go with only partition tables and not care about
inheritance, then naming this option to SKIP_PARTITIONS as Jelte suggested
sounds fine. But that name wouldn't work if this option will affect
inheritance tables.
Thanks,
--
Melih Mutlu
Microsoft
Melih Mutlu <m.melihmutlu@gmail.com>, 20 Ağu 2024 Sal, 19:26 tarihinde şunu
yazdı:
Hi Michael,
Thanks for starting this thread. I've also spent a bit time on this after
reading your first thread on this issue [1]
Forgot to add the reference [1]/messages/by-id/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA@mail.gmail.com
[1]: /messages/by-id/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA@mail.gmail.com
/messages/by-id/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA@mail.gmail.com
On Tue, Aug 20, 2024 at 1:52 AM Michael Harris <harmic@gmail.com> wrote:
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name?
An alternative would be:ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
Any feedback or advice would be great.
I like trying to use ONLY somehow.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?
Making it a command option means that the option would apply to all
tables listed, whereas if it was more like an extended_relation_expr,
the option would be applied per table listed in the command.
1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its
partitions but get stats on ptab2 and stats on its partitions too.
2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2
without doing that on any of their partitions.
Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the
behaviour of #2.
If we did it as a per-table option, then we'd need to consider what
should happen if someone did: "VACUUM ONLY parttab;". Probably
silently doing nothing wouldn't be good. Maybe a warning, akin to
what's done in:
postgres=# analyze (skip_locked) a;
WARNING: skipping analyze of "a" --- lock not available
David
David Rowley <dgrowleyml@gmail.com> writes:
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?
Making it a command option means that the option would apply to all
tables listed, whereas if it was more like an extended_relation_expr,
the option would be applied per table listed in the command.
1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its
partitions but get stats on ptab2 and stats on its partitions too.
2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2
without doing that on any of their partitions.
FWIW, I think that's the right approach, for consistency with the
way that ONLY works in DML.
regards, tom lane
On Tue, Aug 20, 2024 at 6:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?
The table modifier idea seems nice to me.
If we did it as a per-table option, then we'd need to consider what
should happen if someone did: "VACUUM ONLY parttab;". Probably
silently doing nothing wouldn't be good. Maybe a warning, akin to
what's done in:postgres=# analyze (skip_locked) a;
WARNING: skipping analyze of "a" --- lock not available
Perhaps. I'm not sure about this part.
--
Robert Haas
EDB: http://www.enterprisedb.com
David Rowley <dgrowleyml@gmail.com>, 21 Ağu 2024 Çar, 01:53 tarihinde şunu
yazdı:
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?Making it a command option means that the option would apply to all
tables listed, whereas if it was more like an extended_relation_expr,
the option would be applied per table listed in the command.1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its
partitions but get stats on ptab2 and stats on its partitions too.
2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2
without doing that on any of their partitions.
I believe we should go this route if we want this to be called "ONLY" so
that it would be consistent with other places too.
Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the
behaviour of #2.
Havin it as an option would be easier to use when we have several partition
tables. But I agree that if we call it "ONLY ", it may be confusing and the
other approach would be appropriate.
If we did it as a per-table option, then we'd need to consider what
should happen if someone did: "VACUUM ONLY parttab;". Probably
silently doing nothing wouldn't be good. Maybe a warning, akin to
what's done in:postgres=# analyze (skip_locked) a;
WARNING: skipping analyze of "a" --- lock not available
+1 to raising a warning message in that case instead of staying silent. We
might also not allow ONLY if ANALYZE is not present in VACUUM query and
raise an error. But that would require changes in grams.y and could
complicate things. So it may not be necessary and we may be fine with just
a warning.
Regards,
--
Melih Mutlu
Microsoft
Thank you all for the replies & discussion.
It sounds like more are in favour of using the ONLY syntax attached to
the tables is the best way to go, with the main advantages being:
- consistency with other commands
- flexibility in allowing to specify whether to include partitions
for individual tables when supplying a list of tables
I will start working on an implementation along those lines. It looks
like we can simply replace qualified_name with relation_expr in the
production for vacuum_relation within gram.y.
One other thing I noticed when reading the code. The function
expand_vacuum_rel in vacuum.c seems to be responsible for adding the
partitions. If I am reading it correctly, it only adds child tables in
the case of a partitioned table, not in the case of an inheritance
parent:
include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE);
..
if (include_parts)
{
.. add partitions ..
This is a little different to some other contexts where the ONLY
keyword is used, in that ONLY would be the default and only available
mode of operation for an inheritance parent.
Regards,
Mike
Show quoted text
On Wed, 21 Aug 2024 at 20:04, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
David Rowley <dgrowleyml@gmail.com>, 21 Ağu 2024 Çar, 01:53 tarihinde şunu yazdı:
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?Making it a command option means that the option would apply to all
tables listed, whereas if it was more like an extended_relation_expr,
the option would be applied per table listed in the command.1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its
partitions but get stats on ptab2 and stats on its partitions too.
2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2
without doing that on any of their partitions.I believe we should go this route if we want this to be called "ONLY" so that it would be consistent with other places too.
Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the
behaviour of #2.Havin it as an option would be easier to use when we have several partition tables. But I agree that if we call it "ONLY ", it may be confusing and the other approach would be appropriate.
If we did it as a per-table option, then we'd need to consider what
should happen if someone did: "VACUUM ONLY parttab;". Probably
silently doing nothing wouldn't be good. Maybe a warning, akin to
what's done in:postgres=# analyze (skip_locked) a;
WARNING: skipping analyze of "a" --- lock not available+1 to raising a warning message in that case instead of staying silent. We might also not allow ONLY if ANALYZE is not present in VACUUM query and raise an error. But that would require changes in grams.y and could complicate things. So it may not be necessary and we may be fine with just a warning.
Regards,
--
Melih Mutlu
Microsoft
On Thu, 22 Aug 2024 at 11:32, Michael Harris <harmic@gmail.com> wrote:
One other thing I noticed when reading the code. The function
expand_vacuum_rel in vacuum.c seems to be responsible for adding the
partitions. If I am reading it correctly, it only adds child tables in
the case of a partitioned table, not in the case of an inheritance
parent:include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE);
..
if (include_parts)
{
.. add partitions ..This is a little different to some other contexts where the ONLY
keyword is used, in that ONLY would be the default and only available
mode of operation for an inheritance parent.
That's inconvenient and quite long-established behaviour. I had a look
as far back as 9.2 and we only analyze parents there too. I'm keen on
the ONLY syntax, but it would be strange if ONLY did the same thing as
not using ONLY for inheritance parents.
I feel like we might need to either bite the bullet and make ONLY work
consistently with both, or think of another way to have ANALYZE not
recursively gather stats for each partition on partitioned tables.
Could we possibly get away with changing inheritance parent behaviour?
David
David Rowley <dgrowleyml@gmail.com> writes:
I feel like we might need to either bite the bullet and make ONLY work
consistently with both, or think of another way to have ANALYZE not
recursively gather stats for each partition on partitioned tables.
Could we possibly get away with changing inheritance parent behaviour?
Yeah, my thought was to change the behavior so it's consistent in
that case too. This doesn't seem to me like a change that'd
really cause anybody serious problems: ANALYZE without ONLY would
do more than before, but it's work you probably want done anyway.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, my thought was to change the behavior so it's consistent in
that case too. This doesn't seem to me like a change that'd
really cause anybody serious problems: ANALYZE without ONLY would
do more than before, but it's work you probably want done anyway.
Would we want to apply that change to VACUUM too? That might be a
bit drastic, especially if you had a multi-level inheritance structure featuring
large tables.
It feels a bit like VACUUM and ANALYZE have opposite natural defaults here.
For VACUUM it does not make much sense to vacuum only at the partitioned
table level and not include the partitions, since it would do nothing
- that might
be why the existing code always adds the partitions.
On Thu, 22 Aug 2024 at 13:38, Michael Harris <harmic@gmail.com> wrote:
Would we want to apply that change to VACUUM too? That might be a
bit drastic, especially if you had a multi-level inheritance structure featuring
large tables.
I think they'd need to work the same way as for "VACUUM (ANALYZE)", it
would be strange to analyze some tables that you didn't vacuum. It's
just a much bigger pill to swallow in terms of the additional effort.
It feels a bit like VACUUM and ANALYZE have opposite natural defaults here.
For VACUUM it does not make much sense to vacuum only at the partitioned
table level and not include the partitions, since it would do nothing
- that might
be why the existing code always adds the partitions.
Yeah, I suspect that's exactly why it was coded that way.
David
Hi All,
Here is a first draft of a patch to implement the ONLY option for
VACUUM and ANALYZE.
I'm a little nervous about the implications of changing the behaviour of VACUUM
for inheritance structures; I can imagine people having regularly
executed scripts
that currently vacuum all the tables in their inheritance structure;
after this change
they might get more vacuuming than they bargained for.
It's my first attempt to submit a patch here so apologies if I've
missed any part
of the process.
Cheers
Mike
Show quoted text
On Thu, 22 Aug 2024 at 12:09, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 22 Aug 2024 at 13:38, Michael Harris <harmic@gmail.com> wrote:
Would we want to apply that change to VACUUM too? That might be a
bit drastic, especially if you had a multi-level inheritance structure featuring
large tables.I think they'd need to work the same way as for "VACUUM (ANALYZE)", it
would be strange to analyze some tables that you didn't vacuum. It's
just a much bigger pill to swallow in terms of the additional effort.It feels a bit like VACUUM and ANALYZE have opposite natural defaults here.
For VACUUM it does not make much sense to vacuum only at the partitioned
table level and not include the partitions, since it would do nothing
- that might
be why the existing code always adds the partitions.Yeah, I suspect that's exactly why it was coded that way.
David
Attachments:
0001-Initial-implementation-of-the-ONLY-feature.patchapplication/octet-stream; name=0001-Initial-implementation-of-the-ONLY-feature.patchDownload+140-17
Hi Michael,
Thanks for the patch.
I quickly tried running some ANALYZE ONLY queries, it seems like it works
fine.
-ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] )
] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
+ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] )
] [ [ ONLY ] <replaceable class="parameter">table_and_columns</replaceable>
[, ...] ]
It seems like extended_relation_expr allows "tablename *" syntax too. That
should be added in docs as well. (Same for VACUUM doc)
<para>
For partitioned tables, <command>ANALYZE</command> gathers statistics
by
sampling rows from all partitions; in addition, it will recurse into
each
partition and update its statistics. Each leaf partition is analyzed
only
once, even with multi-level partitioning. No statistics are collected
for
only the parent table (without data from its partitions), because with
partitioning it's guaranteed to be empty.
</para>
We may also want to update the above note in ANALYZE doc.
+-- ANALYZE ONLY / VACUUM ONLY on partitioned table
+CREATE TABLE only_parted (a int, b char) PARTITION BY LIST (a); +CREATE TABLE only_parted1 PARTITION OF vacparted FOR VALUES IN (1);
+INSERT INTO only_parted1 VALUES (1, 'a');
Tests don't seem right to me.
I believe the above should be " PARTITION OF only_parted " instead of
vacparted.
It may better to insert into partitioned table (only_parted) instead of the
partition (only_parted1);
Also it may be a good idea to test VACUUM ONLY for inheritance tables the
same way you test ANALYZE ONLY.
Lastly, the patch includes an unrelated file (compile_flags.txt) and has
whitespace errors when I apply it.
Regards,
--
Melih Mutlu
Microsoft
Thanks for the feedback Melih,
On Thu, 22 Aug 2024 at 21:26, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
It seems like extended_relation_expr allows "tablename *" syntax too. That should be added in docs as well. (Same for VACUUM doc)
I had included it in the parameter description but had missed it from
the synopsis. I've fixed that now.
We may also want to update the above note in ANALYZE doc.
Well spotted. I have updated that now.
Tests don't seem right to me.
I believe the above should be " PARTITION OF only_parted " instead of vacparted.
It may better to insert into partitioned table (only_parted) instead of the partition (only_parted1);Also it may be a good idea to test VACUUM ONLY for inheritance tables the same way you test ANALYZE ONLY.
Well spotted again. I have fixed the incorrect table names and added
tests as suggested.
Lastly, the patch includes an unrelated file (compile_flags.txt) and has whitespace errors when I apply it.
Oops! Fixed,
V2 of the patch is attached.
Cheers
Mike
Attachments:
v2-0001-Initial-implementation-of-the-ONLY-feature.patchapplication/octet-stream; name=v2-0001-Initial-implementation-of-the-ONLY-feature.patchDownload+176-24
Hi Michael,
Michael Harris <harmic@gmail.com>, 23 Ağu 2024 Cum, 13:01 tarihinde şunu
yazdı:
V2 of the patch is attached.
Thanks for updating the patch. I have a few more minor feedbacks.
-ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] )
] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
+ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] )
] [ [ ONLY ] <replaceable class="parameter">table_and_columns</replaceable>
[, ...] ]
I believe moving "[ ONLY ]" to the definitions of table_and_columns below,
as you did with "[ * ]", might be better to be consistent with other places
(see [1]https://www.postgresql.org/docs/16/sql-createpublication.html)
+ if ((options & VACOPT_VACUUM) && is_partitioned_table && ! i
nclude_children)
There are also some issues with coding conventions in some places (e.g. the
space between "!" and "include_children" abode). I think running pgindent
would resolve such issue in most places.
[1]: https://www.postgresql.org/docs/16/sql-createpublication.html
Regards,
--
Melih Mutlu
Microsoft