test_autovacuum/001_parallel_autovacuum is broken

Started by Sami Imseih · 8 days ago · 20 messages · hackers
#1 Sami Imseih
samimseih@gmail.com

Hi,

I noticed that the test introduced in parallel autovacuum in 1ff3180ca01 was
very slow, but eventually succeeded. I tracked it down to the point in
the test that is waiting for "parallel autovacuum worker updated cost params".

This portion of the test that is waiting for the cost params to propagate to the
workers is getting stuck on wait_for_autovacuum_complete(). At the time
it's stuck the injection point from the previous test
autovacuum-start-parallel-vacuum
is still active on template1 tables.

  datname  |                           query                            |            wait_event
-----------+------------------------------------------------------------+----------------------------------
 postgres  | select datname, query, wait_event from pg_stat_activity ;  |
 template1 | autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute         | autovacuum-start-parallel-vacuum
           |                                                            | AutovacuumMain
           |                                                            | LogicalLauncherMain
           |                                                            | IoWorkerMain
           |                                                            | IoWorkerMain
           |                                                            | IoWorkerMain
           |                                                            | CheckpointerMain
           |                                                            | BgwriterMain
           |                                                            | WalWriterMain
(10 rows)

The poll_query_until eventually just times out, but this does not
cause the test to fail.

# test succeeded
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
#
# SELECT autovacuum_count > 1 FROM pg_stat_user_tables WHERE
# relname = 'test_autovac'
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
================

This issue only occurs when I run all the tests, not when I run
test_autovacuum in isolation. That makes sense, since only during a full
run does autovacuum also process template1 and other tables unrelated to
the test.

I run all the tests (the equivalent of check-world) with:
```
meson test -q --print-errorlogs
```

I think we can remove the second wait_for_autovacuum_complete()
call in the test, as all we really need is to wait_for_log to guarantee
the cost parameters were updated. No need to wait for the autovacuum
to complete.

--- a/src/test/modules/test_autovacuum/t/001_parallel_autovacuum.pl
+++ b/src/test/modules/test_autovacuum/t/001_parallel_autovacuum.pl
@@ -177,8 +177,6 @@ $node->wait_for_log(
        qr/parallel autovacuum worker updated cost params:
cost_limit=500, cost_delay=5, cost_page_miss=10, cost_page_dirty=10,
cost_page_hit=10/,
        $log_offset);

-wait_for_autovacuum_complete($node, $av_count);
-
# Cleanup
$node->safe_psql(
'postgres', qq{

Regards,

--
Sami Imseih
Amazon Web Services (AWS)

#2 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Sami Imseih (#1)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Mon, Apr 6, 2026 at 7:24 PM Sami Imseih <samimseih@gmail.com> wrote:

Hi,

I noticed that the test introduced in parallel autovacuum in 1ff3180ca01 was
very slow, but eventually succeeded. I tracked it down to the point in
the test that is waiting for "parallel autovacuum worker updated cost params".

Thank you for the report.

This portion of the test that is waiting for the cost params to propagate to the
workers is getting stuck on wait_for_autovacuum_complete(). At the time
it's stuck the injection point from the previous test
autovacuum-start-parallel-vacuum
is still active on template1 tables.

(snip)

I think we can remove the second wait_for_autovacuum_complete()
call in the test, as all we really need is to wait_for_log to guarantee
the cost parameters were updated. No need to wait for the autovacuum
to complete.

It sounds like a good idea. In test 2, we make garbage tuples on the
test_autovac table, but that doesn't necessarily mean autovacuum will work
only on that table. Also, given that the purpose of this test is to check
that the propagation works fine, we can verify it with whatever tables are
eligible for parallel vacuum.

I've attached the patch, and am going to push it barring any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

0001-Remove-an-unstable-wait-from-parallel-autovacuum-reg.patch (text/x-patch, +0/-3)

#3 Daniil Davydov
3danissimo@gmail.com
In reply to: Masahiko Sawada (#2)
Re: test_autovacuum/001_parallel_autovacuum is broken

Hi,

On Mon, Apr 6, 2026 at 7:24 PM Sami Imseih <samimseih@gmail.com> wrote:

I noticed that the test introduced in parallel autovacuum in 1ff3180ca01 was
very slow, but eventually succeeded. I tracked it down to the point in
the test that is waiting for "parallel autovacuum worker updated cost params".

Good catch!

On Tue, Apr 7, 2026 at 2:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Apr 6, 2026 at 7:24 PM Sami Imseih <samimseih@gmail.com> wrote:

I think we can remove the second wait_for_autovacuum_complete()
call in the test, as all we really need is to wait_for_log to guarantee
the cost parameters were updated. No need to wait for the autovacuum
to complete.

It sounds like a good idea. In test 2, we make garbage tuples on the
test_autovac table, but that doesn't necessarily mean autovacuum will work
only on that table. Also, given that the purpose of this test is to check
that the propagation works fine, we can verify it with whatever tables are
eligible for parallel vacuum.

Yeah, we only need to ensure that there will be free parallel workers in
the bgworkers pool. Since only autovacuum can launch parallel workers in
this test, I think everything is OK.

The proposed patch fixes the problem, but I am thinking about possible new
tests for parallel a/v. What if some of them require both injection points
and a wait_for_autovacuum_complete() call? If the reloption could override
the GUC parameter, we could disable parallel a/v globally and turn it on
for a particular table. In that case we could avoid such a problem.

--
Best regards,
Daniil Davydov

#4 Sami Imseih
samimseih@gmail.com
In reply to: Daniil Davydov (#3)
Re: test_autovacuum/001_parallel_autovacuum is broken

I think we can remove the second wait_for_autovacuum_complete()
call in the test, as all we really need is to wait_for_log to guarantee
the cost parameters were updated. No need to wait for the autovacuum
to complete.

It sounds like a good idea. In test 2, we make garbage tuples on the
test_autovac table, but that doesn't necessarily mean autovacuum will work
only on that table. Also, given that the purpose of this test is to check
that the propagation works fine, we can verify it with whatever tables are
eligible for parallel vacuum.

Yeah, we only need to ensure that there will be free parallel workers in
bgworkers pool. Since only autovacuum can launch parallel workers in this
test, I think everything is OK.

The proposed patch fixes the problem, but I am thinking about possible new
tests for parallel a/v. What if some of them will require both injection points
and wait_for_autovacuum_complete call?

Yeah, we could use dynamically named injection points, which I don't
see a precedent for. For example, we could construct a name like
"autovacuum-test-<relname>" and have the autovacuum code path
generate the injection point name dynamically based on the relation
being vacuumed. That way, the test only blocks on the specific table
it cares about.
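A minimal sketch of what such a dynamically named point might look like on the C side; the helper name, buffer size, and naming scheme are all hypothetical, nothing like this exists in core today:

```c
#include <stdio.h>

/* Hypothetical naming scheme: one injection point per relation, e.g.
 * "autovacuum-test-pg_attribute".  The autovacuum code path would build
 * the name from the relation being vacuumed, and the test would attach
 * only to the point for the table it cares about. */
#define INJ_POINT_NAME_MAXLEN 64

void
build_autovacuum_point_name(char *buf, const char *relname)
{
    snprintf(buf, INJ_POINT_NAME_MAXLEN, "autovacuum-test-%s", relname);
}
```

The TAP test would then wait only on "autovacuum-test-test_autovac" and ignore points fired for the template1 catalogs.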

But I don't think we need to do this now, since the test we are dealing
with does not need to wait for autovacuum to complete.

If the reloption could override the GUC
parameter, then we could disable parallel a/v globally and turn it on for the
particular table. In this case we can avoid such a problem.

That is an interesting idea, which may be possible since we do reserve
a/v workers at startup with autovacuum_worker_slots. Although I am not
sure we should add a feature that makes disabling autovacuum globally a
supported behavior.

--
Sami

#5 Daniil Davydov
3danissimo@gmail.com
In reply to: Sami Imseih (#4)
Re: test_autovacuum/001_parallel_autovacuum is broken

Hi,

On Tue, Apr 7, 2026 at 11:33 PM Sami Imseih <samimseih@gmail.com> wrote:

The proposed patch fixes the problem, but I am thinking about possible new
tests for parallel a/v. What if some of them will require both injection points
and wait_for_autovacuum_complete call?

Yeah, we could use dynamically named injection points, which I don't
see a precedent for. For example, we could construct a name like
"autovacuum-test-<relname>" and have the autovacuum code path
generate the injection point name dynamically based on the relation
being vacuumed. That way, the test only blocks on the specific table
it cares about.

I am afraid that this would be too rough a workaround for this problem..

But I don't think we need to do this now, since the test we are dealing
with does not need to wait for autovacuum to complete.

Sure! But this test needs to be slightly reworked in the future.

If the reloption could override the GUC
parameter, then we could disable parallel a/v globally and turn it on for the
particular table. In this case we can avoid such a problem.

That is an interesting idea which may be possible since we do reserve
a/v workers at startup with autovacuum_worker_slots. Although I am
not sure if we should be putting a feature that makes disabling
autovacuum globally supported behavior.

Hm, I think I didn't explain my thoughts well enough.
Let me describe the current logic in short: each a/v worker can now take from
the bgworkers pool as many parallel workers as the GUC parameter
(autovacuum_max_parallel_workers) allows.

We also have an "autovacuum_parallel_workers" reloption that can additionally
limit the number of parallel workers for the table. The default value of the
reloption is "-1", which means "use the GUC parameter's value". I.e., when we
set the GUC parameter to N, every table automatically allows N parallel a/v
workers. If autovacuum_max_parallel_workers = 0, then no one can launch
parallel workers for autovacuum, even if the reloption is > 0. Thus,
autovacuum_max_parallel_workers is the main limiter in the parallel worker
count calculation.

But I suggest an alternative idea: allow the reloption to override the GUC
parameter. So even if autovacuum_max_parallel_workers is 0, we can still
enable parallel a/v for a particular table via the reloption.

This approach allows us to rework the test as follows:
1) Keep the default value of the GUC parameter, which means that no table
allows parallel a/v.
2) Set the reloption of a particular table to N (allowing parallel a/v for
this and only this table).
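For illustration, the two policies — the current behavior where autovacuum_max_parallel_workers caps everything, and the proposed behavior where a non-negative reloption overrides the GUC — could be sketched as follows (the function names are mine, not PostgreSQL's):

```c
/* Current semantics: the GUC is the main limiter; a reloption of -1 means
 * "use the GUC's value", and the result never exceeds the GUC. */
int
workers_guc_caps(int guc_max, int reloption)
{
    int n = (reloption < 0) ? guc_max : reloption;

    return (n > guc_max) ? guc_max : n;
}

/* Proposed semantics: a non-negative reloption overrides the GUC, so a
 * single table can enable parallel a/v even when the GUC is 0. */
int
workers_reloption_overrides(int guc_max, int reloption)
{
    return (reloption < 0) ? guc_max : reloption;
}
```

Under the second policy, the test could leave the GUC at its no-parallelism default and set the reloption only on test_autovac, so template1 tables would never run parallel vacuum.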

This approach may also be very useful in large production systems. You can
read the discussion about it from [1] up to the end of the thread. Since
the question is still open, all feedback is welcome!

[1]: /messages/by-id/CAJDiXgj3A=wNC-S0z3TixmnVUkifs=07yLLHJ7_+dDsakft1tA@mail.gmail.com

--
Best regards,
Daniil Davydov

#6 Sami Imseih
samimseih@gmail.com
In reply to: Daniil Davydov (#5)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Tue, Apr 7, 2026 at 11:33 PM Sami Imseih <samimseih@gmail.com> wrote:

The proposed patch fixes the problem, but I am thinking about possible new
tests for parallel a/v. What if some of them will require both injection points
and wait_for_autovacuum_complete call?

Yeah, we could use dynamically named injection points, which I don't
see a precedent for. For example, we could construct a name like
"autovacuum-test-<relname>" and have the autovacuum code path
generate the injection point name dynamically based on the relation
being vacuumed. That way, the test only blocks on the specific table
it cares about.

I am afraid that this would be too rough a workaround for this problem..

Perhaps, but I don't see it being unreasonable for injection points.

I guess we can also think about expanding InjectionPointCondition to
handle other types of conditions, maybe OID??, to filter when running
the point.

/*
* Conditions related to injection points. This tracks in shared memory the
* runtime conditions under which an injection point is allowed to run,
* stored as private_data when an injection point is attached, and passed as
* argument to the callback.
*
* If more types of runtime conditions need to be tracked, this structure
* should be expanded.
*/
typedef enum InjectionPointConditionType
{
    INJ_CONDITION_ALWAYS = 0,   /* always run */
    INJ_CONDITION_PID,          /* PID restriction */
} InjectionPointConditionType;

typedef struct InjectionPointCondition
{
    /* Type of the condition */
    InjectionPointConditionType type;

    /* ID of the process where the injection point is allowed to run */
    int         pid;
} InjectionPointCondition;
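A hedged sketch of how the structure might be expanded along those lines; INJ_CONDITION_OID, the extra field, and the check helper are hypothetical, not part of the in-tree injection_points module:

```c
#include <stdbool.h>

/* Sketch only: INJ_CONDITION_OID and the relid field are hypothetical
 * extensions of the in-tree structure, not existing code. */
typedef enum InjectionPointConditionType
{
    INJ_CONDITION_ALWAYS = 0,   /* always run */
    INJ_CONDITION_PID,          /* PID restriction */
    INJ_CONDITION_OID           /* restrict to one relation OID (hypothetical) */
} InjectionPointConditionType;

typedef struct InjectionPointCondition
{
    InjectionPointConditionType type;
    int          pid;           /* valid for INJ_CONDITION_PID */
    unsigned int relid;         /* valid for INJ_CONDITION_OID */
} InjectionPointCondition;

/* The callback would consult the condition before letting the point run,
 * passing in the OID of the relation currently being vacuumed. */
bool
injection_point_allowed(const InjectionPointCondition *cond,
                        int my_pid, unsigned int current_relid)
{
    switch (cond->type)
    {
        case INJ_CONDITION_ALWAYS:
            return true;
        case INJ_CONDITION_PID:
            return cond->pid == my_pid;
        case INJ_CONDITION_OID:
            return cond->relid == current_relid;
    }
    return false;
}
```

With a condition like this, the test's injection point would simply never fire for template1 catalogs, so the previous test could not block them.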

We also have an "autovacuum_parallel_workers" reloption that can additionally
limit the number of parallel workers for the table. Default value of the
reloption is "-1" which means "use the GUC parameter's value". I.e. when we are
setting the GUC parameter to N, then every table automatically allows N
parallel a/v workers. If autovacuum_max_parallel_workers = 0 then no one can
launch parallel workers for autovacuum, even if reloption is > 0. Thus,
autovacuum_max_parallel_workers is the main limiter during the number of
parallel workers calculation.

autovacuum_max_parallel_workers being the limiter is a desirable attribute;
otherwise it would allow users to disable the GUC and set whatever they
want at a per-table level, guarded only by max_parallel_workers. That to me
sounds pretty easy to misconfigure and hard to manage.

But I suggest an alternative idea - allow reloption to override GUC parameter.
So even if autovacuum_max_parallel_workers is 0 we still can enable parallel
a/v for a particular table via reloption.

This approach allows us to rework the test as follows :
1) Keep the default value of GUC parameter which means that no table allows
parallel a/v.
2) Set reloption of a particular table to N (allow parallel a/v for this and
only this table).

This approach may also be very useful in large productions. You can read
discussion about it from here [1] up to the end of the thread. Since the
question is still open, all feedback is welcome!

[1] /messages/by-id/CAJDiXgj3A=wNC-S0z3TixmnVUkifs=07yLLHJ7_+dDsakft1tA@mail.gmail.com

Thanks!

--
Sami

#7 Daniil Davydov
3danissimo@gmail.com
In reply to: Sami Imseih (#6)
Re: test_autovacuum/001_parallel_autovacuum is broken

Hi,

On Wed, Apr 8, 2026 at 1:03 AM Sami Imseih <samimseih@gmail.com> wrote:

I am afraid that this would be too rough a workaround for this problem..

Perhaps, but I don't see it being unreasonable for injection points.

I guess we can also think about expanding InjectionPointCondition to
handle other types of conditions, maybe OID??, to filter when running
the point.

Hm, sounds reasonable.

I am thinking about how to make it less invasive. For example, we can write
an extension for this test (like many other tests do). The extension could
allow us to specify the OID of the table we are interested in. See the
attached patch, which demonstrates my idea. What do you think?

We also have an "autovacuum_parallel_workers" reloption that can additionally
limit the number of parallel workers for the table. Default value of the
reloption is "-1" which means "use the GUC parameter's value". I.e. when we are
setting the GUC parameter to N, then every table automatically allows N
parallel a/v workers. If autovacuum_max_parallel_workers = 0 then no one can
launch parallel workers for autovacuum, even if reloption is > 0. Thus,
autovacuum_max_parallel_workers is the main limiter during the number of
parallel workers calculation.

autovacuum_max_parallel_workers being the limiter is a desirable
attribute, otherwise
it will allow users to disable the GUC and set whatever they want on a
per table level,
only guarded by max_parallel_workers. That to me sounds pretty easy to
misconfigure
and manage.

Yes, this is the main argument against this idea. However, in the thread that
I mentioned I tried to give arguments why this might be extremely convenient
for users with large databases.

--
Best regards,
Daniil Davydov

Attachments:

0001-Improvements-for-parallel-autovacuum-testing.patch (text/x-patch, +222/-9)

#8 Michael Paquier
michael@paquier.xyz
In reply to: Sami Imseih (#6)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Tue, Apr 07, 2026 at 01:02:49PM -0500, Sami Imseih wrote:

Perhaps, but I don't see it being unreasonable for injection points.

I guess we can also think about expanding InjectionPointCondition to
handle other types of conditions, maybe OID??, to filter when running
the point.

Yeah, InjectionPointConditionType was designed with these types of
extensions in mind in terms of conditions you want to assign to a
point name when attaching it, with the check happening in the callback
attached when it is run. It should not be complicated to extend
injection_points_attach(): just pass down a string that is then
translated to an OID, or you could use a different grammar as well.
One thing that I'd be careful about is to handle that with one
argument in the SQL attach function, with the condition data given as
an input string. One grammar that Alexander K. designed at some point
for some of the facilities he had proposed was a JSON input, but
that's an implementation artifact.

Doing something as a separate module/library would be also fine. We
do that for the AIO tests.
--
Michael

#9 Sami Imseih
samimseih@gmail.com
In reply to: Michael Paquier (#8)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Tue, Apr 07, 2026 at 01:02:49PM -0500, Sami Imseih wrote:

Perhaps, but I don't see it being unreasonable for injection points.

I guess we can also think about expanding InjectionPointCondition to
handle other types of conditions, maybe OID??, to filter when running
the point.

Yeah, InjectionPointConditionType was designed with these types of
extensions in mind in terms of conditions you want to assign to a
point name when attaching it, with the check happening in the callback
attached when it is run. It should not be complicated to extend
injection_points_attach(), just pass down a string that it then
translated to an OID, or you could use a different grammar as well.
One thing that I'd be careful about is to handle that with one
argument in the SQL attach function, with the condition data given as
an input string. One grammar that Alexander K. designed at some point
for some of the facilities he had proposed was a JSON input, but
that's an implementation artifact.

Doing something as a separate module/library would be also fine. We
do that for the AIO tests.

I think we can enhance these tests using a separate module as Daniil is
suggesting, but we should probably get 0001 committed first and then
have a quick follow-up. What do you think?

--
Sami

#10 Alexander Lakhin
exclusion@gmail.com
In reply to: Sami Imseih (#1)
Re: test_autovacuum/001_parallel_autovacuum is broken

Hello,

07.04.2026 05:23, Sami Imseih wrote:

I noticed that the test introduced in parallel autovacuum in 1ff3180ca01 was
very slow, but eventually succeeded. I tracked it down to the point in
the test that is waiting for "parallel autovacuum worker updated cost params".

I've found another issue with the test, manifested on the buildfarm, at
least at [1]:
[06:54:07.738](4.121s) not ok 1
[06:54:07.769](0.031s) #   Failed test at
/home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/modules/test_autovacuum/t/001_parallel_autovacuum.pl line 133.
### Stopping node "main" using mode fast

The corresponding test code:
# Wait until the parallel autovacuum on table is completed. At the same time,
# we check that the required number of parallel workers has been started.
wait_for_autovacuum_complete($node, $av_count);
ok( $node->log_contains(
        qr/parallel workers: index vacuum: 2 planned, 2 launched in total/,
        $log_offset));

but regress_log_001_parallel_autovacuum contains this string:
2026-04-07 06:54:07.736 CEST [1825954][autovacuum worker][102/5:0] LOG:  automatic vacuum of table
"postgres.public.test_autovac": index scans: 1
...
    parallel workers: index vacuum: 2 planned, 2 launched in total

though the timestamp difference is only 2 ms. I tried the following
modification:
@@ -1222,6 +1222,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
                              (double) dead_items_max_bytes / (1024 * 1024));
             appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

+pg_usleep(300000);
             ereport(verbose ? INFO : LOG,
                     (errmsg_internal("%s", buf.data)));
             pfree(buf.data);

and it makes the test fail for me on each run.
Could you please look if this can be fixed too?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-04-07%2004%3A41%3A09

Best regards,
Alexander

#11 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Sami Imseih (#9)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Wed, Apr 8, 2026 at 7:52 AM Sami Imseih <samimseih@gmail.com> wrote:

On Tue, Apr 07, 2026 at 01:02:49PM -0500, Sami Imseih wrote:

Perhaps, but I don't see it being unreasonable for injection points.

I guess we can also think about expanding InjectionPointCondition to
handle other types of conditions, maybe OID??, to filter when running
the point.

Yeah, InjectionPointConditionType was designed with these types of
extensions in mind in terms of conditions you want to assign to a
point name when attaching it, with the check happening in the callback
attached when it is run. It should not be complicated to extend
injection_points_attach(), just pass down a string that it then
translated to an OID, or you could use a different grammar as well.
One thing that I'd be careful about is to handle that with one
argument in the SQL attach function, with the condition data given as
an input string. One grammar that Alexander K. designed at some point
for some of the facilities he had proposed was a JSON input, but
that's an implementation artifact.

Doing something as a separate module/library would be also fine. We
do that for the AIO tests.

I think we can enhance these tests using a separate module as Daniil is
suggesting, but we should probably get 0001 committed first and then
have a quick follow-up. What do you think?

+1. I've pushed the fix.

It might also be useful if an injection point is activated on the
particular database rather than cluster-wide.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#12 Sami Imseih
samimseih@gmail.com
In reply to: Masahiko Sawada (#11)
Re: test_autovacuum/001_parallel_autovacuum is broken

I think we can enhance these tests using a separate module as Daniil is
suggesting, but we should probably get 0001 committed first and then
have a quick follow-up. What do you think?

+1. I've pushed the fix.

Thanks!

It might also be useful if an injection point is activated on the
particular database rather than cluster-wide.

Here is something that could be discussed for v20 [1].

[1]: https://commitfest.postgresql.org/patch/6663/

--
Sami

#13 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alexander Lakhin (#10)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Thu, Apr 9, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Hello,

07.04.2026 05:23, Sami Imseih wrote:

I noticed that the test introduced in parallel autovacuum in 1ff3180ca01 was
very slow, but eventually succeeded. I tracked it down to the point in
the test that is waiting for "parallel autovacuum worker updated cost params".

I've found another issue with the test manifested on buildfarm, at least
at [1]:
[06:54:07.738](4.121s) not ok 1
[06:54:07.769](0.031s) # Failed test at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/modules/test_autovacuum/t/001_parallel_autovacuum.pl line 133.
### Stopping node "main" using mode fast

The corresponding test code:
# Wait until the parallel autovacuum on table is completed. At the same time,
# we check that the required number of parallel workers has been started.
wait_for_autovacuum_complete($node, $av_count);
ok( $node->log_contains(
qr/parallel workers: index vacuum: 2 planned, 2 launched in total/,
$log_offset));

but regress_log_001_parallel_autovacuum contains this string:
2026-04-07 06:54:07.736 CEST [1825954][autovacuum worker][102/5:0] LOG: automatic vacuum of table "postgres.public.test_autovac": index scans: 1
...
parallel workers: index vacuum: 2 planned, 2 launched in total

though the timestamp difference is only 2 ms. I tried the following
modification:
@@ -1222,6 +1222,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
(double) dead_items_max_bytes / (1024 * 1024));
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

+pg_usleep(300000);
ereport(verbose ? INFO : LOG,
(errmsg_internal("%s", buf.data)));
pfree(buf.data);

and it makes the test fail for me on each run.
Could you please look if this can be fixed too?

Thank you for the report.

The root cause seems to me to be that it's not guaranteed that we can see
the autovacuum logs after checking the statistics (i.e.,
pg_stat_user_tables), as we update the statistics first and then write the
log.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14 Sami Imseih
samimseih@gmail.com
In reply to: Masahiko Sawada (#13)
Re: test_autovacuum/001_parallel_autovacuum is broken

The root cause seems to me that it's not guaranteed that we can see
the autovacuum logs after checking the statistics (i.e.,
pg_stat_user_tables) as we update the statistics and then write the
log.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

+1. I was going to reply with exactly this. Attached is the fix.

--
Sami

Attachments:

v1-0001-Fix-unstable-log_contains-in-parallel-autovacuum-.patch (application/octet-stream, +1/-14)

#15 Daniil Davydov
3danissimo@gmail.com
In reply to: Sami Imseih (#14)
Re: test_autovacuum/001_parallel_autovacuum is broken

Hi,

On Fri, Apr 10, 2026 at 1:32 AM Sami Imseih <samimseih@gmail.com> wrote:

The root cause seems to me that it's not guaranteed that we can see
the autovacuum logs after checking the statistics (i.e.,
pg_stat_user_tables) as we update the statistics and then write the
log.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

+1. I was going to reply with exactly this. Attached is the fix.

Thank you for the patch! I agree with this approach.

Optionally, we can change the comment above to something like this:
# Wait until the parallel autovacuum on the table completes and reports the
# number of launched workers, which must correspond to the value specified in
# the reloption.

IMHO it better reflects what is going on.

--
Best regards,
Daniil Davydov

#16 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Sami Imseih (#14)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Thu, Apr 9, 2026 at 11:32 AM Sami Imseih <samimseih@gmail.com> wrote:

The root cause seems to me that it's not guaranteed that we can see
the autovacuum logs after checking the statistics (i.e.,
pg_stat_user_tables) as we update the statistics and then write the
log.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

+1. I was going to reply with exactly this. Attached is the fix.

Thank you for the patch! I agree with the overall idea. Since we enable
the autovacuum log only for the test_autovac table, just checking the
autovacuum log works as expected.

I think we can simplify the test further by removing the logic around
the av_count variable.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#17 Sami Imseih
samimseih@gmail.com
In reply to: Masahiko Sawada (#16)
Re: test_autovacuum/001_parallel_autovacuum is broken

Optionally, we can change the comment above to something like this:
# Wait until the parallel autovacuum on the table completes and reports the
# number of launched workers, which must correspond to the value specified in
# the reloption.

Made the comment less verbose but in the same spirit as the above.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

+1. I was going to reply with exactly this. Attached is the fix.

Thank you for the patch! I agree with the overall idea. Since we
enable autovacuum log only the test_autovac table, just checking
autovacuum log works as expected.

I think we can simplify the test further by removing the logic around
the av_count variable.

Removed av_count and the pg_stat_user_tables query, but hardened the
regexp a bit to ensure that the parallel-worker logging is for the
test_autovac table. It gives the same assurance as counting via
pg_stat_user_tables and will be better if we add another parallel test
table in the future.

--
Sami

Attachments:

v2-0001-Fix-unstable-log_contains-in-parallel-autovacuum-.patch (application/octet-stream, +6/-28)
#18 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Sami Imseih (#17)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Thu, Apr 9, 2026 at 1:14 PM Sami Imseih <samimseih@gmail.com> wrote:

Optionally, we can change the comment above to something like this:
# Wait until the parallel autovacuum on the table completes and reports the
# number of launched workers, which must correspond to the value specified in
# the reloption.

Made the comment less verbose but in the same spirit as the above.

One way to fix the test is to replace log_contains() with
wait_for_log(). We can also remove wait_for_autovacuum_complete()
logic altogether.

+1. I was going to reply with exactly this. Attached is the fix.

Thank you for the patch! I agree with the overall idea. Since we
enable autovacuum log only the test_autovac table, just checking
autovacuum log works as expected.

I think we can simplify the test further by removing the logic around
the av_count variable.

removed av_count and pg_stat_user_tables query, but hardened the
regexp a bit to ensure that the parallel logging is for the test_autovac
table. It gives the same assurance as counting pg_stat_user_tables
and will be better if we add another parallel test table in the future.

I believe that we don't need to worry about the regexp for this test.
Parallel vacuum would be used on all tables having more than one
index, but we enable the autovacuum logs only on the test_autovac
table.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#19 Sami Imseih
samimseih@gmail.com
In reply to: Masahiko Sawada (#18)
Re: test_autovacuum/001_parallel_autovacuum is broken

I believe that we don't need to worry about the regexp for this test.
Parallel vacuum would be used on all tables having more than one
index, but we enable the autovacuum logs only on the test_autovac
table.

ah, correct.

```
) WITH (autovacuum_parallel_workers = $autovacuum_parallel_workers,
log_autovacuum_min_duration = 0);
```

see v3 with the reverted regexp string.

Thanks!

--
Sami

Attachments:

v3-0001-Fix-unstable-log_contains-in-parallel-autovacuum-.patch (application/octet-stream, +5/-27)

#20 Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#11)
Re: test_autovacuum/001_parallel_autovacuum is broken

On Thu, Apr 09, 2026 at 11:18:45AM -0700, Masahiko Sawada wrote:

It might also be useful if an injection point is activated on the
particular database rather than cluster-wide.

Indeed. One can design a lot of conditions like that. The important
part for me is to find use-cases for them, rather than piling them into
the code for potential "future" uses.
--
Michael