Intermittent buildfarm failures on wrasse

Started by Tom Lane · almost 4 years ago · 88 messages
#1Tom Lane
tgl@sss.pgh.pa.us

For the past five days or so, wrasse has been intermittently
failing due to unexpectedly not using an Index Only Scan plan
in the create_index test [1], eg

@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                      QUERY PLAN                       
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)

The most probable explanation for this seems to be that tenk1's
pg_class.relallvisible value hasn't been set high enough to make an IOS
look cheaper than the alternatives. Where that ought to be getting set
is the "VACUUM ANALYZE tenk1" step in test_setup.sql. It's plausible
I guess that a background autovacuum is preventing that command from
setting relallvisible as high as it ought to be --- but if so, why
are we only seeing two plans changing, on only one animal?

But what I'm really confused about is that this test arrangement has
been stable since early February. Why has wrasse suddenly started
showing a 25% failure rate when it never failed this way before that?
Somebody has to have recently committed a change that affects this.
Checking the commit log up to the onset of the failures on 8 April,
I only see two plausible candidates:

* shared-memory pgstats
* Peter's recent VACUUM changes

Any connection to pgstats is, um, pretty obscure. I'd finger the VACUUM
changes as a more likely trigger except that the last interesting-looking
one was f3c15cbe5 on 3 April, and wrasse got through "make check" 38 times
after that before its first failure of this kind. That doesn't square with
the 25% failure rate since then, so I'm kind of forced to the conclusion
that the pgstats work changed some behavior that it should not have.

Any ideas?

I'm tempted to add something like

SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';

so that we can confirm or refute the theory that relallvisible is
the driving factor.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-08%2003%3A48%3A30
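
For readers following along, the probe above can be widened to show the
raw numbers behind the boolean (illustrative only, not part of the test
suite); roughly speaking, the planner needs visfrac to be close to 1.0
before an index-only scan looks cheap:

SELECT relpages, relallvisible,
       relallvisible::float8 / NULLIF(relpages, 0) AS visfrac
FROM pg_class
WHERE relname = 'tenk1';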

#2Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#1)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm tempted to add something like

SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';

so that we can confirm or refute the theory that relallvisible is
the driving factor.

It would be fairly straightforward to commit a temporary debugging
patch that has the autovacuum logging stuff report directly on how
VACUUM set new_rel_allvisible in pg_class. We should probably be doing
that already, just because it's useful information that is already
close at hand.

Might be a bit trickier to make sure that wrasse reliably reported on
all relevant VACUUMs, since that would have to include manual VACUUMs
(which would really have to use VACUUM VERBOSE), as well as
autovacuums.

--
Peter Geoghegan

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#2)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Apr 13, 2022 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm tempted to add something like
SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';
so that we can confirm or refute the theory that relallvisible is
the driving factor.

It would be fairly straightforward to commit a temporary debugging
patch that has the autovacuum logging stuff report directly on how
VACUUM set new_rel_allvisible in pg_class. We should probably be doing
that already, just because it's useful information that is already
close at hand.

Doesn't look like wrasse has autovacuum logging enabled, though.

After a bit more navel-contemplation I see a way that the pgstats
work could have changed timing in this area. We used to have a
rate limit on how often stats reports would be sent to the
collector, which'd ensure half a second or so delay before a
transaction's change counts became visible to the autovac daemon.
I've not looked at the new code, but I'm betting that that's gone
and the autovac launcher might start a worker nearly immediately
after some foreground process finishes inserting some rows.
So that could result in autovac activity occurring concurrently
with test_setup where it didn't before.

As to what to do about it ... maybe apply the FREEZE and
DISABLE_PAGE_SKIPPING options in test_setup's vacuums?
It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps
not sufficient.

regards, tom lane
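
For concreteness, the suggestion above would amount to something like
this in test_setup.sql (a sketch only; the script currently runs a
plain VACUUM ANALYZE):

VACUUM (ANALYZE, FREEZE, DISABLE_PAGE_SKIPPING) tenk1;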

#4David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#3)
Re: Intermittent buildfarm failures on wrasse

On Thu, 14 Apr 2022 at 10:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a bit more navel-contemplation I see a way that the pgstats
work could have changed timing in this area. We used to have a
rate limit on how often stats reports would be sent to the
collector, which'd ensure half a second or so delay before a
transaction's change counts became visible to the autovac daemon.
I've not looked at the new code, but I'm betting that that's gone
and the autovac launcher might start a worker nearly immediately
after some foreground process finishes inserting some rows.
So that could result in autovac activity occurring concurrently
with test_setup where it didn't before.

It's not quite clear to me why the manual vacuum wouldn't just cancel
the autovacuum and complete the job. I can't quite see how there's
room for competing page locks here. Also, see [1]. One of the
reported failing tests there is the same as one of the failing tests
on wrasse. My investigation for the AIO branch found that
relallvisible was not equal to relpages. I don't recall the reason why
that was happening now.

As to what to do about it ... maybe apply the FREEZE and
DISABLE_PAGE_SKIPPING options in test_setup's vacuums?
It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps
not sufficient.

We should likely try and confirm it's due to relallvisible first.

David

[1]: /messages/by-id/20220224153339.pqn64kseb5gpgl74@alap3.anarazel.de

#5Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#3)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 3:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a bit more navel-contemplation I see a way that the pgstats
work could have changed timing in this area. We used to have a
rate limit on how often stats reports would be sent to the
collector, which'd ensure half a second or so delay before a
transaction's change counts became visible to the autovac daemon.
I've not looked at the new code, but I'm betting that that's gone
and the autovac launcher might start a worker nearly immediately
after some foreground process finishes inserting some rows.
So that could result in autovac activity occurring concurrently
with test_setup where it didn't before.

But why should it matter? The test_setup.sql VACUUM of tenk1 should
leave relallvisible and relpages in the same state, either way (or
very close to it).

The only way that it seems like it could matter is if OldestXmin was
held back during test_setup.sql's execution of the VACUUM command.

As to what to do about it ... maybe apply the FREEZE and
DISABLE_PAGE_SKIPPING options in test_setup's vacuums?
It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps
not sufficient.

BTW, the work on VACUUM for Postgres 15 probably makes VACUUM test
flappiness issues less of a problem -- unless they're issues involving
something holding back OldestXmin when it shouldn't (in which case it
won't have any effect on test stability). I would expect that to be
the case, at least, since VACUUM now does almost all of the same work
for any individual page that it cannot get a cleanup lock on. There is
surprisingly little difference between a page that gets processed by
lazy_scan_prune and a page that gets processed by lazy_scan_noprune.

--
Peter Geoghegan

#6Andres Freund
andres@anarazel.de
In reply to: David Rowley (#4)
Re: Intermittent buildfarm failures on wrasse

Hi,

On April 13, 2022 7:06:33 PM EDT, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 14 Apr 2022 at 10:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a bit more navel-contemplation I see a way that the pgstats
work could have changed timing in this area. We used to have a
rate limit on how often stats reports would be sent to the
collector, which'd ensure half a second or so delay before a
transaction's change counts became visible to the autovac daemon.
I've not looked at the new code, but I'm betting that that's gone
and the autovac launcher might start a worker nearly immediately
after some foreground process finishes inserting some rows.
So that could result in autovac activity occurring concurrently
with test_setup where it didn't before.

It's not quite clear to me why the manual vacuum wouldn't just cancel
the autovacuum and complete the job. I can't quite see how there's
room for competing page locks here. Also, see [1]. One of the
reported failing tests there is the same as one of the failing tests
on wrasse. My investigation for the AIO branch found that
relallvisible was not equal to relpages. I don't recall the reason why
that was happening now.

As to what to do about it ... maybe apply the FREEZE and
DISABLE_PAGE_SKIPPING options in test_setup's vacuums?
It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps
not sufficient.

We should likely try and confirm it's due to relallvisible first.

We had this issue before, and not just on the aio branch. On my phone right now, so won't look up references.

IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible. An independent autovac worker starting is enough for that, for example. Previously the data load and vacuum were further apart, preventing this kind of issue.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#7Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#6)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible.

Sometimes OldestXmin can go backwards in VACUUM operations that are
run in close succession against the same table, due to activity from
other databases in the same cluster (perhaps other factors are
involved at times).

That's why the following assertion that I recently added to
vacuumlazy.c will fail pretty quickly without the
"vacrel->NewRelfrozenXid == OldestXmin" part of its test:

Assert(vacrel->NewRelfrozenXid == OldestXmin ||
       TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
                                     vacrel->relfrozenxid,
                                     vacrel->NewRelfrozenXid));

If you remove "vacrel->NewRelfrozenXid == OldestXmin", and run the
regression tests, the remaining assertion will fail quite easily.
Though perhaps not with a serial "make check".

--
Peter Geoghegan

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#7)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible.

Sometimes OldestXmin can go backwards in VACUUM operations that are
run in close succession against the same table, due to activity from
other databases in the same cluster (perhaps other factors are
involved at times).

I've been doing some testing locally by inserting commands to
manually set tenk1's relallvisible to zero. I first did that
in test_setup.sql ... and it had no effect whatsoever. Further
experimentation showed that the "CREATE INDEX ON tenk1" steps
in create_index.sql itself generally suffice to fix relallvisible;
although if you force it back to zero after the last such command,
you get the same plan diffs wrasse is showing. And you don't
get any others, which I thought curious until I realized that
sanity_check.sql's database-wide VACUUM offers yet another
opportunity to heal the incorrect value. If you force it back
to zero again after that, a bunch of later tests start to show
plan differences, which is what I'd been expecting.

So what seems to be happening on wrasse is that a background
autovacuum (or really autoanalyze?) is preventing pages from
being marked all-visible not only during test_setup.sql but
also create_index.sql; but it's gone by the time sanity_check.sql
runs. Which is odd in itself because not that much time elapses
between create_index and sanity_check, certainly less than the
time from test_setup to create_index.

It seems like a reliable fix might require test_setup to wait
for any background autovac to exit before it does its own
vacuums. Ick.

And we still lack an explanation of why this only now broke.
I remain suspicious that pgstats is behaving unexpectedly.

regards, tom lane
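
The manipulation described above boils down to something like the
following, run against the regression database at various points
between test scripts (illustrative only, superuser required):

UPDATE pg_class SET relallvisible = 0 WHERE relname = 'tenk1';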

#9Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#8)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So what seems to be happening on wrasse is that a background
autovacuum (or really autoanalyze?) is preventing pages from
being marked all-visible not only during test_setup.sql but
also create_index.sql; but it's gone by the time sanity_check.sql
runs.

I agree that it would need to be an autoanalyze (due to the
PROC_IN_VACUUM optimization).

It seems like a reliable fix might require test_setup to wait
for any background autovac to exit before it does its own
vacuums. Ick.

This is hardly a new problem, really. I wonder if it's worth inventing
a comprehensive solution. Some kind of infrastructure that makes
VACUUM establish a next XID up-front (by calling
ReadNextTransactionId()), and then find a way to run with an
OldestXmin that's >= the earlier "next" XID value, waiting if
necessary.

--
Peter Geoghegan

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#9)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It seems like a reliable fix might require test_setup to wait
for any background autovac to exit before it does its own
vacuums. Ick.

This is hardly a new problem, really. I wonder if it's worth inventing
a comprehensive solution.

Yeah, we have band-aided around this type of problem repeatedly.
Making a fix that's readily accessible from any test script
seems like a good idea.

regards, tom lane

#11Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#10)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, we have band-aided around this type of problem repeatedly.
Making a fix that's readily accessible from any test script
seems like a good idea.

We might even be able to consistently rely on this new option, given
*any* problem involving test stability and VACUUM. Having a
one-size-fits-all solution to these kinds of stability problems would
be nice -- no more DISABLE_PAGE_SKIPPING bandaids.

That general approach will be possible provided an inability to
acquire a cleanup lock during VACUUM (which can more or less occur at
random in most environments) doesn't ever lead to unexpected test
results. There is good reason to think that it might work out that
way. Simulating problems with acquiring cleanup locks during VACUUM
left me with the impression that that could really work:

/messages/by-id/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com

--
Peter Geoghegan

#12Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#9)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-13 16:45:44 -0700, Peter Geoghegan wrote:

On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So what seems to be happening on wrasse is that a background
autovacuum (or really autoanalyze?) is preventing pages from
being marked all-visible not only during test_setup.sql but
also create_index.sql; but it's gone by the time sanity_check.sql
runs.

I agree that it would need to be an autoanalyze (due to the
PROC_IN_VACUUM optimization).

That's not a reliable protection - the snapshot is established normally
at first; only after a while do we set PROC_IN_VACUUM...

Greetings,

Andres Freund

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#11)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, we have band-aided around this type of problem repeatedly.
Making a fix that's readily accessible from any test script
seems like a good idea.

We might even be able to consistently rely on this new option, given
*any* problem involving test stability and VACUUM. Having a
one-size-fits-all solution to these kinds of stability problems would
be nice -- no more DISABLE_PAGE_SKIPPING bandaids.

My guess is that you'd need both this new wait-for-horizon behavior
*and* DISABLE_PAGE_SKIPPING. But the two together ought to make
for pretty reproducible behavior. I noticed while scanning the
commit log that some patches have tried adding a FREEZE option,
which seems more like waving a dead chicken than anything that'd
improve stability.

We'd not necessarily have to embed wait-for-horizon into VACUUM
itself. It seems like a SQL-accessible function could be written
and then called before any problematic VACUUM. I like this better
for something we're thinking of jamming in post-feature-freeze;
we'd not be committing to the feature quite as much as if we
added a VACUUM option.

regards, tom lane
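
A hypothetical shape for such a helper as it might appear in a test
script; the function name here is invented for illustration and does
not exist:

SELECT wait_for_removable_cutoff();   -- hypothetical helper function
VACUUM (ANALYZE, DISABLE_PAGE_SKIPPING) tenk1;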

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-13 18:54:06 -0400, Tom Lane wrote:

We used to have a rate limit on how often stats reports would be sent
to the collector, which'd ensure half a second or so delay before a
transaction's change counts became visible to the autovac daemon.

Just for posterity: That's not actually gone. But what is gone is the
rate limiting in autovacuum about requesting recent stats for a table /
autovac seeing slightly older stats.

Greetings,

Andres Freund

#15Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#13)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 5:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

My guess is that you'd need both this new wait-for-horizon behavior
*and* DISABLE_PAGE_SKIPPING. But the two together ought to make
for pretty reproducible behavior. I noticed while scanning the
commit log that some patches have tried adding a FREEZE option,
which seems more like waving a dead chicken than anything that'd
improve stability.

I think that it's more likely that FREEZE will correct problems, out of the two:

* FREEZE forces an aggressive VACUUM whose FreezeLimit is as recent a
cutoff value as possible (FreezeLimit will be equal to OldestXmin).

* DISABLE_PAGE_SKIPPING also forces an aggressive VACUUM. But unlike
FREEZE it makes VACUUM not use the visibility map, even in the case of
all-frozen pages. And it changes nothing about FreezeLimit.

It's also a certainty that VACUUM(FREEZE, DISABLE_PAGE_SKIPPING) is
not a sensible remedy for any problem with test stability, but there
are still some examples of that combination in the regression tests.
The only way it could make sense is if the visibility map might be
corrupt, but surely we're not expecting that anyway (and if we were
we'd be testing it more directly).

I recently argued that DISABLE_PAGE_SKIPPING should have nothing to do
with aggressive vacuuming -- that should all be left up to VACUUM
FREEZE. It seems more logical to make DISABLE_PAGE_SKIPPING mean
"don't use the visibility map to skip anything", without bringing
aggressiveness into it at all. That would be less confusing.

We'd not necessarily have to embed wait-for-horizon into VACUUM
itself. It seems like a SQL-accessible function could be written
and then called before any problematic VACUUM. I like this better
for something we're thinking of jamming in post-feature-freeze;
we'd not be committing to the feature quite as much as if we
added a VACUUM option.

Hmm. I would say that the feature has zero appeal to users anyway.
Maybe it can and should be done through an SQL function for other
reasons, though. Users already think that there are several different
flavors of VACUUM, which isn't really true.

--
Peter Geoghegan

#16Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-13 20:35:50 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, we have band-aided around this type of problem repeatedly.
Making a fix that's readily accessible from any test script
seems like a good idea.

We might even be able to consistently rely on this new option, given
*any* problem involving test stability and VACUUM. Having a
one-size-fits-all solution to these kinds of stability problems would
be nice -- no more DISABLE_PAGE_SKIPPING bandaids.

My guess is that you'd need both this new wait-for-horizon behavior
*and* DISABLE_PAGE_SKIPPING. But the two together ought to make
for pretty reproducible behavior. I noticed while scanning the
commit log that some patches have tried adding a FREEZE option,
which seems more like waving a dead chicken than anything that'd
improve stability.

I think most of those we've ended up replacing by using temp tables in
those tests instead, since they're not affected by the global horizon
anymore.

We'd not necessarily have to embed wait-for-horizon into VACUUM
itself.

I'm not sure it'd be quite reliable outside of vacuum though, due to the
horizon potentially going backwards (in otherwise harmless ways)?

It seems like a SQL-accessible function could be written
and then called before any problematic VACUUM. I like this better
for something we're thinking of jamming in post-feature-freeze;
we'd not be committing to the feature quite as much as if we
added a VACUUM option.

We could otherwise just disable IOS for that query, for now.

Greetings,

Andres Freund

#17Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 6:03 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that it's more likely that FREEZE will correct problems, out of the two:

* FREEZE forces an aggressive VACUUM whose FreezeLimit is as recent a
cutoff value as possible (FreezeLimit will be equal to OldestXmin).

The reason why that might have helped (at least in the past) is that
it's enough to force us to wait for a cleanup lock to prune and
freeze, if necessary. Which was never something that
DISABLE_PAGE_SKIPPING could do.

--
Peter Geoghegan

#18Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#16)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:

I think most of those we've ended up replacing by using temp tables in
those tests instead, since they're not affected by the global horizon
anymore.

Maybe, but it's a pain to have to work that way. You can't do it in
cases like this, because a temp table is not workable. So that's not
an ideal long term solution.

We'd not necessarily have to embed wait-for-horizon into VACUUM
itself.

I'm not sure it'd be quite reliable outside of vacuum though, due to the
horizon potentially going backwards (in otherwise harmless ways)?

I agree, since vacuumlazy.c would need to either be given its own
OldestXmin, or knowledge of a wait-up-to XID. Either way we have to
make non-trivial changes to vacuumlazy.c.

--
Peter Geoghegan

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#16)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On 2022-04-13 20:35:50 -0400, Tom Lane wrote:

It seems like a SQL-accessible function could be written
and then called before any problematic VACUUM. I like this better
for something we're thinking of jamming in post-feature-freeze;
we'd not be committing to the feature quite as much as if we
added a VACUUM option.

We could otherwise just disable IOS for that query, for now.

The entire point of that test case is to verify the shape of the
IOS plan, so no that's not an acceptable answer. But if we're
looking for quick hacks, we could do

update pg_class set relallvisible = relpages where relname = 'tenk1';

just before that test.

I'm still suspicious of the pgstat changes, though. I checked into
things here by doing

initdb
edit postgresql.conf to set log_autovacuum_min_duration = 0
pg_ctl start && make installcheck-parallel

and what I see is that the first reported autovacuum activity begins
exactly one minute after the postmaster starts, which is what I'd
expect given the autovacuum naptime rules. On my machine, of course,
the installcheck-parallel run is long gone by then. But even on the
much slower wrasse, we should be well past create_index by the time any
autovac worker launches --- you can see from wrasse's reported test
runtimes that only about 10 seconds have elapsed when it gets to the end
of create_index.

This suggests to me that what is holding the (presumed) conflicting
snapshot must be the autovac launcher, because what else could it be?
So I'm suspicious that we broke something in that area, though I am
baffled why only wrasse would be telling us so.

regards, tom lane
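
For the record, the logging setting enabled here can also be flipped
without editing postgresql.conf, which is convenient when poking at an
existing installation (requires superuser):

ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();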

#20Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#19)
Re: Intermittent buildfarm failures on wrasse

Hi,

Noah, any chance you could enable log_autovacuum_min_duration=0 on
wrasse?

On 2022-04-13 21:23:12 -0400, Tom Lane wrote:

I'm still suspicious of the pgstat changes, though. I checked into
things here by doing

initdb
edit postgresql.conf to set log_autovacuum_min_duration = 0
pg_ctl start && make installcheck-parallel

and what I see is that the first reported autovacuum activity begins
exactly one minute after the postmaster starts, which is what I'd
expect given the autovacuum naptime rules.

It'd not necessarily have to be autovacuum. A CREATE INDEX or VACUUM
using parallelism could also cause this, I think. It'd be a narrow
window, of course...

Does sparc have wider alignment rules for some types? Perhaps that'd be
enough to put some tables to be sufficiently larger to trigger parallel
vacuum?

Greetings,

Andres Freund
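
To check the size angle directly, a query along these lines shows how
far tenk1's indexes are from the parallel-vacuum cutoff (a sketch,
assuming the relevant threshold is min_parallel_index_scan_size, 512kB
by default, applied per index):

SELECT indexrelid::regclass AS index_name,
       pg_relation_size(indexrelid) AS bytes
FROM pg_index
WHERE indrelid = 'tenk1'::regclass
ORDER BY bytes DESC;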

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#20)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

Noah, any chance you could enable log_autovacuum_min_duration=0 on
wrasse?

+1

Does sparc have wider alignment rules for some types? Perhaps that'd be
enough to put some tables to be sufficiently larger to trigger parallel
vacuum?

No, the configure results on wrasse look pretty ordinary:

checking size of void *... 8
checking size of size_t... 8
checking size of long... 8
checking alignment of short... 2
checking alignment of int... 4
checking alignment of long... 8
checking alignment of double... 8

I wondered for a moment about force_parallel_mode, but wrasse doesn't
appear to be setting that, and in any case I'm pretty sure it only
affects plannable statements.

regards, tom lane

#22Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#20)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote:

Noah, any chance you could enable log_autovacuum_min_duration=0 on
wrasse?

Done. Also forced hourly builds.

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#22)
Re: Intermittent buildfarm failures on wrasse

Noah Misch <noah@leadboat.com> writes:

On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote:

Noah, any chance you could enable log_autovacuum_min_duration=0 on
wrasse?

Done. Also forced hourly builds.

Thanks! We now have two failing runs with the additional info [1][2],
and in both, it's clear that the first autovac worker doesn't launch
until 1 minute after postmaster start, by which time we're long done
with the test scripts of interest. So whatever is breaking this is
not an autovac worker.

I think I'm going to temporarily add a couple of queries to check
what tenk1's relallvisible actually is, just so we can confirm
positively that that's what's causing the plan change. (I'm also
curious about whether the CREATE INDEX steps manage to change it
at all.)

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2013%3A28%3A14
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2004%3A48%3A13

#24Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#23)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 12:01:23 -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote:

Noah, any chance you could enable log_autovacuum_min_duration=0 on
wrasse?

Done. Also forced hourly builds.

Thanks! Can you repro the problem manually on wrasse, perhaps even
outside the buildfarm script? That might be simpler than debugging via
the BF...

Thanks! We now have two failing runs with the additional info [1][2],
and in both, it's clear that the first autovac worker doesn't launch
until 1 minute after postmaster start, by which time we're long done
with the test scripts of interest. So whatever is breaking this is
not an autovac worker.

I did some experiments around that too, and didn't find any related
problems.

For a second I was wondering if it's caused by the time of initdb (which
ends up with a working pgstat snapshot now, but didn't before), but
that's just a few more seconds. While the BF scripts don't show
timestamps for initdb, the previous step's log output confirms that it's
just a few seconds...

I think I'm going to temporarily add a couple of queries to check
what tenk1's relallvisible actually is, just so we can confirm
positively that that's what's causing the plan change. (I'm also
curious about whether the CREATE INDEX steps manage to change it
at all.)

I wonder if we should make VACUUM log the VERBOSE output at DEBUG1
unconditionally. This is like the third bug where we needed that
information, and it's practically impossible to include in regression
output. Then we'd know what the xid horizon is, whether pages were
skipped, etc.

It also just generally seems like a good thing.

Greetings,

Andres Freund

#25Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#24)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:

I wonder if we should make VACUUM log the VERBOSE output at DEBUG1
unconditionally. This is like the third bug where we needed that
information, and it's practically impossible to include in regression
output. Then we'd know what the xid horizon is, whether pages were
skipped, etc.

I like the idea of making VACUUM log the VERBOSE output as a
configurable user-visible feature. We'll then be able to log all
VACUUM statements (not just autovacuum worker VACUUMs).

--
Peter Geoghegan

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#24)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

Thanks! Can you repro the problem manually on wrasse, perhaps even
outside the buildfarm script?

I'm working on that right now, actually...

I wonder if we should make VACUUM log the VERBOSE output at DEBUG1
unconditionally. This is like the third bug where we needed that
information, and it's practically impossible to include in regression
output. Then we'd know what the xid horizon is, whether pages were
skipped, etc.

Right at the moment it seems like we also need visibility into what
CREATE INDEX is doing.

I'm not sure I'd buy into permanent changes here (at least not ones made
in haste), but temporarily adding more logging seems perfectly reasonable.

regards, tom lane

#27Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#26)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 12:26:20 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Thanks! Can you repro the problem manually on wrasse, perhaps even
outside the buildfarm script?

Ah, cool.

I'm working on that right now, actually...

I wonder if we should make VACUUM log the VERBOSE output at DEBUG1
unconditionally. This is like the third bug where we needed that
information, and it's practically impossible to include in regression
output. Then we'd know what the xid horizon is, whether pages were
skipped, etc.

Right at the moment it seems like we also need visibility into what
CREATE INDEX is doing.

I'm not sure I'd buy into permanent changes here (at least not ones made
in haste), but temporarily adding more logging seems perfectly reasonable.

I think it might be worth leaving in, but let's debate that separately?
I'm thinking of something like the attached.

Greetings,

Andres Freund

Attachments:

vacuum-log-verbose-debug1.diff (text/x-diff; charset=us-ascii)

diff --git i/src/backend/access/heap/vacuumlazy.c w/src/backend/access/heap/vacuumlazy.c
index 788db569b2d..eaee473754d 100644
--- i/src/backend/access/heap/vacuumlazy.c
+++ w/src/backend/access/heap/vacuumlazy.c
@@ -186,6 +186,7 @@ typedef struct LVRelState
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
 	bool		verbose;		/* VACUUM VERBOSE? */
+	int			elevel;
 
 	/*
 	 * dead_items stores TIDs whose index tuples are deleted by index
@@ -330,8 +331,24 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	WalUsage	walusage_start = pgWalUsage;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
+	int			elevel;
+
+	if (params->options & VACOPT_VERBOSE)
+	{
+		verbose = true;
+		elevel = INFO;
+	}
+	else if (message_level_is_interesting(DEBUG1))
+	{
+		verbose = true;
+		elevel = DEBUG1;
+	}
+	else
+	{
+		verbose = false;
+		elevel = DEBUG5;
+	}
 
-	verbose = (params->options & VACOPT_VERBOSE) != 0;
 	instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
 							  params->log_min_duration >= 0));
 	if (instrument)
@@ -392,20 +409,20 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->indname = NULL;
 	vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
 	vacrel->verbose = verbose;
+	vacrel->elevel = elevel;
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
+	if (verbose && !IsAutoVacuumWorkerProcess())
 	{
-		Assert(!IsAutoVacuumWorkerProcess());
 		if (aggressive)
-			ereport(INFO,
+			ereport(elevel,
 					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
 							get_database_name(MyDatabaseId),
 							vacrel->relnamespace, vacrel->relname)));
 		else
-			ereport(INFO,
+			ereport(elevel,
 					(errmsg("vacuuming \"%s.%s.%s\"",
 							get_database_name(MyDatabaseId),
 							vacrel->relnamespace, vacrel->relname)));
@@ -788,7 +805,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 (unsigned long long) walusage.wal_bytes);
 			appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
 
-			ereport(verbose ? INFO : LOG,
+			ereport(elevel,
 					(errmsg_internal("%s", buf.data)));
 			pfree(buf.data);
 		}
@@ -2944,7 +2961,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 		vacrel->removed_pages += orig_rel_pages - new_rel_pages;
 		vacrel->rel_pages = new_rel_pages;
 
-		ereport(vacrel->verbose ? INFO : DEBUG2,
+		ereport(vacrel->elevel,
 				(errmsg("table \"%s\": truncated %u to %u pages",
 						vacrel->relname,
 						orig_rel_pages, new_rel_pages)));
@@ -3006,7 +3023,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 			{
 				if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
 				{
-					ereport(vacrel->verbose ? INFO : DEBUG2,
+					ereport(vacrel->elevel,
 							(errmsg("table \"%s\": suspending truncate due to conflicting lock request",
 									vacrel->relname)));
 
#28Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#27)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 9:48 AM Andres Freund <andres@anarazel.de> wrote:

I think it might be worth leaving in, but let's debate that separately?
I'm thinking of something like the attached.

The current convention for the "extra" ereport()s that VACUUM VERBOSE
outputs at INFO elevel is to use DEBUG2 elevel in all other cases
(these extra messages are considered part of VACUUM VERBOSE output,
but are *not* considered part of the autovacuum log output).

It looks like you're changing the elevel convention for these "extra"
messages with this patch. That might be fine, but don't forget about
similar ereports() in vacuumparallel.c. I think that the elevel should
probably remain uniform across all of these messages. Though I don't
particularly care if it's DEBUG2 or DEBUG5.

--
Peter Geoghegan

#29Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#28)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 10:07 AM Peter Geoghegan <pg@bowt.ie> wrote:

It looks like you're changing the elevel convention for these "extra"
messages with this patch. That might be fine, but don't forget about
similar ereports() in vacuumparallel.c. I think that the elevel should
probably remain uniform across all of these messages. Though I don't
particularly care if it's DEBUG2 or DEBUG5.

Also, don't forget to do something here, with the assertion and with
the message:

if (verbose)
{
    /*
     * Aggressiveness already reported earlier, in dedicated
     * VACUUM VERBOSE ereport
     */
    Assert(!params->is_wraparound);
    msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
}
else if (params->is_wraparound)
{

Presumably we will need to report on antiwraparound-ness in the
verbose-debug-elevel-for-autovacuum case (and not allow this assertion
to fail).

--
Peter Geoghegan

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#26)
Re: Intermittent buildfarm failures on wrasse

I wrote:

Andres Freund <andres@anarazel.de> writes:

Thanks! Can you repro the problem manually on wrasse, perhaps even
outside the buildfarm script?

I'm working on that right now, actually...

So far, reproducing it manually has been a miserable failure: I've
run about 180 cycles of the core regression tests with no error.
Not sure what's different between my test scenario and wrasse's.

Meanwhile, wrasse did fail with my relallvisible check in place [1],
and what that shows is that relallvisible is *zero* to start with
and remains so throughout the CREATE INDEX sequence. That pretty
definitively proves that it's not a page-skipping problem but
an xmin-horizon-too-old problem. We're no closer to understanding
where that horizon value is coming from, though.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2019%3A28%3A12
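
For anyone trying to catch the culprit in the act, a query like this (a
sketch; prepared transactions and replication slots can also hold the
horizon back) shows which backends are pinning an old xmin:

SELECT pid, backend_type, state, backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xmin_age DESC;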

#31Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#30)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 2:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Meanwhile, wrasse did fail with my relallvisible check in place [1],
and what that shows is that relallvisible is *zero* to start with
and remains so throughout the CREATE INDEX sequence. That pretty
definitively proves that it's not a page-skipping problem but
an xmin-horizon-too-old problem. We're no closer to understanding
where that horizon value is coming from, though.

Have you looked at the autovacuum log output in more detail? It might
be possible to debug further, but it looks like there are no XIDs to work
off of in the log_line_prefix that's in use on wrasse.

The CITester log_line_prefix is pretty useful -- I wonder if we can
standardize on that within the buildfarm, too.
--
Peter Geoghegan

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#31)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

Have you looked at the autovacuum log output in more detail?

I don't think there's anything to be learned there. The first autovacuum
in wrasse's log happens long after things went south:

2022-04-14 22:49:15.177 CEST [9427:1] LOG: automatic vacuum of table "regression.pg_catalog.pg_type": index scans: 1
pages: 0 removed, 49 remain, 49 scanned (100.00% of total)
tuples: 539 removed, 1112 remain, 0 are dead but not yet removable
removable cutoff: 8915, older by 1 xids when operation ended
index scan needed: 34 pages from table (69.39% of total) had 1107 dead item identifiers removed
index "pg_type_oid_index": pages: 14 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_type_typname_nsp_index": pages: 13 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 193 hits, 0 misses, 0 dirtied
WAL usage: 116 records, 0 full page images, 14113 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

If we captured equivalent output from the manual VACUUM in test_setup,
maybe something could be learned. However, it seems virtually certain
to me that the problematic xmin is in some background process
(eg autovac launcher) and thus wouldn't show up in the postmaster log,
log_line_prefix or no.

regards, tom lane
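
As things stand, capturing that equivalent output means spelling the
manual VACUUM with VERBOSE, whose INFO-level summary would then land in
the test's expected output (presumably why it isn't done); a sketch:

VACUUM (VERBOSE, ANALYZE) tenk1;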

#33Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#32)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 3:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we captured equivalent output from the manual VACUUM in test_setup,
maybe something could be learned. However, it seems virtually certain
to me that the problematic xmin is in some background process
(eg autovac launcher) and thus wouldn't show up in the postmaster log,
log_line_prefix or no.

A bunch of autovacuums that ran between "2022-04-14 22:49:16.274" and
"2022-04-14 22:49:19.088" all have the same "removable cutoff".

The logs from this time show a period of around three seconds
(likely more) where something held back OldestXmin generally.
That does seem a bit fishy to me, even though it happened about a
minute after the failure itself took place.

--
Peter Geoghegan

#34Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#33)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote:

A bunch of autovacuums that ran between "2022-04-14 22:49:16.274" and
"2022-04-14 22:49:19.088" all have the same "removable cutoff".

Are you aware of Andres' commit 02fea8fd? That work prevented exactly
the same set of symptoms (the same index-only scan create_index
regressions), which was apparently necessary following the
rearrangements to the regression tests to remove cross-script
dependencies (Tom's commit cc50080a82).

This was the thread that led to Andres' commit, which was just over a month ago:

/messages/by-id/CAJ7c6TPJNof1Q+vJsy3QebgbPgXdu2ErPvYkBdhD6_Ckv5EZRg@mail.gmail.com

--
Peter Geoghegan

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#34)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

Are you aware of Andres' commit 02fea8fd? That work prevented exactly
the same set of symptoms (the same index-only scan create_index
regressions),

Hm. I'm starting to get the feeling that the real problem here is
we've "optimized" the system to the point where repeatable results
from VACUUM are impossible :-(

regards, tom lane

#36Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#35)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm. I'm starting to get the feeling that the real problem here is
we've "optimized" the system to the point where repeatable results
from VACUUM are impossible :-(

I don't think that there is any fundamental reason why VACUUM cannot
have repeatable results.

Anyway, I suppose it's possible that problems reappeared here due to
some other patch. Something else could have broken Andres' earlier
band aid solution (which was to set synchronous_commit=on in
test_setup).

Is there any patch that could plausibly have had that effect, whose
commit fits with our timeline for the problems seen on wrasse?

--
Peter Geoghegan

#37Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#35)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 21:32:27 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Are you aware of Andres' commit 02fea8fd? That work prevented exactly
the same set of symptoms (the same index-only scan create_index
regressions),

Hm. I'm starting to get the feeling that the real problem here is
we've "optimized" the system to the point where repeatable results
from VACUUM are impossible :-(

The synchronous_commit issue is an old one. It might actually be worth
addressing it by flushing out pending async commits instead. It just
started to be noticeable when tenk1 load and vacuum were moved closer.

What do you think about applying a polished version of what I posted in
/messages/by-id/20220414164830.63rk5zqsvtqqk7qz@alap3.anarazel.de
? That'd tell us a bit more about the horizon etc.

It doesn't have to be the autovacuum launcher. I think it shouldn't even
be taken into account - it's not database connected, and tenk1 isn't a
shared relation. All very odd.

It's also interesting that it only happens in the installcheck cases,
afaics, not the check ones. Although that might just be because there's
more of them...

Greetings,

Andres Freund

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#36)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

Anyway, I suppose it's possible that problems reappeared here due to
some other patch. Something else could have broken Andres' earlier
band aid solution (which was to set synchronous_commit=on in
test_setup).

That band-aid only addressed the situation of someone having turned
off synchronous_commit in the first place; which is not the case
on wrasse or most/all other buildfarm animals. Whatever we're
dealing with here is something independent of that.

Is there any patch that could plausibly have had that effect, whose
commit fits with our timeline for the problems seen on wrasse?

I already enumerated my suspects, back at the top of this thread.

regards, tom lane

#39Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#38)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 6:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That band-aid only addressed the situation of someone having turned
off synchronous_commit in the first place; which is not the case
on wrasse or most/all other buildfarm animals. Whatever we're
dealing with here is something independent of that.

That was the intent, but that in itself doesn't mean that it isn't
something to do with setting hint bits (not the OldestXmin horizon
being held back). I'd really like to know what the removable cutoff
looks like for these VACUUM operations, which is something like
Andres' VACUUM VERBOSE debug patch should tell us.

Is there any patch that could plausibly have had that effect, whose
commit fits with our timeline for the problems seen on wrasse?

I already enumerated my suspects, back at the top of this thread.

Right, but I thought that the syncronous_commit thing was new
information that made that worth revisiting.

--
Peter Geoghegan

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#39)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

That was the intent, but that in itself doesn't mean that it isn't
something to do with setting hint bits (not the OldestXmin horizon
being held back).

Oh! You mean that maybe the OldestXmin horizon was fine, but something
decided not to update hint bits (and therefore also not the all-visible
bit) anyway? Worth investigating I guess.

I'd really like to know what the removable cutoff
looks like for these VACUUM operations, which is something like
Andres' VACUUM VERBOSE debug patch should tell us.

Yeah. I'd hoped to investigate this manually and not have to clutter
the main repo with debugging commits. However, since I've still
utterly failed to reproduce the problem on wrasse's host, I think
we're going to be forced to do that.

regards, tom lane

#41Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#40)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 7:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh! You mean that maybe the OldestXmin horizon was fine, but something
decided not to update hint bits (and therefore also not the all-visible
bit) anyway? Worth investigating I guess.

Yes. That is starting to seem like a plausible alternative explanation.

I'd really like to know what the removable cutoff
looks like for these VACUUM operations, which is something like
Andres' VACUUM VERBOSE debug patch should tell us.

Yeah. I'd hoped to investigate this manually and not have to clutter
the main repo with debugging commits.

Suppose that the bug was actually in 06f5295af6, "Add single-item
cache when looking at topmost XID of a subtrans XID". Doesn't that fit
your timeline just as well?

I haven't really started to investigate that theory (just putting
dinner on here). Just a wild guess at this point.

--
Peter Geoghegan

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#41)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

Suppose that the bug was actually in 06f5295af6, "Add single-item
cache when looking at topmost XID of a subtrans XID". Doesn't that fit
your timeline just as well?

I'd dismissed that on the grounds that there are no subtrans XIDs
involved in tenk1's contents. However, if that patch was faulty
enough, maybe it affected other cases besides the advertised one?
I've not read it.

regards, tom lane

#43Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#37)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 06:52:49PM -0700, Andres Freund wrote:

On 2022-04-14 21:32:27 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Are you aware of Andres' commit 02fea8fd? That work prevented exactly
the same set of symptoms (the same index-only scan create_index
regressions),

Hm. I'm starting to get the feeling that the real problem here is
we've "optimized" the system to the point where repeatable results
from VACUUM are impossible :-(

The synchronous_commit issue is an old one. It might actually be worth
addressing it by flushing out pending async commits instead. It just
started to be noticeable when tenk1 load and vacuum were moved closer.

What do you think about applying a polished version of what I posted in
/messages/by-id/20220414164830.63rk5zqsvtqqk7qz@alap3.anarazel.de
? That'd tell us a bit more about the horizon etc.

No objection.

It's also interesting that it only happens in the installcheck cases,
afaics, not the check ones. Although that might just be because there's
more of them...

I suspect the failure is somehow impossible in "check". Yesterday, I cranked
up the number of locales, so there are now a lot more installcheck. Before
that, each farm run had one "check" and two "installcheck". Those days saw
ten installcheck failures, zero check failures.

Like Tom, I'm failing to reproduce this outside the buildfarm client. I wrote
a shell script to closely resemble the buildfarm installcheck sequence, but
it's lasted a dozen runs without failing.

#44Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#43)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 07:45:15PM -0700, Noah Misch wrote:

On Thu, Apr 14, 2022 at 06:52:49PM -0700, Andres Freund wrote:

On 2022-04-14 21:32:27 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Are you aware of Andres' commit 02fea8fd? That work prevented exactly
the same set of symptoms (the same index-only scan create_index
regressions),

Hm. I'm starting to get the feeling that the real problem here is
we've "optimized" the system to the point where repeatable results
from VACUUM are impossible :-(

The synchronous_commit issue is an old one. It might actually be worth
addressing it by flushing out pending async commits instead. It just
started to be noticeable when tenk1 load and vacuum were moved closer.

What do you think about applying a polished version of what I posted in
/messages/by-id/20220414164830.63rk5zqsvtqqk7qz@alap3.anarazel.de
? That'd tell us a bit more about the horizon etc.

No objection.

It's also interesting that it only happens in the installcheck cases,
afaics, not the check ones. Although that might just be because there's
more of them...

I suspect the failure is somehow impossible in "check". Yesterday, I cranked
up the number of locales, so there are now a lot more installcheck. Before
that, each farm run had one "check" and two "installcheck". Those days saw
ten installcheck failures, zero check failures.

Like Tom, I'm failing to reproduce this outside the buildfarm client. I wrote
a shell script to closely resemble the buildfarm installcheck sequence, but
it's lasted a dozen runs without failing.

But 24s after that email, it did reproduce the problem. Same symptoms as the
last buildfarm runs, including visfrac=0. I'm attaching my script. (It has
various references to my home directory, so it's not self-contained.)

Attachments:

repro-wrasse-vis.sh (application/x-sh)

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#43)
Re: Intermittent buildfarm failures on wrasse

Noah Misch <noah@leadboat.com> writes:

Like Tom, I'm failing to reproduce this outside the buildfarm client.

This is far from the first time that I've failed to reproduce a buildfarm
result manually, even on the very machine hosting the animal. I would
like to identify the cause(s) of that. One obvious theory is that the
environment under a cron job is different --- but the only thing I know
of that should be different is possibly nice'ing the job priorities.
I did try a fair number of test cycles under "nice" in this case.
Anybody have other ideas?

regards, tom lane

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#44)
Re: Intermittent buildfarm failures on wrasse

Noah Misch <noah@leadboat.com> writes:

But 24s after that email, it did reproduce the problem.

Ain't that always the way.

Same symptoms as the
last buildfarm runs, including visfrac=0. I'm attaching my script. (It has
various references to my home directory, so it's not self-contained.)

That's interesting, because I see you used installcheck-parallel,
which I'd not been using on the grounds that wrasse isn't parallelizing
these runs. That puts a big hole in my working assumption that the
problem is one of timing. I'm now suspecting that the key issue is
something about how wrasse is building the executables that I did
not exactly reproduce. I did not try to copy the build details
involving stuff under your home directory (like the private openldap
version), mainly because it didn't seem like openldap or uuid or
perl could be involved at all. But maybe somehow?

regards, tom lane

#47Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#45)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 7:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is far from the first time that I've failed to reproduce a buildfarm
result manually, even on the very machine hosting the animal. I would
like to identify the cause(s) of that. One obvious theory is that the
environment under a cron job is different --- but the only thing I know
of that should be different is possibly nice'ing the job priorities.
I did try a fair number of test cycles under "nice" in this case.
Anybody have other ideas?

Well, Noah is running wrasse with 'fsync = off'. And did so in the
script as well.

That seems like it definitely could matter.

--
Peter Geoghegan

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#47)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

Well, Noah is running wrasse with 'fsync = off'. And did so in the
script as well.

As am I. I duplicated wrasse's config to the extent of

cat >>$PGDATA/postgresql.conf <<EOF
log_line_prefix = '%m [%p:%l] %q%a '
log_connections = 'true'
log_disconnections = 'true'
log_statement = 'all'
fsync = off
log_autovacuum_min_duration = 0
EOF

One thing I'm eyeing now is that it looks like Noah is re-initdb'ing
each time, whereas I'd just stopped and started the postmaster of
an existing installation. That does not seem like it could matter
but ...

regards, tom lane

#49Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#46)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 11:06:04PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

But 24s after that email, it did reproduce the problem.

Ain't that always the way.

Quite so.

Same symptoms as the
last buildfarm runs, including visfrac=0. I'm attaching my script. (It has
various references to my home directory, so it's not self-contained.)

That's interesting, because I see you used installcheck-parallel,
which I'd not been using on the grounds that wrasse isn't parallelizing
these runs. That puts a big hole in my working assumption that the
problem is one of timing.

With "make installcheck-tests TESTS='test_setup create_index'" it remains
reproducible. The attached script reproduced it in 103s and then in 703s.

I'm now suspecting that the key issue is
something about how wrasse is building the executables that I did
not exactly reproduce. I did not try to copy the build details
involving stuff under your home directory (like the private openldap
version), mainly because it didn't seem like openldap or uuid or
perl could be involved at all. But maybe somehow?

Can't rule it out entirely. I think I've now put world-read on all the
directories referenced in the script, in the event you'd like to use them.

Attachments:

repro-wrasse-vis.sh (application/x-sh)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#48)
Re: Intermittent buildfarm failures on wrasse

I wrote:

One thing I'm eyeing now is that it looks like Noah is re-initdb'ing
each time, whereas I'd just stopped and started the postmaster of
an existing installation. That does not seem like it could matter
but ...

Well, damn. I changed my script that way and it failed on the tenth
iteration (versus a couple hundred successful iterations the other
way). So somehow this is related to time-since-initdb, not
time-since-postmaster-start. Any ideas?

Anyway, I'm too tired to do more tonight, but now that I can reproduce it
I will stick some debugging logic in tomorrow. I no longer think we
should clutter the git repo with any more short-term hacks.

regards, tom lane

PS a bit later: I've not yet reproduced it a second time, so the
failure rate is unfortunately a lot less than one-in-ten. Still,
this eliminates the idea that there's some secret sauce in Noah's
build details.

#51Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#42)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 22:40:51 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Suppose that the bug was actually in 06f5295af6, "Add single-item
cache when looking at topmost XID of a subtrans XID". Doesn't that fit
your timeline just as well?

I'd dismissed that on the grounds that there are no subtrans XIDs
involved in tenk1's contents. However, if that patch was faulty
enough, maybe it affected other cases besides the advertised one?
I've not read it.

I was planning to complain about that commit, fwiw. Without so much as
an assertion verifying the cache is correct it seems quite dangerous to
me.

And looking at it, it has obvious wraparound issues... But that can't
matter here, obviously.

We also reach SubTransGetTopmostTransaction() from XactLockTableWait()
but I don't quite see how we reach that here either...

Greetings,

Andres Freund

#52Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#50)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 11:56:15PM -0400, Tom Lane wrote:

Anyway, I'm too tired to do more tonight, but now that I can reproduce it
I will stick some debugging logic in tomorrow. I no longer think we
should clutter the git repo with any more short-term hacks.

Sounds good. I've turned off the wrasse buildfarm client for the moment.
It's less useful than your local setup, and they'd compete for resources.

#53Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#50)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 23:56:15 -0400, Tom Lane wrote:

I wrote:

One thing I'm eyeing now is that it looks like Noah is re-initdb'ing
each time, whereas I'd just stopped and started the postmaster of
an existing installation. That does not seem like it could matter
but ...

Well, damn. I changed my script that way and it failed on the tenth
iteration (versus a couple hundred successful iterations the other
way).

Just to make sure: This is also on wrasse?

What DSM backend do we end up with on solaris? With shared memory stats
we're using DSM a lot earlier and more commonly than before.
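
(Easy enough to check from psql, since it's just a GUC:

    SHOW dynamic_shared_memory_type;

posix, sysv, or mmap being the usual answers.)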

Another thing that might be worth trying is to enable checksums. I've
caught weird bugs with that in the past. And it's possible that bgwriter
writes out a page that we then read back in quickly after, or something
like that.

So somehow this is related to time-since-initdb, not
time-since-postmaster-start. Any ideas?

Perhaps it makes a difference that we start with a "young" database xid
age wise? We've had bugs around subtracting xids and ending up on some
special one in the past.

Greetings,

Andres Freund

#54Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#43)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-14 19:45:15 -0700, Noah Misch wrote:

I suspect the failure is somehow impossible in "check". Yesterday, I cranked
up the number of locales, so there are now a lot more installcheck. Before
that, each farm run had one "check" and two "installcheck". Those days saw
ten installcheck failures, zero check failures.

I notice that the buildfarm appears to run initdb with syncing enabled
("syncing data to disk ... ok" in the initdb steps). Whereas pg_regress
uses --no-sync.

I wonder if that's what makes the difference? Now that you reproduced
it, does it still reproduce with --no-sync added?

Also worth noting that pg_regress doesn't go through pg_ctl...

Greetings,

Andres Freund

#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#53)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On 2022-04-14 23:56:15 -0400, Tom Lane wrote:

Well, damn. I changed my script that way and it failed on the tenth
iteration (versus a couple hundred successful iterations the other
way).

Just to make sure: This is also on wrasse?

Right, gcc211 with a moderately close approximation to wrasse's
build details. Why that shows the problem when we've not seen
it elsewhere remains to be seen.

What DSM backend do we end up with on solaris? With shared memory stats
we're using DSM a lot earlier and more commonly than before.

That ... is an interesting point. It seems to be just "posix" though.

So somehow this is related to time-since-initdb, not
time-since-postmaster-start. Any ideas?

Perhaps it makes a difference that we start with a "young" database xid
age wise? We've had bugs around subtracting xids and ending up on some
special one in the past.

It does seem like it's got to be related to small XID and/or small
LSN values. No clue right now, but more news tomorrow, I hope.

regards, tom lane

#56Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#54)
Re: Intermittent buildfarm failures on wrasse

On Thu, Apr 14, 2022 at 10:12:05PM -0700, Andres Freund wrote:

On 2022-04-14 19:45:15 -0700, Noah Misch wrote:

I suspect the failure is somehow impossible in "check". Yesterday, I cranked
up the number of locales, so there are now a lot more installcheck. Before
that, each farm run had one "check" and two "installcheck". Those days saw
ten installcheck failures, zero check failures.

I notice that the buildfarm appears to run initdb with syncing enabled
("syncing data to disk ... ok" in the initdb steps). Whereas pg_regress
uses --no-sync.

Yep.

I wonder if that's what makes the difference? Now that you reproduced
it, does it still reproduce with --no-sync added?

It does; the last version of my script used "initdb -N ...".

Also worth noting that pg_regress doesn't go through pg_ctl...

Hmmm.

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#56)
Re: Intermittent buildfarm failures on wrasse

The morning's first result is that during a failing run,
the vacuum in test_setup sees

2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1"
2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 345 remain, 345 scanned (100.00% of total)
tuples: 0 removed, 10000 remain, 0 are dead but not yet removable
removable cutoff: 724, older by 26 xids when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s
buffer usage: 695 hits, 2 misses, 2 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;

OldestXmin = 724 is too old to consider tenk1's contents as all-visible:

regression=# select distinct xmin from tenk1;
xmin
------
749
(1 row)

In fact, right after initdb pg_controldata shows
Latest checkpoint's NextXID: 0:724
Latest checkpoint's oldestXID: 716

So there's no longer any doubt that something is holding back OldestXmin.
I will go put some instrumentation into the code that's computing that.
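
(Independent of that instrumentation, the advertised per-backend xmins
can be watched from SQL with something along these lines:

    SELECT pid, backend_type, backend_xmin, age(backend_xmin) AS xmin_age
      FROM pg_stat_activity
     WHERE backend_xmin IS NOT NULL
     ORDER BY age(backend_xmin) DESC;

though a horizon holder as short-lived as this one is easy to miss
unless you poll at just the right moment.)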

regards, tom lane

#58Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#57)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-15 10:15:32 -0400, Tom Lane wrote:

The morning's first result is that during a failing run,
the vacuum in test_setup sees

2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1"
2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 345 remain, 345 scanned (100.00% of total)
tuples: 0 removed, 10000 remain, 0 are dead but not yet removable
removable cutoff: 724, older by 26 xids when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s
buffer usage: 695 hits, 2 misses, 2 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;

The horizon advancing by 26 xids during tenk1's vacuum seems like quite
a bit, given there's no normal concurrent activity during test_setup.

In fact, right after initdb pg_controldata shows
Latest checkpoint's NextXID: 0:724
Latest checkpoint's oldestXID: 716

So that's the xmin that e.g. the autovac launcher ends up with during
start...

If I make get_database_list() sleep for 5s within the scan, I can
reproduce on x86-64.

Greetings,

Andres Freund

#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#57)
Re: Intermittent buildfarm failures on wrasse

I wrote:

the vacuum in test_setup sees
...
removable cutoff: 724, older by 26 xids when operation ended
...

BTW, before I forget: the wording of this log message is just awful.
On first sight, I thought that it meant that we'd computed OldestXmin
a second time and discovered that it advanced by 26 xids while the VACUUM
was running. Looking at the code, I see that's not so:

            diff = (int32) (ReadNextTransactionId() - OldestXmin);
            appendStringInfo(&buf,
                             _("removable cutoff: %u, older by %d xids when operation ended\n"),
                             OldestXmin, diff);

but good luck understanding what it actually means from the message
text alone. I think more appropriate wording would be something like

"removable cutoff: %u, which was %d xids old when operation ended\n"

Also, is it really our practice to spell XID in lower-case in
user-facing messages?

Thoughts, better ideas?

regards, tom lane

#60Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#58)
Re: Intermittent buildfarm failures on wrasse

Hi,

On April 15, 2022 11:12:10 AM EDT, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-04-15 10:15:32 -0400, Tom Lane wrote:

The morning's first result is that during a failing run,
the vacuum in test_setup sees

2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1"
2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;
2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 345 remain, 345 scanned (100.00% of total)
tuples: 0 removed, 10000 remain, 0 are dead but not yet removable
removable cutoff: 724, older by 26 xids when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s
buffer usage: 695 hits, 2 misses, 2 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1;

The horizon advancing by 26 xids during tenk1's vacuum seems like quite
a bit, given there's no normal concurrent activity during test_setup.

In fact, right after initdb pg_controldata shows
Latest checkpoint's NextXID: 0:724
Latest checkpoint's oldestXID: 716

So that's the xmin that e.g. the autovac launcher ends up with during
start...

If I make get_database_list() sleep for 5s within the scan, I can
reproduce on x86-64.

Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated...

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#57)
Re: Intermittent buildfarm failures on wrasse

I wrote:

So there's no longer any doubt that something is holding back OldestXmin.
I will go put some instrumentation into the code that's computing that.

The something is the logical replication launcher. In the failing runs,
it is advertising xmin = 724 (the post-initdb NextXID) and continues to
do so well past the point where tenk1 gets vacuumed.

Discuss.

regards, tom lane

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#60)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated...

Yeah. I think this bit in ComputeXidHorizons needs rethinking:

/*
* Normally queries in other databases are ignored for anything but
* the shared horizon. ...
*/
if (in_recovery ||
MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId ||
proc->databaseId == 0) /* always include WalSender */
{

The "proc->databaseId == 0" business apparently means to include only
walsender processes, and it's broken because that condition doesn't
include only walsender processes.

At this point we have the following conclusions:

1. A slow transaction in the launcher's initial get_database_list()
call fully explains these failures. (I had been thinking that the
launcher's xact would have to persist as far as the create_index
script, but that's not so: it only has to last until test_setup
begins vacuuming tenk1. The CREATE INDEX steps are not doing any
visibility map changes of their own, but what they are doing is
updating relallvisible from the results of visibilitymap_count().
That's why they undid the effects of manually poking relallvisible,
without actually inserting any data better than what the initial
VACUUM computed.)

2. We can probably explain why only wrasse sees this as some quirk
of the Solaris scheduler. I'm satisfied to blame it-happens-in-
installcheck-but-not-check on that too.

3. It remains unclear why we suddenly started seeing this last week.
I suppose it has to be a side-effect of the pgstats changes, but
the mechanism is obscure. Probably not worth the effort to pin
down exactly why.

As for fixing it, what I think would be the preferable answer is to
fix the above-quoted logic so that it indeed includes only walsenders
and not random other background workers. (Why does it need to include
walsenders, anyway? The commentary sucks.) Alternatively, or perhaps
also, we could do what was discussed previously and make a hack to
allow delaying vacuum until the system is quiescent.

regards, tom lane

#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#58)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On 2022-04-15 10:15:32 -0400, Tom Lane wrote:

removable cutoff: 724, older by 26 xids when operation ended

The horizon advancing by 26 xids during tenk1's vacuum seems like quite
a bit, given there's no normal concurrent activity during test_setup.

Hah, so you were taken in by this wording too. See my complaint
about it downthread.

regards, tom lane

#64Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#61)
Re: Intermittent buildfarm failures on wrasse

Hi,

On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

So there's no longer any doubt that something is holding back OldestXmin.
I will go put some instrumentation into the code that's computing that.

The something is the logical replication launcher. In the failing runs,
it is advertising xmin = 724 (the post-initdb NextXID) and continues to
do so well past the point where tenk1 gets vacuumed.

Discuss.

That explains it. Before shmstat autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached.

We can't just ignore database less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database less PGPROC has to be accounted in the horizon for non-shared rels.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

In reply to: Tom Lane (#59)
Re: Intermittent buildfarm failures on wrasse

On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, before I forget: the wording of this log message is just awful.
On first sight, I thought that it meant that we'd computed OldestXmin
a second time and discovered that it advanced by 26 xids while the VACUUM
was running.

"removable cutoff: %u, which was %d xids old when operation ended\n"

How the output appears when placed right before the output describing
how VACUUM advanced relfrozenxid is an important consideration. I want
the format and wording that we use to imply a relationship between
these two things. Right now, that other line looks like this:

"new relfrozenxid: %u, which is %d xids ahead of previous value\n"

Do you think that this juxtaposition works well?

Also, is it really our practice to spell XID in lower-case in
user-facing messages?

There are examples of both. This could easily be changed to "XIDs".

--
Peter Geoghegan

#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#64)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The something is the logical replication launcher. In the failing runs,
it is advertising xmin = 724 (the post-initdb NextXID) and continues to
do so well past the point where tenk1 gets vacuumed.

That explains it. Before shmstat autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached.

Um, this is the logical replication launcher, not the autovac launcher.
Your observation that a sleep in get_database_list() reproduces it
confirms that, and I don't entirely see why the timing of the LR launcher
would have changed.

(On thinking about it, I suppose the AV launcher might trigger this
too, but that is not the PID I saw in testing.)

We can't just ignore database less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database less PGPROC has to be accounted in the horizon for non-shared rels.

Yeah, I was also thinking about a flag in PGPROC being a more reliable
way to do this. Is there anything besides walsenders that should set
that flag?

regards, tom lane

#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#65)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, before I forget: the wording of this log message is just awful.
[ so how about ]
"removable cutoff: %u, which was %d xids old when operation ended\n"

How the output appears when placed right before the output describing
how VACUUM advanced relfrozenxid is an important consideration. I want
the format and wording that we use to imply a relationship between
these two things. Right now, that other line looks like this:

"new relfrozenxid: %u, which is %d xids ahead of previous value\n"

Do you think that this juxtaposition works well?

Seems all right to me; do you have a better suggestion?

regards, tom lane

In reply to: Tom Lane (#67)
Re: Intermittent buildfarm failures on wrasse

On Fri, Apr 15, 2022 at 9:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Do you think that this juxtaposition works well?

Seems all right to me; do you have a better suggestion?

No. At first I thought that mixing "which is" and "which was" wasn't
quite right. I changed my mind, though. Your new wording is fine.

I'll update the log output some time today.

--
Peter Geoghegan

#69Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#66)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-15 12:36:52 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The something is the logical replication launcher. In the failing runs,
it is advertising xmin = 724 (the post-initdb NextXID) and continues to
do so well past the point where tenk1 gets vacuumed.

That explains it. Before shmstat autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached.

Um, this is the logical replication launcher, not the autovac
launcher.

Short term confusion...

Your observation that a sleep in get_database_list() reproduces it
confirms that

I don't understand what you mean here? get_database_list() is autovac
launcher code? So being able to reproduce the problem by putting in a
sleep there doesn't seem to confirm anything about the logical rep
launcher?

, and I don't entirely see why the timing of the LR launcher
would have changed.

Could still be related to the autovac launcher not requesting / pgstats
not writing / launcher not reading the stats file(s). That obviously is
going to have some scheduler impact.

We can't just ignore database less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database less PGPROC has to be accounted in the horizon for non-shared rels.

Yeah, I was also thinking about a flag in PGPROC being a more reliable
way to do this. Is there anything besides walsenders that should set
that flag?

Not that I can think of. It's only because of hs_feedback that we need
to. I guess it's possible that somebody build some extension that needs
something similar, but then they'd need to set that flag...

Greetings,

Andres Freund

#70Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#66)
Re: Intermittent buildfarm failures on wrasse

Hi,

(Sent again, somehow my editor started to sometimes screw up mail
headers, and ate the From:, sorry for the duplicate)

On 2022-04-15 12:36:52 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The something is the logical replication launcher. In the failing runs,
it is advertising xmin = 724 (the post-initdb NextXID) and continues to
do so well past the point where tenk1 gets vacuumed.

That explains it. Before shmstat autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached.

Um, this is the logical replication launcher, not the autovac
launcher.

Short term confusion...

Your observation that a sleep in get_database_list() reproduces it
confirms that

I don't understand what you mean here? get_database_list() is autovac
launcher code? So being able to reproduce the problem by putting in a
sleep there doesn't seem to confirm anything about the logical rep
launcher?

, and I don't entirely see why the timing of the LR launcher
would have changed.

Could still be related to the autovac launcher not requesting / pgstats
not writing / launcher not reading the stats file(s). That obviously is
going to have some scheduler impact.

We can't just ignore database less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database less PGPROC has to be accounted in the horizon for non-shared rels.

Yeah, I was also thinking about a flag in PGPROC being a more reliable
way to do this. Is there anything besides walsenders that should set
that flag?

Not that I can think of. It's only because of hs_feedback that we need
to. I guess it's possible that somebody build some extension that needs
something similar, but then they'd need to set that flag...

Greetings,

Andres Freund

#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#66)
Re: Intermittent buildfarm failures on wrasse

I wrote:

Um, this is the logical replication launcher, not the autovac launcher.
Your observation that a sleep in get_database_list() reproduces it
confirms that, and I don't entirely see why the timing of the LR launcher
would have changed.

Oh, to clarify: I misread "get_database_list()" as
"get_subscription_list()", which is the part of the LR launcher startup
that causes the problem for me. So what we actually have confirmed is
that BOTH of those launchers are problematic for this. And AFAICS
neither of them needs to be causing horizon adjustments for non-shared
tables.

(It's possible that the AV launcher is responsible in some of wrasse's
reports, but it's been the LR launcher in five out of five
sufficiently-instrumented failures for me.)

regards, tom lane

#72Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#65)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-15 09:29:20 -0700, Peter Geoghegan wrote:

On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, before I forget: the wording of this log message is just awful.
On first sight, I thought that it meant that we'd computed OldestXmin
a second time and discovered that it advanced by 26 xids while the VACUUM
was running.

"removable cutoff: %u, which was %d xids old when operation ended\n"

How the output appears when placed right before the output describing
how VACUUM advanced relfrozenxid is an important consideration. I want
the format and wording that we use to imply a relationship between
these two things. Right now, that other line looks like this:

"new relfrozenxid: %u, which is %d xids ahead of previous value\n"

Do you think that this juxtaposition works well?

I don't think they're actually that comparable. One shows how much
relfrozenxid advanced, to a large degree influenced by the time between
aggressive (or "unintentionally aggressive") vacuums. The other shows
the age of OldestXmin at the end of the vacuum. Which is influenced by
what's currently running.

Greetings,

Andres Freund

In reply to: Andres Freund (#72)
Re: Intermittent buildfarm failures on wrasse

On Fri, Apr 15, 2022 at 10:05 AM Andres Freund <andres@anarazel.de> wrote:

I don't think they're actually that comparable. One shows how much
relfrozenxid advanced, to a large degree influenced by the time between
aggressive (or "unintentionally aggressive") vacuums.

It matters more in the extreme cases. The most recent possible value
for our new relfrozenxid is OldestXmin/removable cutoff. So when
something holds back OldestXmin, it also holds back new relfrozenxid
values.

The other shows
the age of OldestXmin at the end of the vacuum. Which is influenced by
what's currently running.

As well as the age of OldestXmin at the start of VACUUM.

--
Peter Geoghegan

#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#73)
Re: Intermittent buildfarm failures on wrasse

Peter Geoghegan <pg@bowt.ie> writes:

On Fri, Apr 15, 2022 at 10:05 AM Andres Freund <andres@anarazel.de> wrote:

The other shows
the age of OldestXmin at the end of the vacuum. Which is influenced by
what's currently running.

As well as the age of OldestXmin at the start of VACUUM.

Is it worth capturing and logging both of those numbers? Why is
the age at the end more interesting than the age at the start?

regards, tom lane

In reply to: Tom Lane (#74)
Re: Intermittent buildfarm failures on wrasse

On Fri, Apr 15, 2022 at 10:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As well as the age of OldestXmin at the start of VACUUM.

Is it worth capturing and logging both of those numbers? Why is
the age at the end more interesting than the age at the start?

As Andres said, that's often more interesting because most of the time
OldestXmin is not held back by much (not enough to matter).

Users will often look at the output of successive related VACUUM
operations. Often the way things change over time is much more
interesting than the details at any particular point in time.
Especially in the kinds of extreme cases I'm thinking about.

--
Peter Geoghegan

#76Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#75)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-15 10:23:56 -0700, Peter Geoghegan wrote:

On Fri, Apr 15, 2022 at 10:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As well as the age of OldestXmin at the start of VACUUM.

Is it worth capturing and logging both of those numbers? Why is
the age at the end more interesting than the age at the start?

As Andres said, that's often more interesting because most of the time
OldestXmin is not held back by much (not enough to matter).

I think it'd be interesting - particularly for large relations or when
looking to adjust autovac cost limits. It's not rare for autovac to take
long enough that another autovac is necessary immediately again. Also
helps to interpret the "dead but not yet removable" counts.

Something like:
removable cutoff: %u, age at start: %u, age at end: %u...

Greetings,

Andres Freund

In reply to: Andres Freund (#76)
Re: Intermittent buildfarm failures on wrasse

On Fri, Apr 15, 2022 at 10:43 AM Andres Freund <andres@anarazel.de> wrote:

I think it'd be interesting - particularly for large relations or when
looking to adjust autovac cost limits.

Something like:
removable cutoff: %u, age at start: %u, age at end: %u...

Part of the problem here is that we determine VACUUM's FreezeLimit by
calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]).
Why should we do less freezing due to the presence of an old snapshot?
Sure, that has to happen with those XIDs that are fundamentally
ineligible for freezing due to the presence of the old snapshot -- but
what about those XIDs that *are* eligible, and still don't get frozen
at first?

We should determine FreezeLimit by calculating `NextXID -
vacuum_freeze_min_age ` instead (and then clamp, to make sure that
it's always <= OldestXmin). That approach would make our final
FreezeLimit "strictly age-based".

[1]: We do something a bit like this when OldestXmin is already very
old -- then FreezeLimit is the same value as OldestXmin (see WARNING
from vacuum_set_xid_limits() function). That's better than nothing,
but doesn't change the fact that our general approach to calculating
FreezeLimit makes little sense.

--
Peter Geoghegan

#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#70)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On 2022-04-15 12:36:52 -0400, Tom Lane wrote:

Yeah, I was also thinking about a flag in PGPROC being a more reliable
way to do this. Is there anything besides walsenders that should set
that flag?

Not that I can think of. It's only because of hs_feedback that we need
to. I guess it's possible that somebody build some extension that needs
something similar, but then they'd need to set that flag...

Here's a WIP patch for that. The only exciting thing in it is that
because of some undocumented cowboy programming in walsender.c, the
Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
in ProcArrayInstallRestoredXmin fires unless we skip that.

I could use some help filling in the XXX comments, because it's far
from clear to me *why* walsenders need this to happen.

regards, tom lane

Attachments:

cleaner-walsender-handling-wip.patch (text/x-diff)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index be40261393..a04a4a8e5d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -285,6 +285,15 @@ InitWalSender(void)
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 
+	/*
+	 * Also, mark ourselves in PGPROC as affecting vacuum horizons in all
+	 * databases.  XXX explain why
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	MyProc->statusFlags |= PROC_AFFECTS_ALL_HORIZONS;
+	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
+	LWLockRelease(ProcArrayLock);
+
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index f6e98aae29..8b867bd56c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1821,10 +1821,12 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * to prune still needed data away). If the current backend never
 		 * connects to a database that is harmless, because
 		 * data_oldest_nonremovable will never be utilized.
+		 *
+		 * XXX explain PROC_AFFECTS_ALL_HORIZONS, too
 		 */
 		if (in_recovery ||
 			MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId ||
-			proc->databaseId == 0)	/* always include WalSender */
+			(statusFlags & PROC_AFFECTS_ALL_HORIZONS))
 		{
 			/*
 			 * We can ignore this backend if it's running CREATE INDEX
@@ -2681,10 +2683,14 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		/* Install xmin */
 		MyProc->xmin = TransactionXmin = xmin;
 
-		/* Flags being copied must be valid copy-able flags. */
-		Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-		MyProc->statusFlags = proc->statusFlags;
-		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
+		/* walsender cheats by passing proc == MyProc, don't check flags */
+		if (proc != MyProc)
+		{
+			/* Flags being copied must be valid copy-able flags. */
+			Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
+			MyProc->statusFlags = proc->statusFlags;
+			ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
+		}
 
 		result = true;
 	}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 809b74db5e..e132665938 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -60,6 +60,9 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
+#define		PROC_AFFECTS_ALL_HORIZONS	0x20	/* proc's xmin must be
+												 * included in vacuum horizons
+												 * in all databases */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
#79Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#78)
Re: Intermittent buildfarm failures on wrasse

Hi,

On April 15, 2022 2:14:47 PM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2022-04-15 12:36:52 -0400, Tom Lane wrote:

Yeah, I was also thinking about a flag in PGPROC being a more reliable
way to do this. Is there anything besides walsenders that should set
that flag?

Not that I can think of. It's only because of hs_feedback that we need
to. I guess it's possible that somebody build some extension that needs
something similar, but then they'd need to set that flag...

Here's a WIP patch for that. The only exciting thing in it is that
because of some undocumented cowboy programming in walsender.c, the
Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
in ProcArrayInstallRestoredXmin fires unless we skip that.

I could use some help filling in the XXX comments, because it's far
from clear to me *why* walsenders need this to happen.

I'm out for the rest of the day due to family events (visiting my girlfriend's parents till Wednesday); I can take a stab at formulating something afterwards.

If you want to commit before that: The reason is that walsenders use their xmin to represent the xmin of standbys when using hot_standby_feedback. Since we're only transmitting global horizons up from standbys, it has to influence the horizon globally (and it would be hard to represent per-db horizons anyway).
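
(On a primary, that standby-driven xmin shows up per walsender, e.g.

    SELECT application_name, state, backend_xmin FROM pg_stat_replication;

where backend_xmin is only populated while hot_standby_feedback is in
effect.)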

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#79)
Re: Intermittent buildfarm failures on wrasse

Andres Freund <andres@anarazel.de> writes:

On April 15, 2022 2:14:47 PM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I could use some help filling in the XXX comments, because it's far
from clear to me *why* walsenders need this to happen.

If you want to commit before that: The reason is that walsenders use their xmin to represent the xmin of standbys when using hot_standby_feedback. Since we're only transmitting global horizons up from standbys, it has to influence the horizon globally (and it would be hard to represent per-db horizons anyway).

Got it. I rewrote the comments and pushed. Noah, it should be safe
to turn wrasse back on.

regards, tom lane

#81Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#77)
Re: Intermittent buildfarm failures on wrasse

Hi,

On 2022-04-15 11:12:34 -0700, Peter Geoghegan wrote:

On Fri, Apr 15, 2022 at 10:43 AM Andres Freund <andres@anarazel.de> wrote:

I think it'd be interesting - particularly for large relations or when
looking to adjust autovac cost limits.

Something like:
removable cutoff: %u, age at start: %u, age at end: %u...

Part of the problem here is that we determine VACUUM's FreezeLimit by
calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]).

What the message outputs is OldestXmin and not FreezeLimit though. And
FreezeLimit doesn't affect "dead but not yet removable".

IOW, the following might be right, but that seems independent of
improving the output of
diff = (int32) (ReadNextTransactionId() - OldestXmin);
appendStringInfo(&buf,
_("removable cutoff: %u, which was %d XIDs old when operation ended\n"),
OldestXmin, diff);

Why should we do less freezing due to the presence of an old snapshot?
Sure, that has to happen with those XIDs that are fundamentally
ineligible for freezing due to the presence of the old snapshot -- but
what about those XIDs that *are* eligible, and still don't get frozen
at first?

We should determine FreezeLimit by calculating `NextXID -
vacuum_freeze_min_age ` instead (and then clamp, to make sure that
it's always <= OldestXmin). That approach would make our final
FreezeLimit "strictly age-based".

[1] We do something a bit like this when OldestXmin is already very
old -- then FreezeLimit is the same value as OldestXmin (see WARNING
from vacuum_set_xid_limits() function). That's better than nothing,
but doesn't change the fact that our general approach to calculating
FreezeLimit makes little sense.

Greetings,

Andres Freund

In reply to: Andres Freund (#81)
Re: Intermittent buildfarm failures on wrasse

On Sun, Apr 17, 2022 at 7:36 AM Andres Freund <andres@anarazel.de> wrote:

Part of the problem here is that we determine VACUUM's FreezeLimit by
calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]).

What the message outputs is OldestXmin and not FreezeLimit though.

My higher level point is that there is a general tendency to assume
that OldestXmin is the same thing as NextXID, which it isn't. It's an
easy enough mistake to make, though, in part because they're usually
quite close together. The "Routine Vacuuming" docs seem to suggest
that they're the same thing, or at least that's what I take away from
the following sentence:

"This implies that if a table is not otherwise vacuumed, autovacuum
will be invoked on it approximately once every
autovacuum_freeze_max_age minus vacuum_freeze_min_age transactions".

And FreezeLimit doesn't affect "dead but not yet removable".

But OldestXmin affects FreezeLimit.

Anyway, I'm not opposed to showing the age at the start as well. But
from the point of view of issues like this tenk1 issue, it would be
more useful to just report on new_rel_allvisible. It would also be
more useful to users.
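
(Until the log reports it, that can at least be checked after the fact;
a sketch, assuming the contrib pg_visibility extension is available:

    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    SELECT * FROM pg_visibility_map_summary('tenk1');

which counts the all-visible and all-frozen bits straight from the
visibility map rather than relying on the value cached in pg_class.)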

--
Peter Geoghegan

#83Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#78)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

On 2022-Apr-15, Tom Lane wrote:

Here's a WIP patch for that. The only exciting thing in it is that
because of some undocumented cowboy programming in walsender.c, the
Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
in ProcArrayInstallRestoredXmin fires unless we skip that.

Hmm, maybe a better use of that define is to use it to select which flags
to copy, rather than to ensure they are the only ones set. What
about this?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"¿Qué importan los años? Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo" (Mafalda)

Attachments:

copyflags.patch (text/x-diff)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 25c310f675..4347941568 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2685,17 +2685,14 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin.  In addition, propagate statusFlags that affect how
+		 * the value is interpreted by vacuum.
+		 */
 		MyProc->xmin = TransactionXmin = xmin;
 
-		/* walsender cheats by passing proc == MyProc, don't check its flags */
-		if (proc != MyProc)
-		{
-			/* Flags being copied must be valid copy-able flags. */
-			Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-			MyProc->statusFlags = proc->statusFlags;
-			ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-		}
+		MyProc->statusFlags |= (proc->statusFlags & PROC_COPYABLE_FLAGS);
+		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 
 		result = true;
 	}
#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#83)
Re: Intermittent buildfarm failures on wrasse

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2022-Apr-15, Tom Lane wrote:

Here's a WIP patch for that. The only exciting thing in it is that
because of some undocumented cowboy programming in walsender.c, the
Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
in ProcArrayInstallRestoredXmin fires unless we skip that.

Hmm, maybe a better use of that define is to use it to select which flags
to copy, rather than to ensure they are the only ones set. What
about this?

Yeah, I thought about that too, but figured that the author probably
had a reason for writing the assertion the way it was. If we want
to redefine PROC_COPYABLE_FLAGS as "flags associated with xmin",
that's fine by me. But I'd suggest that both the name of the macro
and the comment for it in proc.h should be revised to match that
definition.

Another point is that as you've coded it, the code doesn't so much
copy those flags as union them with whatever the recipient had,
which seems wrong. I could go with either

Assert(!(MyProc->statusFlags & PROC_XMIN_FLAGS));
MyProc->statusFlags |= (proc->statusFlags & PROC_XMIN_FLAGS);

or

MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
(proc->statusFlags & PROC_XMIN_FLAGS);

Perhaps the latter is more future-proof.
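
(A toy illustration of the difference, using six-bit strings in place of
statusFlags, with 000110 playing the role of PROC_XMIN_FLAGS:

    SELECT (b'101010' & ~b'000110') | (b'001100' & b'000110') AS masked_copy,
           b'101010' | (b'001100' & b'000110')                AS plain_union;

masked_copy comes out as 101100, i.e. the recipient's bits outside the
mask plus exactly the source's bits inside it, while plain_union comes
out as 101110, retaining a masked bit the recipient already had set.)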

regards, tom lane

#85Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#84)
Re: Intermittent buildfarm failures on wrasse

On Wed, Apr 20, 2022 at 3:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2022-Apr-15, Tom Lane wrote:

Here's a WIP patch for that. The only exciting thing in it is that
because of some undocumented cowboy programming in walsender.c, the
Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
in ProcArrayInstallRestoredXmin fires unless we skip that.

Hmm, maybe a better use of that define is to use it to select which flags
to copy, rather than to ensure they are the only ones set. What
about this?

Yeah, I thought about that too, but figured that the author probably
had a reason for writing the assertion the way it was.

The motivation behind the assertion was that when we copy whole
statusFlags from the leader process to the worker process we want to
make sure that the flags we're copying are a known subset of the flags
that are valid to copy from the leader.

If we want
to redefine PROC_COPYABLE_FLAGS as "flags associated with xmin",
that's fine by me. But I'd suggest that both the name of the macro
and the comment for it in proc.h should be revised to match that
definition.

Another point is that as you've coded it, the code doesn't so much
copy those flags as union them with whatever the recipient had,
which seems wrong. I could go with either

Assert(!(MyProc->statusFlags & PROC_XMIN_FLAGS));
MyProc->statusFlags |= (proc->statusFlags & PROC_XMIN_FLAGS);

or

MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
(proc->statusFlags & PROC_XMIN_FLAGS);

Perhaps the latter is more future-proof.

Copying only xmin-related flags in this way also makes sense to me and
there is no problem at least for now. A note would be that when we
introduce a new flag that needs to be copied in the future, we need to
make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise an
issue similar to the one we fixed in 0f0cfb494004befb0f6e could happen
again.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#86Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Masahiko Sawada (#85)
1 attachment(s)
Re: Intermittent buildfarm failures on wrasse

On 2022-Apr-20, Masahiko Sawada wrote:

MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
(proc->statusFlags & PROC_XMIN_FLAGS);

Perhaps the latter is more future-proof.

Copying only xmin-related flags in this way also makes sense to me and
there is no problem at least for now. A note would be that when we
introduce a new flag that needs to be copied in the future, we need to
make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise an
issue similar to the one we fixed in 0f0cfb494004befb0f6e could happen
again.

OK, done this way -- patch attached.

Reading the comment I wrote about it, I wonder if flags
PROC_AFFECTS_ALL_HORIZONS and PROC_IN_LOGICAL_DECODING should also be
included. I think the only reason we don't care at this point is that
walsenders (logical or otherwise) do not engage in snapshot copying.
But if we were to implement usage of parallel workers sharing a common
snapshot to do table sync in parallel, then ISTM it would be
important to copy at least the latter. Not sure there are any cases
where we might care about the former.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

Attachments:

v2-0001-Repurpose-PROC_COPYABLE_FLAGS-as-PROC_XMIN_FLAGS.patch (text/x-diff)
From 95bd3bf62992987e1d6e078520ff76133248579e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Sat, 14 May 2022 16:51:23 +0200
Subject: [PATCH v2] Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS

This is a slight, convenient semantics change from what commit
0f0cfb494004 introduced that lets us simplify the coding in the one
place where it is used.
---
 src/backend/storage/ipc/procarray.c | 17 +++++++----------
 src/include/storage/proc.h          |  7 +++----
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ca22336e35..cd58c5faf0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2685,17 +2685,14 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin and propagate the statusFlags that affect how the
+		 * value is interpreted by vacuum.
+		 */
 		MyProc->xmin = TransactionXmin = xmin;
-
-		/* walsender cheats by passing proc == MyProc, don't check its flags */
-		if (proc != MyProc)
-		{
-			/* Flags being copied must be valid copy-able flags. */
-			Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-			MyProc->statusFlags = proc->statusFlags;
-			ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-		}
+		MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
+			(proc->statusFlags & PROC_XMIN_FLAGS);
+		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 
 		result = true;
 	}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 15be21c00a..2579e619eb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -69,11 +69,10 @@ struct XidCache
 	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
- * Flags that are valid to copy from another proc, the parallel leader
- * process in practice.  Currently, flags that are set during parallel
- * vacuum and parallel index creation are allowed.
+ * Xmin-related flags. Make sure any flags that affect how the process' Xmin
+ * value is interpreted by VACUUM are included here.
  */
-#define		PROC_COPYABLE_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
-- 
2.30.2

#87Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alvaro Herrera (#86)
Re: Intermittent buildfarm failures on wrasse

On Sun, May 15, 2022 at 12:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Apr-20, Masahiko Sawada wrote:

MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
(proc->statusFlags & PROC_XMIN_FLAGS);

Perhaps the latter is more future-proof.

Copying only xmin-related flags in this way also makes sense to me and
there is no problem at least for now. A note would be that when we
introduce a new flag that needs to be copied in the future, we need to
make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise an
issue similar to the one we fixed in 0f0cfb494004befb0f6e could happen
again.

OK, done this way -- patch attached.

Thank you for updating the patch.

Reading the comment I wrote about it, I wonder if flags
PROC_AFFECTS_ALL_HORIZONS and PROC_IN_LOGICAL_DECODING should also be
included. I think the only reason we don't care at this point is that
walsenders (logical or otherwise) do not engage in snapshot copying.
But if we were to implement usage of parallel workers sharing a common
snapshot to do table sync in parallel, then ISTM it would be
important to copy at least the latter. Not sure there are any cases
where we might care about the former.

Yeah, there seems to be an inconsistency between the comment (and the new
name) and the flags actually included. I think we can include all
xmin-related flags in PROC_XMIN_FLAGS, as the comment says. That way,
it would be useful also for other use cases, and I don't see any
downside for now.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#88Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#83)
2 attachment(s)
Re: Intermittent buildfarm failures on wrasse

I was looking at backpatching this to pg13. That made me realize that
commit dc7420c2c927 changed things in 14; and before that commit, the
bitmask that is checked is PROCARRAY_FLAGS_VACUUM, which is defined
independently of whatever proc.h says. As far as I can
tell, there's no problem with the patches I post here (the backpatched
versions for pg13 and pg14). But it's something to be aware of; and if
we do want to add the additional bits to the bitmask, we should do that
in a separate master-only commit.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

Attachments:

v13-0001-Repurpose-PROC_COPYABLE_FLAGS-as-PROC_XMIN_FLAGS.patch (text/x-diff)
From 6595af9af208de33ab431b502e2f178f7d0f8007 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Sat, 14 May 2022 16:51:23 +0200
Subject: [PATCH] Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS

This is a slight, convenient semantics change from what commit
0f0cfb494004 introduced that lets us simplify the coding in the one
place where it is used.
---
 src/backend/storage/ipc/procarray.c | 11 ++++++-----
 src/include/storage/proc.h          |  7 +++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6ff8d8699b..447b6e3de7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1916,12 +1916,13 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin and propagate the vacuumFlags that affect how the
+		 * value is interpreted by vacuum.
+		 */
 		MyPgXact->xmin = TransactionXmin = xmin;
-
-		/* Flags being copied must be valid copy-able flags. */
-		Assert((pgxact->vacuumFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-		MyPgXact->vacuumFlags = pgxact->vacuumFlags;
+		MyPgXact->vacuumFlags = (MyPgXact->vacuumFlags & ~PROC_XMIN_FLAGS) |
+			(pgxact->vacuumFlags & PROC_XMIN_FLAGS);
 
 		result = true;
 	}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9cf9684b41..7c85b5645b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -63,11 +63,10 @@ struct XidCache
 	(PROC_IN_VACUUM | PROC_IN_ANALYZE | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
- * Flags that are valid to copy from another proc, the parallel leader
- * process in practice.  Currently, a flag that is set during parallel
- * vacuum is allowed.
+ * Xmin-related flags. Make sure any flags that affect how the process' Xmin
+ * value is interpreted by VACUUM are included here.
  */
-#define		PROC_COPYABLE_FLAGS (PROC_IN_VACUUM)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
-- 
2.30.2

v14-0001-Repurpose-PROC_COPYABLE_FLAGS-as-PROC_XMIN_FLAGS.patch (text/x-diff)
From 35197d187fe3e7c9e4bd563396c9c7459e39a7ff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Sat, 14 May 2022 16:51:23 +0200
Subject: [PATCH] Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS

This is a slight, convenient semantics change from what commit
0f0cfb494004 introduced that lets us simplify the coding in the one
place where it is used.
---
 src/backend/storage/ipc/procarray.c | 11 ++++++-----
 src/include/storage/proc.h          |  7 +++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 127be9c017..08053a7e8f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2686,12 +2686,13 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin and propagate the statusFlags that affect how the
+		 * value is interpreted by vacuum.
+		 */
 		MyProc->xmin = TransactionXmin = xmin;
-
-		/* Flags being copied must be valid copy-able flags. */
-		Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-		MyProc->statusFlags = proc->statusFlags;
+		MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
+			(proc->statusFlags & PROC_XMIN_FLAGS);
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 
 		result = true;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 11b514c9ae..1464fad9b9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -66,11 +66,10 @@ struct XidCache
 	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
- * Flags that are valid to copy from another proc, the parallel leader
- * process in practice.  Currently, flags that are set during parallel
- * vacuum and parallel index creation are allowed.
+ * Xmin-related flags. Make sure any flags that affect how the process' Xmin
+ * value is interpreted by VACUUM are included here.
  */
-#define		PROC_COPYABLE_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
-- 
2.30.2