BF member drongo doesn't like 035_standby_logical_decoding.pl
I realized just now that drongo has been intermittently failing like this:
147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
# Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Failed test 'confl_active_logicalslot updated'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.
This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing. The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08). So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.
A search of the buildfarm logs did not turn up similar failures
on any other animals.
I have no idea how to proceed on narrowing down the cause...
regards, tom lane
PS: FTR, the hits I got on this in the past 90 days were
sysname | branch | snapshot | stage | l
---------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
drongo | HEAD | 2024-10-29 10:52:07 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2024-11-05 11:11:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2024-11-22 08:58:52 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | HEAD | 2024-11-29 22:23:47 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2024-12-04 20:33:42 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2024-12-20 17:00:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | HEAD | 2024-12-24 21:07:27 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2025-01-10 15:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2025-01-10 18:05:18 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2025-01-11 20:40:09 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-12 14:06:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2025-01-13 07:19:22 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-14 10:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2025-01-20 20:11:00 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-24 01:54:55 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2025-01-24 04:21:44 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
Hello Tom,
24.01.2025 18:42, Tom Lane wrote:
I realized just now that drongo has been intermittently failing like this:
147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
# Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Failed test 'confl_active_logicalslot updated'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.
This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing. The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08). So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.
A search of the buildfarm logs did not turn up similar failures
on any other animals.
I have no idea how to proceed on narrowing down the cause...
Please take a look at the list of such failures since 2024-06-01 I
collected here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation
There is also a reference to a discussion of the failure there:
/messages/by-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
Hi,
Thanks for noticing the pattern.
On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
I realized just now that drongo has been intermittently failing like this:
147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
# Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Failed test 'confl_active_logicalslot updated'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.
There's a failure before this one:
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
Hm.
2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
...
2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
...
2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
...
2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0
There is no further message from 1612 and no intervening message by 3984
(startup) between invalidating row_removal_inactiveslot and reaching the end
of the WAL.
The regress_log does have:
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
I.e. this looks to not be a failure of the test to find the right message or
the stats about the termination not being flushed, but that
row_removal_activeslot isn't getting terminated. And due to the replay_lsn
check on the primary, we know that replication must have progressed past the
point of the conflict. So the problem doesn't appear to be that we're waiting
for termination, but that somehow we determined that termination is not
necessary.
Unfortunately with the logs available I can't really pinpoint why. I think we
need to
a) increase the postgres log level for the standby
b) add some explicit debug logging to report the contents of
pg_replication_slots
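For b), an untested sketch of what I mean (using the test's $node_standby; the
column list, and leaning on the "conflicting" column, are just what I'd reach
for first, not necessarily what we'd commit):
# Bump the standby's log level and dump the slot state into the
# regress_log right before the poll, so a failing run records whether
# the active slot was ever considered for invalidation.
$node_standby->append_conf('postgresql.conf', "log_min_messages = debug2");
$node_standby->reload;
note "standby pg_replication_slots:\n"
  . $node_standby->safe_psql('postgres',
	qq{SELECT slot_name, active_pid, xmin, catalog_xmin, conflicting
	   FROM pg_replication_slots});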
I hope the issue is that somehow we didn't need to terminate, rather than
missing the need to terminate and just continuing on with replay. But I can't
think of a scenario in which there wouldn't be a need to terminate...
However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing. The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08). So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.
Looking at the failures I see on drongo, I also started to wonder if there's
something broken around recovery conflict handling. Perhaps windows specific?
There e.g. are the following:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-26%2005%3A49%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2016%3A39%3A49
Which fail like this:
------------------------------------- 8< -------------------------------------
stderr:
# poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
# Failed test 'startup deadlock: lock acquisition is waiting'
# at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 261.
Greetings,
Andres Freund
Hi,
On 2025-01-24 14:23:26 -0500, Andres Freund wrote:
On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
There's a failure before this one:
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
Hm.
2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
...
2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
...
2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
...
2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0
There is no further message from 1612 and no intervening message by 3984
(startup) between invalidating row_removal_inactiveslot and reaching the end
of the WAL.
The regress_log does have:
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
I.e. this looks to not be a failure of the test to find the right message or
the stats about the termination not being flushed, but that
row_removal_activeslot isn't getting terminated. And due to the replay_lsn
check on the primary, we know that replication must have progressed past the
point of the conflict. So the problem doesn't appear to be that we're waiting
for termination, but that somehow we determined that termination is not
necessary.
Unfortunately with the logs available I can't really pinpoint why. I think we
need to
a) increase the postgres log level for the standby
b) add some explicit debug logging to report the contents of
pg_replication_slots
I hope the issue is that somehow we didn't need to terminate, rather than
missing the need to terminate and just continuing on with replay. But I can't
think of a scenario in which there wouldn't be a need to terminate...
<thinks>
Hm, maybe I'm missing something, but isn't it possible for the active slot to
actually progress decoding past the conflict point? It's an active slot, with
the consumer running in the background, so all that needs to happen for that
is that logical decoding progresses past the conflict point. That requires
there be some reference to a newer xid to be in the WAL, but there's nothing
preventing that afaict?
In fact, I now saw this comment:
# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon.
Which seems precisely what's happening here?
If that's the issue, I think we need to find a way to block logical decoding
from making forward progress during the test.
The easiest way would be to stop pg_recvlogical and emit a bunch of changes,
so that the backend is stalled sending out data. But that'd require a hard to
predict amount of data to be emitted, which isn't great.
But perhaps we could do something smarter, by starting a session on the
primary that acquires an access exclusive lock on a relation that logical
decoding will need to access? The tricky bit likely would be that it'd
somehow need to *not* prevent VACUUM on the primary.
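As a rough, untested sketch of that idea (the table name is made up, and it
doesn't answer either open question: which relation would actually work, or
how to avoid blocking the VACUUM at the same time):
# Sketch: a primary session holds an AEL on some relation the decoding
# session would need to access; the AEL is WAL-logged, so after replay
# decoding on the standby would stall behind it and could not advance
# past the conflict point.
my $locker = $node_primary->background_psql('testdb');
$locker->query_safe('BEGIN');
$locker->query_safe('LOCK TABLE decoy_rel IN ACCESS EXCLUSIVE MODE');
# ... run the VACUUM and wait for the invalidation to be logged ...
$locker->query_safe('COMMIT');
$locker->quit;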
If we could trigger VACUUM in a transaction on the primary this would be
easy, but we can't.
Greetings,
Andres
Hi,
On 2025-01-24 21:00:00 +0200, Alexander Lakhin wrote:
24.01.2025 18:42, Tom Lane wrote:
I realized just now that drongo has been intermittently failing like this:
147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
# Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Failed test 'confl_active_logicalslot updated'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.
This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing. The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08). So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.
A search of the buildfarm logs did not turn up similar failures
on any other animals.
I have no idea how to proceed on narrowing down the cause...
Please take a look at the list of such failures since 2024-06-01 I
collected here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation
There is also a reference to a discussion of the failure there:
/messages/by-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)
I don't think it's quite right to blame this on bgwriter. E.g. a checkpoint
will also emit XLOG_RUNNING_XACTS. The problem is that the test just is racy
and that needs to be fixed.
Greetings,
Andres Freund
Alexander Lakhin <exclusion@gmail.com> writes:
There is also a reference to a discussion of the failure there:
/messages/by-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)
Hmm, that's an interesting theory, but it fails to explain why
only drongo is showing this failure. It's not particularly slow
compared to some other BF animals.
I suspect the correct explanation involves "it's Windows" somehow,
but no idea exactly how.
BTW, looking more closely at the failures, I notice that some of them
are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
So those are two different steps in the test script, and there are
other steps that are very similar but have not failed. Curiouser
and curiouser.
regards, tom lane
Hi,
On 2025-01-24 14:56:45 -0500, Tom Lane wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
There is also a reference to a discussion of the failure there:
/messages/by-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)
Hmm, that's an interesting theory, but it fails to explain why
only drongo is showing this failure. It's not particularly slow
compared to some other BF animals.
One interesting bit is that both on successful and failing runs the preceding
test is quite slow:
[14:18:13.386](3.192s) ok 25 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 760, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 11 pages set all-visible, 10 pages set all-frozen (0 were all-visible)
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 11.684 MB/s, avg write rate: 23.369 MB/s
buffer usage: 90 hits, 8 reads, 16 dirtied
WAL usage: 28 records, 15 full page images, 123107 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403FA80 on primary
done
[14:18:48.597](35.210s) ok 26 - inactiveslot slot invalidation is logged with vacuum on pg_class
That's really quite slow. Looking at a few other instances it always just
seems to take a few seconds. Obviously running for longer does make it more
likely for checkpoint/bgwriter to log running xacts.
Unfortunately there's no timing visible between the vacuum and waiting for
replication... And the passing runs don't seem to have the server logs.
Because of that I'm looking at a failure:
[05:23:26.630](3.831s) ok 21 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\REL_16_STABLE\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 417 remain, 0 are dead but not yet removable
removable cutoff: 748, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.264 MB/s, avg write rate: 0.495 MB/s
buffer usage: 77 hits, 8 misses, 15 dirtied
WAL usage: 28 records, 15 full page images, 122841 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.23 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
[05:23:54.630](28.000s) ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
The vacuum only starts at: 05:23:49.294, i.e. after ~23s.
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
2024-09-06 05:23:49.301 UTC [552:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/REL_16_STABLE/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_primary_data/pgdata/pg_hba.conf:2)
2024-09-06 05:23:49.301 UTC [552:3] [unknown] LOG: connection authorized: user=pgrunner database=testdb application_name=035_standby_logical_decoding.pl
2024-09-06 05:23:49.355 UTC [552:4] 035_standby_logical_decoding.pl LOG: statement: VACUUM verbose pg_class;
2024-09-06 05:23:49.607 UTC [552:5] 035_standby_logical_decoding.pl LOG: statement: INSERT INTO flush_wal DEFAULT VALUES;
2024-09-06 05:23:49.609 UTC [552:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.322 user=pgrunner database=testdb host=127.0.0.1 port=51838
Before that we see:
2024-09-06 05:23:27.683 UTC [168:4] 035_standby_logical_decoding.pl LOG: statement: SELECT xmin IS NULL AND catalog_xmin IS NULL
FROM pg_catalog.pg_replication_slots
WHERE slot_name = 'primary_physical';
2024-09-06 05:23:31.447 UTC [5836:4] 035_standby_logical_decoding.pl LOG: statement: select pg_snapshot_xmin(pg_current_snapshot());
2024-09-06 05:23:36.963 UTC [4908:4] 035_standby_logical_decoding.pl LOG: statement: CREATE TABLE conflict_test(x integer, y text);
2024-09-06 05:23:36.972 UTC [4908:5] 035_standby_logical_decoding.pl LOG: statement: DROP TABLE conflict_test;
2024-09-06 05:23:45.192 UTC [4240:4] 035_standby_logical_decoding.pl LOG: statement: SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - 746) > 0
Between 05:23:30.797 and 05:23:49.601 nothing is logged on the standby.
There's a lot of time between disconnects and connects in the test around that
time:
2024-09-06 05:23:31.591 UTC [5836:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.240 user=pgrunner database=testdb host=127.0.0.1 port=51817
2024-09-06 05:23:36.897 UTC [4908:1] [unknown] LOG: connection received: host=127.0.0.1 port=51829
2024-09-06 05:23:37.032 UTC [4908:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.142 user=pgrunner database=testdb host=127.0.0.1 port=51829
2024-09-06 05:23:45.135 UTC [4240:1] [unknown] LOG: connection received: host=127.0.0.1 port=51836
2024-09-06 05:23:45.272 UTC [4240:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.152 user=pgrunner database=testdb host=127.0.0.1 port=51836
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
Even though afaict the perl script doesn't do anything else between those
connections.
BTW, looking more closely at the failures, I notice that some of them
are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
So those are two different steps in the test script, and there are
other steps that are very similar but have not failed. Curiouser
and curiouser.
I think that makes some sense:
"Scenario 1":
Doesn't have the race to the same degree, because a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing after it started (due
to the logged AEL at the start of VACFULL).
"Scenario 2" and "Scenario 3":
Nearly the same test, except once with a database-local relation and once with
a shared rel.
"Scenario 4":
Doesn't expect to trigger a conflict, so the race doesn't matter.
"Scenario 5":
Theoretically racy, but the window seems much smaller, it looks like the
conflict has multiple chances to be triggered due to each of the updates
happening in a separate session (and thus transaction).
Greetings,
Andres Freund
Hi,
On Fri, Jan 24, 2025 at 02:44:21PM -0500, Andres Freund wrote:
Hm, maybe I'm missing something, but isn't it possible for the active slot to
actually progress decoding past the conflict point? It's an active slot, with
the consumer running in the background, so all that needs to happen for that
is that logical decoding progresses past the conflict point. That requires
there be some reference to a newer xid to be in the WAL, but there's nothing
preventing that afaict?
In fact, I now saw this comment:
# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon.
Yeah, that comes from 46d8587b504 (where we tried to reduce as much as possible
the risk of seeing an unwanted xl_running_xacts being generated).
Which seems precisely what's happening here?
Much probably yes.
If that's the issue, I think we need to find a way to block logical decoding
from making forward progress during the test.
The easiest way would be to stop pg_recvlogical and emit a bunch of changes,
so that the backend is stalled sending out data. But that'd require a hard to
predict amount of data to be emitted, which isn't great.
What about using an injection point instead to block pg_recvlogical until
we want it to resume?
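For instance something along these lines (untested; injection_points_attach(),
injection_points_wakeup() and injection_points_detach() are the real functions
from src/test/modules/injection_points, but 'logical-decoding-hold' is an
invented name that would need a matching injection point added in the decoding
path first):
# The extension has to be created on the primary and replicated;
# attaching the point happens locally on the standby, where the
# walsender serving pg_recvlogical runs.
$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
$node_primary->wait_for_replay_catchup($node_standby);
$node_standby->safe_psql('testdb',
	"SELECT injection_points_attach('logical-decoding-hold', 'wait');");
# ... VACUUM on the primary, check that the invalidation is logged ...
$node_standby->safe_psql('testdb',
	"SELECT injection_points_wakeup('logical-decoding-hold');");
$node_standby->safe_psql('testdb',
	"SELECT injection_points_detach('logical-decoding-hold');");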
But perhaps we could do something smarter, by starting a session on the
primary that acquires an access exclusive lock on a relation that logical
decoding will need to access? The tricky bit likely would be that it'd
somehow need to *not* prevent VACUUM on the primary.
Hm, I'm not sure how we could do that.
If we could trigger VACUUM in a transaction on the primary this would be
easy, but we can't.
Another idea that I had ([1]) was to make use of injection points
around places where RUNNING_XACTS is emitted. IIRC I tried to work on this but
that was not as simple as it sounds, as we need the startup process not to be blocked.
[1]: /messages/by-id/ZmadPZlEecJNbhvI@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Fri, Jan 24, 2025 at 11:42:15AM -0500, Tom Lane wrote:
I realized just now that drongo has been intermittently failing like this:
Thanks for the report!
PS: FTR, the hits I got on this in the past 90 days were
sysname | branch | snapshot | stage | l
---------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
drongo | HEAD | 2024-10-29 10:52:07 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2024-11-05 11:11:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2024-11-22 08:58:52 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | HEAD | 2024-11-29 22:23:47 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2024-12-04 20:33:42 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2024-12-20 17:00:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | HEAD | 2024-12-24 21:07:27 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2025-01-10 15:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2025-01-10 18:05:18 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2025-01-11 20:40:09 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-12 14:06:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2025-01-13 07:19:22 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-14 10:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | HEAD | 2025-01-20 20:11:00 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_16_STABLE | 2025-01-24 01:54:55 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
drongo | REL_17_STABLE | 2025-01-24 04:21:44 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
Out of curiosity, how did you generate this output? (that looks very useful)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
On Fri, Jan 24, 2025 at 11:42:15AM -0500, Tom Lane wrote:
PS: FTR, the hits I got on this in the past 90 days were
sysname | branch | snapshot | stage | l
---------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
drongo | HEAD | 2024-10-29 10:52:07 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
...
Out of curiosity, how did you generate this output? (that looks very useful)
That is from a SQL query on the database that stores the last few
months' worth of buildfarm logs. I believe current policy is that
only committers will be granted access to that machine, partly
because of security concerns and partly because of resource
limitations. But in the interests of transparency, what I did was
pgbfprod=> select * from
(select sysname, s.branch, snapshot, stage,
unnest(string_to_array(log_text, E'\n')) as l
from build_status s join build_status_log l using (sysname, snapshot)
where stage != 'OK' and snapshot > localtimestamp - '3 mons'::interval) ss
where l like '%Failed test%activeslot slot invalidation%'
\g outfile
(Other than the specific string-to-search-for, this is a canned
query pattern that I use a lot. It scans all failed tests' logs
in the past 3 months.) I did some manual duplicate-removal
and hand filtering afterwards, IIRC.
regards, tom lane
Hi,
On Mon, Jan 27, 2025 at 09:28:57AM -0500, Tom Lane wrote:
Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
On Fri, Jan 24, 2025 at 11:42:15AM -0500, Tom Lane wrote:
PS: FTR, the hits I got on this in the past 90 days were
sysname | branch | snapshot | stage | l
---------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
drongo | HEAD | 2024-10-29 10:52:07 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
drongo | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
...
Out of curiosity, how did you generate this output? (that looks very useful)
That is from a SQL query on the database that stores the last few
months' worth of buildfarm logs. I believe current policy is that
only committers will be granted access to that machine, partly
because of security concerns and partly because of resource
limitations.
Makes sense.
But in the interests of transparency, what I did was
pgbfprod=> select * from
(select sysname, s.branch, snapshot, stage,
unnest(string_to_array(log_text, E'\n')) as l
from build_status s join build_status_log l using (sysname, snapshot)
where stage != 'OK' and snapshot > localtimestamp - '3 mons'::interval) ss
where l like '%Failed test%activeslot slot invalidation%'
\g outfile
(Other than the specific string-to-search-for, this is a canned
query pattern that I use a lot. It scans all failed tests' logs
in the past 3 months.) I did some manual duplicate-removal
and hand filtering afterwards, IIRC.
Thanks for the explanation! Having access to this database looks like a big
time saver in some situations.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jan 27, 2025 at 07:13:01AM +0000, Bertrand Drouvot wrote:
On Fri, Jan 24, 2025 at 02:44:21PM -0500, Andres Freund wrote:
If we could trigger VACUUM in a transaction on the primary this would be
easy, but we can't.
Another idea that I had ([1]) was to make use of injection points
around places where RUNNING_XACTS is emitted. IIRC I tried to work on this but
that was not as simple as it sounds, as we need the startup process not to be blocked.
I just proposed a patch to make use of an injection point to prevent the catalog_xmin
of a logical slot to advance past the conflict point ([1]). That does not fix the
issue on v16 though.
[1]: /messages/by-id/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com