Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Started by Bertrand Drouvot · almost 3 years ago · 29 messages
#1 Bertrand Drouvot
bertranddrouvot.pg@gmail.com

Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were
not invalidated (while we expected vacuum to remove dead rows,
leading to slot invalidation on the standby).

Though we don't have strong evidence that this
was due to transactions holding back the global xmin (as vacuum did
not run in verbose mode), suspicion is high enough, as Tom pointed
out that the test is broken on its face (see [1]).

The proposed patch:

- Sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- Ensures that vacuum is able to remove dead rows before launching
the slot invalidation tests.
- If after 10 attempts vacuum is still not able to remove the dead
rows, skips the slot invalidation tests: that should be pretty
rare, as currently the majority of the tests are green (with only one attempt).
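The retry-then-skip control flow described above can be sketched as follows (an illustrative Python model, not the actual Perl TAP code; `try_vacuum` is a hypothetical callback standing in for running VACUUM and checking its verbose output for dead-but-unremovable tuples):

```python
def run_invalidation_tests(try_vacuum, run_tests, skip_tests, max_attempts=10):
    """Retry vacuum; run the slot invalidation tests only once it succeeds.

    try_vacuum() returns True when vacuum was able to remove all dead rows
    (i.e. no "dead but not yet removable" tuples reported).
    """
    for _ in range(max_attempts):
        if try_vacuum():
            run_tests()
            return "ran"
    # After max_attempts failed attempts, skip rather than fail spuriously.
    skip_tests()
    return "skipped"
```

The key design point is that a vacuum blocked by a held-back global xmin makes the test outcome indeterminate, so skipping is preferred over a false failure.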

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]: /messages/by-id/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9@OSZPR01MB6310.jpnprd01.prod.outlook.com
[2]: /messages/by-id/CA+TgmobHGpU2ZkChgKifGDLaf_+mFA7njEpeTjfyNf_msCZYew@mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Ensure-vacuum-did-remove-dead-rows-in-035_standby.patch (text/plain, +150 -114)
#2 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#1)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On 5/30/23 12:34 PM, Drouvot, Bertrand wrote:

Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were
not invalidated (while we expected vacuum to remove dead rows,
leading to slot invalidation on the standby).

Though we don't have strong evidence that this
was due to transactions holding back the global xmin (as vacuum did
not run in verbose mode), suspicion is high enough, as Tom pointed
out that the test is broken on its face (see [1]).

The proposed patch:

- Sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- Ensures that vacuum is able to remove dead rows before launching
the slot invalidation tests.
- If after 10 attempts vacuum is still not able to remove the dead
rows, skips the slot invalidation tests: that should be pretty
rare, as currently the majority of the tests are green (with only one attempt).

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]: /messages/by-id/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9@OSZPR01MB6310.jpnprd01.prod.outlook.com
[2]: /messages/by-id/CA+TgmobHGpU2ZkChgKifGDLaf_+mFA7njEpeTjfyNf_msCZYew@mail.gmail.com

Please find attached V2 that, instead of the above proposal, waits for a new snapshot
that has a newer horizon before doing the vacuum (as proposed by Andres in [1]).

So, V2:

- Sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- Waits for a new snapshot that has a newer horizon before doing the vacuum(s).
- Addresses the nitpicks mentioned by Robert in [2].

V2 also keeps the verbose mode for the vacuum(s) (as done in V1), as it may help
for further analysis if needed.
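The "wait for a newer horizon" step amounts to polling until the primary's snapshot xmin has advanced past the XID of interest (in the test this is done by polling txid_snapshot_xmin(txid_current_snapshot()) via psql, as visible in the primary log later in this thread). A rough Python model, with illustrative names:

```python
import time

def wait_for_newer_horizon(snapshot_xmin, target_xid, timeout_s=30.0, poll_s=0.1):
    """Return True once snapshot_xmin() exceeds target_xid, False on timeout.

    snapshot_xmin is a stand-in for querying
    txid_snapshot_xmin(txid_current_snapshot()) on the primary.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if snapshot_xmin() > target_xid:
            return True
        time.sleep(poll_s)
    return False
```

Only once this wait succeeds can vacuum's removable cutoff be expected to cover the dead rows the test just created.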

[1]: /messages/by-id/20230530152426.ensapay7pozh7ozn@alap3.anarazel.de
[2]: /messages/by-id/CA+TgmobHGpU2ZkChgKifGDLaf_+mFA7njEpeTjfyNf_msCZYew@mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Ensure-vacuum-is-able-to-remove-dead-rows-in-035_.patch (text/plain, +37 -40)
#3 Alexander Lakhin
exclusion@gmail.com
In reply to: Bertrand Drouvot (#2)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hello Michael and Bertrand,

I'd also like to note that even with FREEZE added [1], I happened to see
the test failure:
5       #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
5       #   at t/035_standby_logical_decoding.pl line 222.
5
5       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
5       #   at t/035_standby_logical_decoding.pl line 227.

where 035_standby_logical_decoding_primary.log contains:
...
2024-01-09 07:44:26.480 UTC [820142] 035_standby_logical_decoding.pl LOG:  statement: DROP TABLE conflict_test;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl LOG:  statement: VACUUM (VERBOSE, FREEZE) pg_class;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl INFO:  aggressively vacuuming
"testdb.pg_catalog.pg_class"
2024-01-09 07:44:27.099 UTC [820143] DEBUG:  autovacuum: processing database "testdb"
2024-01-09 07:44:27.102 UTC [820142] 035_standby_logical_decoding.pl INFO:  finished vacuuming
"testdb.pg_catalog.pg_class": index scans: 1
        pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
        tuples: 0 removed, 423 remain, 4 are dead but not yet removable
        removable cutoff: 762, which was 2 XIDs old when operation ended
        new relfrozenxid: 762, which is 2 XIDs ahead of previous value
        frozen: 1 pages from table (9.09% of total) had 1 tuples frozen
....

Thus, just adding FREEZE is seemingly not enough. It makes me wonder whether
0174c2d21 should be superseded by a patch like the one discussed (or just have
autovacuum = off added)...

09.01.2024 07:59, Michael Paquier wrote:

Alexander, does the test gain in stability once you begin using the
patch posted on [2], mentioned by Bertrand?

(Also, perhaps we'd better move the discussion to the other thread
where the patch has been sent.)

[2]: /messages/by-id/d40d015f-03a4-1cf2-6c1f-2b9aca860762@gmail.com

09.01.2024 08:29, Bertrand Drouvot wrote:

Alexander, please find attached v3, which is more or less a rebased version of it.

Bertrand, thank you for updating the patch!

Michael, it definitely increases stability of the test (tens of iterations
with 20 tests in parallel performed successfully), although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

psql:<stdin>:1: INFO:  vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 754, which was 0 XIDs old when operation ended
...
Waiting for replication conn standby's replay_lsn to pass 0/403E6F8 on primary

But I see no VACUUM records in WAL:
rmgr: Transaction len (rec/tot):    222/   222, tx:          0, lsn: 0/0403E468, prev 0/0403E370, desc: INVALIDATION ;
inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache
2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Standby     len (rec/tot):    234/   234, tx:          0, lsn: 0/0403E548, prev 0/0403E468, desc: INVALIDATIONS ;
relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55
catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Heap        len (rec/tot):     60/   140, tx:        754, lsn: 0/0403E638, prev 0/0403E548, desc: INSERT off: 2,
flags: 0x08, blkref #0: rel 1663/16384/16385 blk 0 FPW
rmgr: Transaction len (rec/tot):     46/    46, tx:        754, lsn: 0/0403E6C8, prev 0/0403E638, desc: COMMIT
2024-01-09 13:40:59.873385 UTC
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E6F8, prev 0/0403E6C8, desc: RUNNING_XACTS
nextXid 755 latestCompletedXid 754 oldestRunningXid 755
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 0/0403E730, prev 0/0403E6F8, desc: CHECKPOINT_REDO
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E750, prev 0/0403E730, desc: RUNNING_XACTS
nextXid 755 latestCompletedXid 754 oldestRunningXid 755
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/0403E788, prev 0/0403E750, desc:
CHECKPOINT_ONLINE redo 0/403E730; tli 1; prev tli 1; fpw true; xid 0:755; oid 24576; multi 1; offset 0; oldest xid 728
in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 755; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E800, prev 0/0403E788, desc: RUNNING_XACTS
nextXid 755 latestCompletedXid 754 oldestRunningXid 755

(Full logs are attached.)

[1]: /messages/by-id/4fd52508-54d7-0202-5bd3-546c2295967f@gmail.com

Best regards,
Alexander

Attachments:

035-failures-after-vacuum-on-pg_class.tar.gz (application/gzip)
#4 Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#3)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:

Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
0174c2d21 should be superseded by a patch like discussed (or just have
autovacuum = off added)...

Adding an extra FREEZE offers an extra insurance, so I don't see why
it would be a problem to add it to stabilize the horizon conflicts on
the standbys.

09.01.2024 07:59, Michael Paquier wrote:
Bertrand, thank you for updating the patch!

Michael, it definitely increases stability of the test (tens of iterations
with 20 tests in parallel performed successfully), although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

Something I'd like to confirm here: you still see this failure with
the patch, but without an extra FREEZE, right? If we do both, the
test would get more stable, wouldn't it?
--
Michael

#5 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#3)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:

Michael, it definitely increases stability of the test (tens of iterations
with 20 tests in parallel performed successfully),

Thanks for testing!

although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
"

That seems weird: the inactive slot has been invalidated while the active one was not.
While it takes a bit longer to invalidate an active slot, I don't think the test can
move on until both are invalidated (leading to tests 24 and 25). I can
see the tests are very slow to run (13.993s for 24) but still don't get how 24 could
succeed while 25 does not.

Looking at 2/regress_log_035_standby_logical_decoding:

"
[13:41:02.076](20.279s) ok 23 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:41:02.076](0.000s) not ok 24 - activeslot slot invalidation is logged with vacuum on pg_class
"

Same "weird" behavior but this time the tests numbering are not the same (23 and 24).
That is even more weird as those tests should be the 24 and 25 ones.

Would it be possible to also send the standby logs?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6 Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#4)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

10.01.2024 07:26, Michael Paquier wrote:

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:

Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
0174c2d21 should be superseded by a patch like discussed (or just have
autovacuum = off added)...

Adding an extra FREEZE offers an extra insurance, so I don't see why
it would be a problem to add it to stabilize the horizon conflicts on
the standbys.

As 0174c2d21 added FREEZE already, I meant to add "autovacuum = off" or
apply a fix similar to what we're discussing here.

Michael, it definitely increases stability of the test (tens of iterations
with 20 tests in parallel performed successfully), although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

Something I'd like to confirm here: you still see this failure with
the patch, but without an extra FREEZE, right? If we do both, the
test would get more stable, wouldn't it?

Yes, I tested the patch as-is, without FREEZE, but it looks like that doesn't
matter in this case. And sorry for the misleading information about missing
VACUUM records in my previous message; please ignore it.

10.01.2024 12:46, Bertrand Drouvot wrote:

although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
"

That seems weird: the inactive slot has been invalidated while the active one was not.
While it takes a bit longer to invalidate an active slot, I don't think the test can
move on until both are invalidated (leading to tests 24 and 25). I can
see the tests are very slow to run (13.993s for 24) but still don't get how 24 could
succeed while 25 does not.

...

Would it be possible to also send the standby logs?

Yes, please look at the attached logs. This time I've build postgres with
-DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
wal_keep_size=1GB
wal_debug = on
log_autovacuum_min_duration = 0
log_statement = 'all'
log_min_messages = INFO

The archive attached contains logs from four runs:
recovery-1-ok -- an example of successful run for reference
recovery-7-pruning and recovery-19-pruning -- failures with a failed
 subtest 'activeslot slot invalidation is logged with on-access pruning'
recovery-15-vacuum_pg_class -- a failure with a failed
 subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

The distinction that I see in the failed run logs, for example in
recovery-15-vacuum_pg_class, 035_standby_logical_decoding_standby.log:
2024-01-10 11:00:18.700 UTC [789618] LOG:  REDO @ 0/4020220; LSN 0/4020250: prev 0/401FDE0; xid 753; len 20 -
Transaction/COMMIT: 2024-01-10 11:00:18.471694+00
2024-01-10 11:00:18.797 UTC [789618] LOG:  REDO @ 0/4020250; LSN 0/4020288: prev 0/4020220; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754
2024-01-10 11:00:19.013 UTC [789618] LOG:  REDO @ 0/4020288; LSN 0/40202C8: prev 0/4020250; xid 0; len 9; blkref #0: rel
1663/16384/2610, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused:
0, redirected: [], dead: [48], unused: []
2024-01-10 11:00:19.111 UTC [789618] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-01-10 11:00:19.111 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:19.111 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to release replication slot "row_removal_activeslot"
2024-01-10 11:00:29.109 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:29.109 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl ERROR:  canceling statement due to conflict with
recovery
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl DETAIL:  User was using a logical replication slot
that must be invalidated.
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT
"row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl LOG:  released logical replication slot
"row_removal_activeslot"

is an absent message 'obsolete replication slot "row_removal_activeslot"'
and an additional record 'Standby/RUNNING_XACTS', which can be found in
035_standby_logical_decoding_primary.log:
2024-01-10 11:00:18.515 UTC [783410] LOG:  xlog bg flush request write 0/4020250; flush: 0/4020250, current is write
0/4020220; flush 0/4020220
2024-01-10 11:00:18.646 UTC [783387] LOG:  INSERT @ 0/4020288:  - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid
753 oldestRunningXid 754
2024-01-10 11:00:18.702 UTC [790526] 035_standby_logical_decoding.pl LOG:  statement: SELECT (select
txid_snapshot_xmin(txid_current_snapshot()) - 753) > 0
2024-01-10 11:00:18.724 UTC [783410] LOG:  xlog bg flush request write 0/4020288; flush: 0/4020288, current is write
0/4020250; flush 0/4020250

So perhaps it can affect an active slot invalidation?

Best regards,
Alexander

Attachments:

035-failures.tar.gz (application/gzip, +16 -6)
#7 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#6)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Wed, Jan 10, 2024 at 05:00:01PM +0300, Alexander Lakhin wrote:

10.01.2024 12:46, Bertrand Drouvot wrote:

Would it be possible to also send the standby logs?

Yes, please look at the attached logs. This time I've build postgres with
-DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
wal_keep_size=1GB
wal_debug = on
log_autovacuum_min_duration = 0
log_statement = 'all'
log_min_messages = INFO

The archive attached contains logs from four runs:
recovery-1-ok -- an example of successful run for reference
recovery-7-pruning and recovery-19-pruning -- failures with a failed
 subtest 'activeslot slot invalidation is logged with on-access pruning'
recovery-15-vacuum_pg_class -- a failure with a failed
 subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

Thanks a lot for the testing!

is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation() with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the active
process, and then (later) we should call ReportSlotInvalidation() again
with the second parameter set to false (to emit the message that we don't
see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
call.

The thing is that it looks like we exited the loop in InvalidatePossiblyObsoleteSlot(),
because there are more messages from the startup process (789618) after the:

"
2024-01-10 11:00:29.109 UTC [789618] LOG: terminating process 790377 to release replication slot "row_removal_activeslot"
"

one.

Do you think you could try to add more debug messages in InvalidatePossiblyObsoleteSlot()
to understand why the second call to ReportSlotInvalidation() is not done and, IIUC,
why/how we exited the loop?

FWIW, I did try to reproduce by launching pg_recvlogical and then kill -SIGSTOP
it. Then producing a conflict, I'm able to get the first message and not the second
one (which is expected). But the startup process does not exit the loop, which is
expected here.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#8 Alexander Lakhin
exclusion@gmail.com
In reply to: Bertrand Drouvot (#7)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi Bertrand,

10.01.2024 19:32, Bertrand Drouvot wrote:

is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation() with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the active
process, and then (later) we should call ReportSlotInvalidation() again
with the second parameter set to false (to emit the message that we don't
see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
call.

I've found a way to reproduce the issue without slowing down a machine or
running multiple tests in parallel. It's enough to add a delay that allows
BackgroundWriterMain() to execute LogStandbySnapshot():
@@ -692,2 +690,3 @@
  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);

With this delay, I get the failure immediately:
$ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 47/?
#   Failed test 'activeslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 227.

_primary.log contains:
2024-01-11 09:37:01.731 UTC [67656] 035_standby_logical_decoding.pl STATEMENT:  UPDATE prun SET s = 'D';
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  statement: SELECT pg_sleep(15);
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  xlog flush request 0/404D8F0; write 0/0; flush
0/0 at character 8
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl CONTEXT:  writing block 14 of relation base/16384/1247
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl STATEMENT:  SELECT pg_sleep(15);
2024-01-11 09:37:01.905 UTC [67204] LOG:  xlog flush request 0/404DA58; write 0/404BB00; flush 0/404BB00
2024-01-11 09:37:01.905 UTC [67204] CONTEXT:  writing block 4 of relation base/16384/2673
2024-01-11 09:37:12.514 UTC [67204] LOG:  INSERT @ 0/4057768:  - Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2024-01-11 09:37:12.514 UTC [67206] LOG:  xlog bg flush request write 0/4057768; flush: 0/4057768, current is write
0/4057730; flush 0/4057730
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  statement: UPDATE prun SET s = 'E';
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  INSERT @ 0/40577A8:  - Heap2/PRUNE:
snapshotConflictHorizon: 768,...

Note RUNNING_XACTS here...

_standby.log contains:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
2024-01-11 09:37:16.842 UTC [67606] CONTEXT:  WAL redo at 0/4057768 for Heap2/PRUNE: snapshotConflictHorizon: 768, ...
and no 'invalidating obsolete replication slot "pruning_activeslot"' below.

Debug logging added (see attached) gives more information:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| RS_INVAL_HORIZON, s: 0x7f7985475c10,
s->effective_xmin: 0, s->effective_catalog_xmin: 769, snapshotConflictHorizon: 768
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| conflict: 0

so the condition TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
    snapshotConflictHorizon) is not satisfied, hence conflict = 0, and that breaks
the loop in InvalidatePossiblyObsoleteSlot().
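In simplified form, the check boils down to the following (a Python model for illustration; it ignores the 32-bit XID wraparound arithmetic that the real TransactionIdPrecedesOrEquals() handles):

```python
def horizon_conflicts(effective_catalog_xmin, snapshot_conflict_horizon):
    # Simplified TransactionIdPrecedesOrEquals(): the slot conflicts only
    # while its catalog_xmin is at or below the conflict horizon.
    return effective_catalog_xmin <= snapshot_conflict_horizon

# Values from the log above: the horizon is 768, but the slot's
# effective_catalog_xmin has already advanced to 769.
assert horizon_conflicts(752, 768) is True   # before the advance: conflict
assert horizon_conflicts(769, 768) is False  # after: conflict = 0, loop breaks
```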
Several lines above in the log we can see:
2024-01-11 09:37:12.514 UTC [67606] LOG:  REDO @ 0/4057730; LSN 0/4057768: prev 0/4057700; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid 768 oldestRunningXid 769
2024-01-11 09:37:12.540 UTC [67643] 035_standby_logical_decoding.pl LOG:  !!!LogicalConfirmReceivedLocation|
MyReplicationSlot: 0x7f7985475c10, MyReplicationSlot->effective_catalog_xmin: 769

and that's the first occurrence of xid 769 in the log.

The decoded stack trace for the LogicalConfirmReceivedLocation call is:
LogicalConfirmReceivedLocation at logical.c:1886:1
ProcessStandbyReplyMessage at walsender.c:2327:1
ProcessStandbyMessage at walsender.c:2188:1
ProcessRepliesIfAny at walsender.c:2121:5
WalSndWaitForWal at walsender.c:1735:7
logical_read_xlog_page at walsender.c:1068:13
ReadPageInternal at xlogreader.c:1062:12
XLogDecodeNextRecord at xlogreader.c:601:5
XLogReadAhead at xlogreader.c:976:5
XLogReadRecord at xlogreader.c:406:3
XLogSendLogical at walsender.c:3229:5
WalSndLoop at walsender.c:2658:7
StartLogicalReplication at walsender.c:1477:2
exec_replication_command at walsender.c:1985:6
PostgresMain at postgres.c:4649:10

Best regards,
Alexander

Attachments:

adhoc-035-failure-debugging.patch (text/x-patch, +12 -0)
035-pruning-failure-logs.tar.gz (application/gzip, +9 -5)
#9 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#8)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:

Hi Bertrand,

10.01.2024 19:32, Bertrand Drouvot wrote:

is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the active
process and then (later) we should call the other ReportSlotInvalidation()
call with the second parameter set to false (to emit the message that we don't
see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
call.

I've found a way to reproduce the issue without slowing down a machine or
running multiple tests in parallel. It's enough to add a delay that allows
BackgroundWriterMain() to execute LogStandbySnapshot():
@@ -692,2 +690,3 @@
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);

With this delay, I get the failure immediately:
$ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 47/?
#   Failed test 'activeslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 227.

Thanks a lot for the testing!

So I think we have 2 issues here:

1) The one you're mentioning above related to the on-access pruning test:

I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular test (
as I don't see any "easy" way to take care of this race condition). The active slot
invalidation is already well covered in the other tests anyway.

I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.

2) The fact that sometimes we're getting a termination message which is not followed
by an obsolete one (as discussed in [1]).

For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.

I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.
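The race can be modeled in a few lines (an illustrative Python sketch, not the actual C code; the point is that the conflict predicate is re-evaluated after the mutex was released):

```python
HORIZON = 768  # snapshotConflictHorizon from the failing run

slot = {"effective_catalog_xmin": 752}

def conflicts(s):
    # Stand-in for the TransactionIdPrecedesOrEquals() check performed in
    # InvalidatePossiblyObsoleteSlot().
    return s["effective_catalog_xmin"] <= HORIZON

# Step 1: conflict detected, "terminating process ..." is logged.
assert conflicts(slot)

# Mutex released; the walsender's LogicalConfirmReceivedLocation() advances
# the slot's catalog_xmin past the horizon (after decoding RUNNING_XACTS).
slot["effective_catalog_xmin"] = 769

# Step 2: the re-check finds no conflict, so the "invalidating obsolete
# replication slot" message is never emitted and the loop exits early.
assert not conflicts(slot)
```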

Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one)
with both patches applied and see how it goes? (Please note that v4 replaces the v3 that
you're already using in your tests.)

If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
in a dedicated hackers thread.

[1]: /messages/by-id/ZZ7GpII4bAYN+jT5@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch (text/x-diff, +59 -56)
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch (text/x-diff, +13 -7)
#10 Alexander Lakhin
exclusion@gmail.com
In reply to: Bertrand Drouvot (#9)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

11.01.2024 17:58, Bertrand Drouvot wrote:

So I think we have 2 issues here:

1) The one you're mentioning above related to the on-access pruning test:

I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular test (
as I don't see any "easy" way to take care of this race condition). The active slot
invalidation is already well covered in the other tests anyway.

I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.

2) The fact that sometimes we're getting a termination message which is not followed
by an obsolete one (as discussed in [1]).

For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.

I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.

Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one)
with both patches applied and see how it goes? (Please note that v4 replaces v3 that
you're already using in your tests).

If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
into a dedicated hackers thread.

[1]: /messages/by-id/ZZ7GpII4bAYN+jT5@ip-10-97-1-34.eu-west-3.compute.internal

Bertrand, I've relaunched tests in the same slowed down VM with both
patches applied (but with no other modifications) and got a failure
with pg_class, similar to what we had seen before:
9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
9       #   at t/035_standby_logical_decoding.pl line 230.

Please look at the logs attached (I see there Standby/RUNNING_XACTS near
'invalidating obsolete replication slot "row_removal_inactiveslot"').

Best regards,
Alexander

Attachments:

035-failure-vacuum-pg_class.tar.gzapplication/gzip; name=035-failure-vacuum-pg_class.tar.gzDownload
#11Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#10)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:

Bertrand, I've relaunched tests in the same slowed down VM with both
patches applied (but with no other modifications) and got a failure
with pg_class, similar to what we had seen before:
9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
9       #   at t/035_standby_logical_decoding.pl line 230.

Please look at the logs attached (I see there Standby/RUNNING_XACTS near
'invalidating obsolete replication slot "row_removal_inactiveslot"').

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictable records being inserted,
impacting the reliability of the tests. We cannot do that here,
obviously. That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found because we know that the invalidation is not going
to work anyway?
--
Michael

#12Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#11)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:

On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:

Bertrand, I've relaunched tests in the same slowed down VM with both
patches applied (but with no other modifications) and got a failure
with pg_class, similar to what we had seen before:
9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
9       #   at t/035_standby_logical_decoding.pl line 230.

Please look at the logs attached (I see there Standby/RUNNING_XACTS near
'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks!

For this one, the "good" news is that it looks like we no longer see the
"terminating" message without a following "obsolete" message (so the engine
behaves correctly).

There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
advanced and there is no conflict).

And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" the
catalog_xmin of the active slot.

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictable records being inserted,
impacting the reliability of the tests. We cannot do that here,
obviously. That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found because we know that the invalidation is not going
to work anyway?

I think it all depends on what the xl_running_xacts record contains (i.e., does it
"advance" the catalog_xmin or not in our case).

In our case it does advance it (should it occur) due to the "select txid_current()"
that is done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not produce
a Transaction/COMMIT WAL record, as opposed to txid_current()).

I think that it could be "enough" for our case here, and it's what v5 attached is
now doing.

Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patchtext/x-diff; charset=us-asciiDownload+59-56
#13Alexander Lakhin
exclusion@gmail.com
In reply to: Bertrand Drouvot (#12)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

12.01.2024 10:15, Bertrand Drouvot wrote:

For this one, the "good" news is that it looks like we no longer see the
"terminating" message without a following "obsolete" message (so the engine
behaves correctly).

There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
advanced and there is no conflict).

Yes, judging from all the failures that we see now, it looks like the
0001-Fix-race-condition...patch works as expected.

And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" the
catalog_xmin of the active slot.

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictable records being inserted,
impacting the reliability of the tests. We cannot do that here,
obviously. That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found because we know that the invalidation is not going
to work anyway?

I think it all depends on what the xl_running_xacts record contains (i.e., does it
"advance" the catalog_xmin or not in our case).

In our case it does advance it (should it occur) due to the "select txid_current()"
that is done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not produce
a Transaction/COMMIT WAL record, as opposed to txid_current()).

I think that it could be "enough" for our case here, and it's what v5 attached is
now doing.

Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Unfortunately, I've got the failure again (please see logs attached).
(_primary.log can confirm that I have used exactly v5 — I see no
txid_current() calls there...)

Best regards,
Alexander

Attachments:

035-failures-vacuum-pg_authid.tar.gzapplication/gzip; name=035-failures-vacuum-pg_authid.tar.gzDownload+5-2
#14Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#13)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote:

Hi,

12.01.2024 10:15, Bertrand Drouvot wrote:

For this one, the "good" news is that it looks like we no longer see the
"terminating" message without a following "obsolete" message (so the engine
behaves correctly).

There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
advanced and there is no conflict).

Yes, judging from all the failures that we see now, it looks like the
0001-Fix-race-condition...patch works as expected.

Yeah, thanks for confirming, I'll create a dedicated hackers thread for that one.

Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Unfortunately, I've got the failure again (please see logs attached).
(_primary.log can confirm that I have used exactly v5 — I see no
txid_current() calls there...)

Okay ;-( Thanks for the testing. Then I can think of:

1) Michael's proposal up-thread (i.e., tweak the test with retry logic, retrying
things if such a standby snapshot is found).

2) Don't report a test error for active slots in case its catalog_xmin advanced.

I'd vote for 2) as:

- this is a corner case and the vast majority of the animals don't report any
issues (means the active slot conflict detection is already well covered).

- even on the same animal it should be pretty rare for an active slot
conflict detection to go completely uncovered (and the "failing" one would
probably be moving over time).

- It may be possible that 1) ends up failing (as we'd need to put a limit on the
retry logic anyhow).

What do you think?

And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
fully correct, so I'll give it another look.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#14)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

On Fri, Jan 12, 2024 at 01:46:08PM +0000, Bertrand Drouvot wrote:

1) Michael's proposal up-thread (i.e., tweak the test with retry logic, retrying
things if such a standby snapshot is found).

2) Don't report a test error for active slots in case its catalog_xmin advanced.

I'd vote for 2) as:

- this is a corner case and the vast majority of the animals don't report any
issues (means the active slot conflict detection is already well covered).

- even on the same animal it should be pretty rare for an active slot
conflict detection to go completely uncovered (and the "failing" one would
probably be moving over time).

- It may be possible that 1) ends up failing (as we'd need to put a limit on the
retry logic anyhow).

What do you think?

And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
fully correct, so I'll give it another look.

The WAL records related to standby snapshots play a big part in
the randomness of the failures we are seeing. Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable. That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here; the limitation would be that the test is skipped for
$windows_os.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Another possibility would be to move these records elsewhere, outside
of the bgwriter, but we need such records at a good frequency for the
availability of read-only standbys. And surely we'd want an on/off
switch anyway to get a full control for test sequences.
--
Michael

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#15)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Michael Paquier <michael@paquier.xyz> writes:

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

regards, tom lane

#17Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#16)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

Don't think so. We don't care much about these records when it comes
to checking slot invalidation scenarios with a predictable XID
horizon, AFAIK.
--
Michael

#18Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#17)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi,

On Mon, Jan 15, 2024 at 01:11:26PM +0900, Michael Paquier wrote:

On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

Don't think so. We don't care much about these records when it comes
to checking slot invalidation scenarios with a predictable XID
horizon, AFAIK.

Yeah, we want to test slot invalidation behavior, so we need to ensure that such
an invalidation occurs in the first place (which is not the case if we get an
xl_running_xacts record in the middle).

OTOH I also see Tom's point: for example I think we'd not have discovered [1]
(outside of the field) with such a developer GUC in place.

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted (to ensure
they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test "predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing $windows_os) would
still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: /messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patchtext/x-diff; charset=us-asciiDownload+30-30
#19Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#15)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:

The WAL records related to standby snapshots play a big part in
the randomness of the failures we are seeing. Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable. That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here; the limitation would be that the test is skipped for
$windows_os.

I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such a functionality for using in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

I see that the test in question exists in REL_16_STABLE; does that mean a
new GUC would not help there?

Best regards,
Alexander

Attachments:

035-suspend-on-Windows.patchtext/x-patch; charset=UTF-8; name=035-suspend-on-Windows.patchDownload+30-0
#20Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#19)
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hello,

15.01.2024 12:00, Alexander Lakhin wrote:

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such a functionality for using in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.

After spending a few days on it, I've discovered two more issues:
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
/messages/by-id/b0102688-6d6c-c86a-db79-e0e91d245b1a@gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to a RUNNING_XACTS record that flew into the WAL at a lucky moment.)

So it becomes clear that the 035 test is not the only one which might
suffer from bgwriter's activity, and inventing a way to stop or control
the bgwriter is a different subject, out of scope for the current
issue.

15.01.2024 11:49, Bertrand Drouvot wrote:

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted (to ensure
they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test "predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing $windows_os) would
still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: /messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal

So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be the right move to commit the fix, and then think about
the rarer failures afterwards.

Though I have a couple of questions left regarding the fix, if you don't
mind:
1) The test has minor defects in the comments, which I noted before [1];
would you like to fix them in passing?

BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
in scenario 3 is a copy-paste error.
Also, there are two "Scenario 4" in this test.

2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
 autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1]: /messages/by-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010@gmail.com

Best regards,
Alexander

#21Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#20)
#22Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#21)
#23Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#23)
#25Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#24)
#26Alexander Lakhin
exclusion@gmail.com
In reply to: Bertrand Drouvot (#25)
#27Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alexander Lakhin (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#27)
#29Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#28)