Re: pgsql: Prevent invalidation of newly synced replication slots.

Started by Robert Haas, about 1 month ago, 27 messages
#1Robert Haas
robertmhaas@gmail.com

On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:

Prevent invalidation of newly synced replication slots.

This commit has broken CI for me. On the "Windows - Server 2022, VS
2019 - Meson & ninja" build, the following shows up in
046_checkpoint_logical_slot_standby.log:

2026-01-27 13:44:44.421 GMT startup[5172] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied

I imagine this is going to break CI for everybody else too, as well as cfbot.

--
Robert Haas
EDB: http://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:

Prevent invalidation of newly synced replication slots.

This commit has broken CI for me.

Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
to believe a Windows-only problem, but at least hamerkop has run
since 851f664.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)

I wrote:

Robert Haas <robertmhaas@gmail.com> writes:

This commit has broken CI for me.

Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
to believe a Windows-only problem, but at least hamerkop has run
since 851f664.

D'oh: hamerkop doesn't run any TAP tests, let alone ones that require
--enable-injection-points. So that success proves nothing.

Our other Windows animals (drongo, fairywren, unicorn) seem to be
configured with -Dtap_tests=enabled, but nothing about injection
points, so they will also skip 046_checkpoint_logical_slot.
Seems like a bit of a blind spot in the buildfarm.

regards, tom lane

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)

On Tue, Jan 27, 2026 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:

Prevent invalidation of newly synced replication slots.

This commit has broken CI for me.

Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
to believe a Windows-only problem, but at least hamerkop has run
since 851f664.

I don't understand it, either. There's a bunch of error codes that we
map to EACCES in _dosmaperr, but I don't know why any of those
problems would have occurred here:

ERROR_ACCESS_DENIED, EACCES
ERROR_CURRENT_DIRECTORY, EACCES
ERROR_LOCK_VIOLATION, EACCES
ERROR_SHARING_VIOLATION, EACCES
ERROR_NETWORK_ACCESS_DENIED, EACCES
ERROR_CANNOT_MAKE, EACCES
ERROR_FAIL_I24, EACCES
ERROR_DRIVE_LOCKED, EACCES
ERROR_SEEK_ON_DEVICE, EACCES
ERROR_NOT_LOCKED, EACCES
ERROR_LOCK_FAILED, EACCES

(Side note: Wouldn't it make a lot of sense to go back and kill
_dosmaperr in favor of displaying the actual Windows error code string?)

What's also puzzling is that what this test is doing seems to be
totally standard. 040_standby_failover_slots_sync.pl does this:

my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
$standby1->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);

And 046_checkpoint_logical_slot.pl does this:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);

So why is 046 failing and 040 is fine? I have no idea.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)

Robert Haas <robertmhaas@gmail.com> writes:

What's also puzzling is that what this test is doing seems to be
totally standard.

Yeah. I do notice something interesting when running it here:
046_checkpoint_logical_slot_mike.log shows that we are triggering
quite a few checkpoints (via pg_switch_wal()) in quick succession
on the primary. I wonder if that is somehow tickling a Windows
filesystem restriction.

regards, tom lane

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#1)

On Wed, Jan 28, 2026 at 3:59 AM Robert Haas <robertmhaas@gmail.com> wrote:

I imagine this is going to break CI for everybody else too, as well as cfbot.

Just by the way, on that last point, we trained cfbot to watch out for
CI pass/fail in this account:

https://github.com/postgres/postgres/commits/master/

and then use the most recent pass as the base commit when applying
patches to make test branches. So if master is broken for a while, it
no longer takes all the cfbot runs with it. Mentioning just in case
anyone is confused by that...

As for what's happening... hmm, there are a few holes in the "shared
locking" stuff you get with the flags we use. For example you can't
unlink a directory that contains a file that has been unlinked but
someone still holds open. Doesn't seem to be the case here. But I
wonder if you can't rename("old", "new") where "new" is a file that
has already been unlinked (or renamed over) that someone still holds
open, or something like that...

#7Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)

Hi,

On 2026-01-27 10:51:58 -0500, Robert Haas wrote:

I don't understand it, either. There's a bunch of error codes that we
map to EACCES in _dosmaperr, but I don't know why any of those
problems would have occurred here:

ERROR_ACCESS_DENIED, EACCES
ERROR_CURRENT_DIRECTORY, EACCES
ERROR_LOCK_VIOLATION, EACCES
ERROR_SHARING_VIOLATION, EACCES
ERROR_NETWORK_ACCESS_DENIED, EACCES
ERROR_CANNOT_MAKE, EACCES
ERROR_FAIL_I24, EACCES
ERROR_DRIVE_LOCKED, EACCES
ERROR_SEEK_ON_DEVICE, EACCES
ERROR_NOT_LOCKED, EACCES
ERROR_LOCK_FAILED, EACCES

(Side note: Wouldn't it make a lot of sense to go back and kill
_dosmaperr in favor of displaying the actual Windows error code string?)

It'd be great to somehow preserve the original error message, but I
don't really see how we could just give up on our mapping. We rely on
e.g. knowing that a read failed due to ENOENT, not ERROR_FILE_NOT_FOUND
or whatnot.

What's also puzzling is that what this test is doing seems to be
totally standard. 040_standby_failover_slots_sync.pl does this:

my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
$standby1->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);

And 046_checkpoint_logical_slot.pl does this:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);

So why is 046 failing and 040 is fine? I have no idea.

046 does a fair bit of stuff before the base backup is taken, I guess?
But what that concretely could be, I have no idea.

It'd be one thing if it failed while creating the base backup, but the
fact that the base backup is created successfully and the failure then
happens during startup is just plain odd. The typical sharing-violation
issue seems like it'd require that we are somehow not waiting for
pg_basebackup to actually have terminated?

Greetings,

Andres Freund

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)

On Tue, Jan 27, 2026 at 11:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

What's also puzzling is that what this test is doing seems to be
totally standard.

Yeah. I do notice something interesting when running it here:
046_checkpoint_logical_slot_mike.log shows that we are triggering
quite a few checkpoints (via pg_switch_wal()) in quick succession
on the primary. I wonder if that is somehow tickling a Windows
filesystem restriction.

Maybe, but it seems unlikely to me that this would mess up the
standby, since it's a totally different node. What I kind of wonder is
if somehow there's still a process that has backup_label open, or has
closed it but not recently enough for Windows to unlock it. However, I
don't see why that would affect this test case and not others.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#6)

Hi,

On 2026-01-28 05:16:13 +1300, Thomas Munro wrote:

On Wed, Jan 28, 2026 at 3:59 AM Robert Haas <robertmhaas@gmail.com> wrote:

I imagine this is going to break CI for everybody else too, as well as cfbot.

Just by the way, on that last point, we trained cfbot to watch out for
CI pass/fail in this account:

https://github.com/postgres/postgres/commits/master/

and then use the most recent pass as the base commit when applying
patches to make test branches. So if master is broken for a while, it
no longer takes all the cfbot runs with it. Mentioning just in case
anyone is confused by that...

Ah. I was indeed confused by that for a bit.

But I wonder if you can't rename("old", "new") where "new" is a file that
has already been unlinked (or renamed over) that someone still holds open,
or something like that...

I don't see a source of that that would be specific to this test though :(. We
do wait for pg_basebackup to have shut down, which wrote backup_label
(which was "manufactured" during streaming by basebackup.c).

Perhaps we should crank up log level in the test? No idea if it'll help, but
right now I don't even know where to start looking.

Greetings,

Andres Freund

#10Greg Burd
greg@burd.me
In reply to: Tom Lane (#3)

On Jan 27, 2026, at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Robert Haas <robertmhaas@gmail.com> writes:

This commit has broken CI for me.

Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
to believe a Windows-only problem, but at least hamerkop has run
since 851f664.

D'oh: hamerkop doesn't run any TAP tests, let alone ones that require
--enable-injection-points. So that success proves nothing.

Our other Windows animals (drongo, fairywren, unicorn) seem to be
configured with -Dtap_tests=enabled, but nothing about injection
points, so they will also skip 046_checkpoint_logical_slot.
Seems like a bit of a blind spot in the buildfarm.

regards, tom lane

I'll see if I can update unicorn today to enable injection points to add some coverage on Win11/ARM64/MSVC. No promises that will be diagnostic at all, but it seems like a good idea.

-Dinjection_points=true
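For reference, a sketch of the corresponding Meson invocation, combining this with the -Dtap_tests=enabled option Tom mentioned (the build directory name is arbitrary):

```shell
# Enable TAP tests and injection points so tests like
# 046_checkpoint_logical_slot are not skipped.
meson setup build -Dtap_tests=enabled -Dinjection_points=true
```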

-greg

#11Robert Haas
robertmhaas@gmail.com
In reply to: Greg Burd (#10)

On Tue, Jan 27, 2026 at 11:53 AM Greg Burd <greg@burd.me> wrote:

I'll see if I can update unicorn today to enable injection points to add some coverage on Win11/ARM64/MSVC. No promises that will be diagnostic at all, but it seems like a good idea.
-Dinjection_points=true

Sounds good!

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)

On Tue, Jan 27, 2026 at 11:37 AM Andres Freund <andres@anarazel.de> wrote:

But I wonder if you can't rename("old", "new") where "new" is a file that
has already been unlinked (or renamed over) that someone still holds open,
or something like that...

I don't see a source of that that would be specific to this test though :(. We
do wait for pg_basebackup to have shut down, which wrote backup_label
(which was "manufactured" during streaming by basebackup.c).

Perhaps we should crank up log level in the test? No idea if it'll help, but
right now I don't even know where to start looking.

I tried sticking a pg_sleep(30) in just before starting the standby
node, and that didn't help, so it doesn't seem like it's a race
condition.

Here's what the standby log file looks like with log_min_messages=DEBUG2:

2026-01-27 17:19:25.262 GMT postmaster[4932] DEBUG: registering
background worker "logical replication launcher"
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG: dynamic shared
memory system will support 229 segments
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG: created dynamic
shared memory control segment 3769552926 (9176 bytes)
2026-01-27 17:19:25.266 GMT postmaster[4932] DEBUG: max_safe_fds =
990, usable_fds = 1000, already_open = 3
2026-01-27 17:19:25.268 GMT postmaster[4932] LOG: starting PostgreSQL
19devel on x86_64-windows, compiled by msvc-19.29.30159, 64-bit
2026-01-27 17:19:25.271 GMT postmaster[4932] LOG: listening on Unix
socket "C:/Windows/TEMP/3xesO1s4ba/.s.PGSQL.17575"
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG: updating PMState
from PM_INIT to PM_STARTUP
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG: assigned pm child
slot 57 for io worker
2026-01-27 17:19:25.275 GMT postmaster[4932] DEBUG: assigned pm child
slot 58 for io worker
2026-01-27 17:19:25.277 GMT postmaster[4932] DEBUG: assigned pm child
slot 59 for io worker
2026-01-27 17:19:25.278 GMT postmaster[4932] DEBUG: assigned pm child
slot 56 for checkpointer
2026-01-27 17:19:25.280 GMT postmaster[4932] DEBUG: assigned pm child
slot 55 for background writer
2026-01-27 17:19:25.281 GMT postmaster[4932] DEBUG: assigned pm child
slot 89 for startup
2026-01-27 17:19:25.308 GMT checkpointer[6560] DEBUG: checkpointer
updated shared memory configuration values
2026-01-27 17:19:25.314 GMT startup[2488] LOG: database system was
interrupted; last known up at 2026-01-27 17:19:21 GMT
2026-01-27 17:19:25.317 GMT startup[2488] DEBUG: removing all
temporary WAL segments
The system cannot find the file specified.
2026-01-27 17:19:25.336 GMT startup[2488] DEBUG: could not restore
file "00000002.history" from archive: child process exited with exit
code 1
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup time
2026-01-27 17:19:21 GMT in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup label
pg_basebackup base backup in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup timeline 1 in
file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] LOG: starting backup
recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
timeline ID 1
The system cannot find the file specified.
2026-01-27 17:19:25.352 GMT startup[2488] DEBUG: could not restore
file "00000001000000000000002A" from archive: child process exited
with exit code 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: checkpoint record is
at 0/2A000080
2026-01-27 17:19:25.353 GMT startup[2488] LOG: entering standby mode
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: redo record is at
0/2A000028; shutdown false
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: next transaction ID:
769; next OID: 24576
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: next MultiXactId: 1;
next MultiXactOffset: 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: oldest unfrozen
transaction ID: 760, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: oldest MultiXactId:
1, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: transaction ID wrap
limit is 2147484407, limited by database with OID 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: MultiXactId wrap
limit is 2147483648, limited by database with OID 1
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: starting up replication slots
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: xmin required by
slots: data 0, catalog 0
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: starting up
replication origin progress state
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: didn't need to
unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2026-01-27 17:19:38.938 GMT startup[2488] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG: releasing pm child slot 89
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG: startup process
(PID 2488) exited with exit code 1
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG: aborting startup
due to startup process failure
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG: cleaning up
dynamic shared memory control segment with ID 3769552926
2026-01-27 17:19:38.985 GMT postmaster[4932] LOG: database system is shut down

Unfortunately, I don't see any clues there. The "The system cannot
find the file specified." messages look like they might be a clue, but
I think they are not, because they also occur in
040_standby_failover_slots_sync_standby1.log, and that test passes. At
the point where this log file shows the FATAL error, that log file
continues thus:

2026-01-27 17:18:36.905 GMT startup[1420] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG: initializing for hot standby
2026-01-27 17:18:36.906 GMT startup[1420] LOG: redo starts at 0/02000028
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG: recovery snapshots
are now enabled
2026-01-27 17:18:36.906 GMT startup[1420] CONTEXT: WAL redo at
0/02000048 for Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG: end of backup record reached
2026-01-27 17:18:36.907 GMT startup[1420] CONTEXT: WAL redo at
0/02000100 for XLOG/BACKUP_END: 0/02000028
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG: end of backup reached

Which again seems totally normal.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#12)

On Tue, Jan 27, 2026 at 12:42 PM Robert Haas <robertmhaas@gmail.com> wrote:

2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: didn't need to
unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2026-01-27 17:19:38.938 GMT startup[2488] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied

Andrey Borodin pointed out to me off-list that there's a retry loop in
pgrename(). The 13 second delay between the above two log messages
almost certainly means that retry loop is iterating until it hits its
10 second timeout. This almost certainly means that the underlying
Windows error is ERROR_ACCESS_DENIED, ERROR_SHARING_VIOLATION, or
ERROR_LOCK_VIOLATION, and that somebody else has the file open. But
nothing other than Perl touches that directory before we try to start
the standby:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);
$standby->append_conf(
    'postgresql.conf', qq(
hot_standby_feedback = on
primary_slot_name = 'phys_slot'
primary_conninfo = '$connstr_1 dbname=postgres'
log_min_messages = 'debug2'
));
$standby->start;

As far as I can see, only init_from_backup() touches the backup_label
file, and that just copies the directory using RecursiveCopy.pm, which
as far as I can tell is quite careful about closing file handles. So I
still have no idea what's happening here.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#12)

Hi,

On 2026-01-27 12:42:51 -0500, Robert Haas wrote:

I tried sticking a pg_sleep(30) in just before starting the standby
node, and that didn't help, so it doesn't seem like it's a race
condition.

Interesting.

It could be worth trying to run the test in isolation, without all the other
concurrent tests.

Greg, have you tried to repro it interactively?

Bryan, you seem to have become the resident windows expert...

2026-01-27 17:19:25.337 GMT startup[2488] LOG: starting backup
recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
timeline ID 1
The system cannot find the file specified.
2026-01-27 17:19:25.352 GMT startup[2488] DEBUG: could not restore
file "00000001000000000000002A" from archive: child process exited
with exit code 1

I think that must be a message from "copy" (which we seem to be using
for restore_command on Windows).

I don't know why the standby is created with has_restoring => 1. But it
shouldn't be related to the issue, I think?

Greetings,

Andres Freund

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#1)

On Tue, Jan 27, 2026 at 8:29 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:

Prevent invalidation of newly synced replication slots.

This commit has broken CI for me. On the "Windows - Server 2022, VS
2019 - Meson & ninja" build, the following shows up in
046_checkpoint_logical_slot_standby.log:

2026-01-27 13:44:44.421 GMT startup[5172] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied

I imagine this is going to break CI for everybody else too, as well as cfbot.

I'll try to reproduce and look into it.

--
With Regards,
Amit Kapila.

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#13)

On Tue, Jan 27, 2026 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 27, 2026 at 12:42 PM Robert Haas <robertmhaas@gmail.com> wrote:

2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: didn't need to
unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2026-01-27 17:19:38.938 GMT startup[2488] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied

Andrey Borodin pointed out to me off-list that there's a retry loop in
pgrename(). The 13 second delay between the above two log messages
almost certainly means that retry loop is iterating until it hits its
10 second timeout.

Yes, this is correct. I am able to reproduce it. In pgrename(), we use
the MoveFileEx() Windows API, which fails with error code 32
(ERROR_SHARING_VIOLATION); _dosmaperr() then maps that to errno 13
(EACCES) via the ERROR_SHARING_VIOLATION -> EACCES entry in the
doserrors table.

This almost certainly means that the underlying

Windows error is ERROR_ACCESS_DENIED, ERROR_SHARING_VIOLATION, or
ERROR_LOCK_VIOLATION, and that somebody else has the file open.

It is ERROR_SHARING_VIOLATION.

But

nothing other than Perl touches that directory before we try to start
the standby:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);
$standby->append_conf(
    'postgresql.conf', qq(
hot_standby_feedback = on
primary_slot_name = 'phys_slot'
primary_conninfo = '$connstr_1 dbname=postgres'
log_min_messages = 'debug2'
));
$standby->start;

As far as I can see, only init_from_backup() touches the backup_label
file, and that just copies the directory using RecursiveCopy.pm, which
as far as I can tell is quite careful about closing file handles. So I
still have no idea what's happening here.

It is not clear to me either why a similar test,
040_standby_failover_slots_sync, succeeds while
046_checkpoint_logical_slot fails. I am still thinking about it, but
thought of sharing the information I could gather by debugging.

Do let me know if you can think of any other information worth
gathering that could help here.

--
With Regards,
Amit Kapila.

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)

On Tue, Jan 27, 2026 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:

On 2026-01-28 05:16:13 +1300, Thomas Munro wrote:

But I wonder if you can't rename("old", "new") where "new" is a file that
has already been unlinked (or renamed over) that someone still holds open,
or something like that...

I don't see a source of that that would be specific to this test though :(. We
do wait for pg_basebackup to have shut down, which wrote backup.label (which
was "manifactured" during streaming by basebackup.c).

I have no specific ideas, but just in case it's helpful for this
discussion, I looked at my old test suite[1] where I tried to
catalogue all the edge conditions around this sort of stuff
empirically, and saw that rename() always fails like that if the file
is open (that is, it doesn't require a more complicated sequence with
an earlier unlink/rename of the new name):

+ /*
+ * Windows can't rename over an open non-unlinked file, even with
+ * have_posix_unlink_semantics.
+ */
+ pgwin32_dirmod_loops = 2; /* minimize looping to fail fast in testing */
+ PG_EXPECT_SYS(rename(path, path2) == -1,
+  "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+ PG_EXPECT_EQ(errno, EACCES);

[1]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

#18Andrey Borodin
amborodin@acm.org
In reply to: Amit Kapila (#16)

On 28 Jan 2026, at 10:47, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do let me know if you could think of gathering any other information
which can be of help here.

Interestingly, increasing the timeout in pgrename() to 500 seconds fixes "Windows - Server 2022, VS 2019 - Meson & ninja", but does not fix "Windows - Server 2022, VS 2019 - Meson & ninja".

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 467b50d6f09..da38e37aa45 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -88,7 +88,7 @@ pgrename(const char *from, const char *to)
                        return -1;
 #endif
-               if (++loops > 100)              /* time out after 10 sec */
+               if (++loops > 5000)             /* time out after 10 sec */
                        return -1;
                pg_usleep(100000);              /* us */
        }

Best regards, Andrey Borodin.

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#16)

On Wed, Jan 28, 2026 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

It is not clear to me either why the similar test like
040_standby_failover_slots_sync is successful and
046_checkpoint_logical_slot is failing. I am still thinking about it
but thought of sharing the information I could gather by debugging.

It seems there is some interaction with the previous test in the same
file that is causing this failure, since we reuse the primary node from
the previous test. When I commented out get_changes and its
corresponding injection point in the previous test (as attached), the
entire test passed. I think this test would pass with a freshly created
primary node, but I wanted to spend some more time understanding how/why
the previous test causes this issue.

--
With Regards,
Amit Kapila.

Attachments:

comment_logical_slot_1.patch (application/octet-stream, +15/-15)
#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#14)

On Tue, Jan 27, 2026 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:

I don't know why the standby is created with has_restoring => 1.

This is not required. I think it is a copy-paste oversight.

But it
shouldn't be related to the issue, I think?

Yeah, I tried without this as well, among other experiments.

--
With Regards,
Amit Kapila.

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#21)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#21)
#25Greg Burd
greg@burd.me
In reply to: Andres Freund (#14)
#26Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#24)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#24)