Replace O_EXCL with O_TRUNC for creation of state.tmp in SaveSlotToPath
Hi,
We have encountered a few instances where logical replication errors out
during SaveSlotToPath() after creating the state.tmp file, but before it
was renamed (due to ENOSPC, for example). In these cases, since state.tmp
is not cleaned up and is created with the O_EXCL flag, further invocations
of SaveSlotToPath() for this slot will error out on OpenTransientFile()
with EEXIST, completely blocking slot metadata persistence. The only
explicit cleanup for state.tmp occurs during server startup as part of
RestoreSlotFromDisk().
It doesn't seem that this function relies on data written to state.tmp
previously, so O_EXCL is unnecessary. Attaching a patch that swaps O_EXCL
for O_TRUNC, ensuring a fresh state.tmp is available for writing.
Thanks,
Kevin
Attachments:
replslot_state_tmp_otrunc.patchapplication/octet-stream; name=replslot_state_tmp_otrunc.patchDownload
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index fd0fdb96d42..edd9f844f23 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -2332,7 +2332,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
sprintf(tmppath, "%s/state.tmp", dir);
sprintf(path, "%s/state", dir);
- fd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);
+ fd = OpenTransientFile(tmppath, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
if (fd < 0)
{
/*
On Tue, Sep 30, 2025 at 05:21:05PM +0530, Kevin K Biju wrote:
We have encountered a few instances where logical replication errors out
during SaveSlotToPath() after creating the state.tmp file, but before it
was renamed (due to ENOSPC, for example). In these cases, since state.tmp
is not cleaned up and is created with the O_EXCL flag, further invocations
of SaveSlotToPath() for this slot will error out on OpenTransientFile()
with EEXIST, completely blocking slot metadata persistence. The only
explicit cleanup for state.tmp occurs during server startup as part of
RestoreSlotFromDisk().
Ah, you are referring to the window between a CloseTransientFile()
completing and the rename().
It's not the first time this report pops up. I have found two
references, for the same error as yours, with one referring to a
discussion about O_EXCL vs O_TRUNC:
/messages/by-id/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru
/messages/by-id/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net
It doesn't seem that this function relies on data written to state.tmp
previously, so O_EXCL is unnecessary. Attaching a patch that swaps O_EXCL
for O_TRUNC, ensuring a fresh state.tmp is available for writing.
Using O_TRUNC has been discussed and discarded because O_EXCL is more
protective in this specific code path, see the argument here:
/messages/by-id/20191202161222.sazl2omhhk5pl3nl@alap3.anarazel.de
An alternative fix that we can do here instead is to unlink() the
temporary file when reaching on these error code paths, allowing
future accesses to work correctly. This was suggested as a second
solution, other than the O_TRUNC objected to. One thing is to make
sure that the unlinks are done while holding the lwlock for the IO in
progress. So, something like the attached should also solve your
problem.
Any thoughts or comments from the others? I'd like to backpatch that
all the way down, 6 years too late. But later is better than never,
right?
--
Michael
Attachments:
save-slot-failure.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index fd0fdb96d424..ac188bb2f771 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -2376,6 +2376,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
pgstat_report_wait_end();
CloseTransientFile(fd);
+ unlink(tmppath);
LWLockRelease(&slot->io_in_progress_lock);
/* if write didn't set errno, assume problem is no disk space */
@@ -2396,7 +2397,9 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
pgstat_report_wait_end();
CloseTransientFile(fd);
+ unlink(tmppath);
LWLockRelease(&slot->io_in_progress_lock);
+
errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
@@ -2410,7 +2413,9 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
{
int save_errno = errno;
+ unlink(tmppath);
LWLockRelease(&slot->io_in_progress_lock);
+
errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
@@ -2424,7 +2429,9 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
{
int save_errno = errno;
+ unlink(tmppath);
LWLockRelease(&slot->io_in_progress_lock);
+
errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
On Thu, Oct 09, 2025 at 05:02:12PM +0900, Michael Paquier wrote:
An alternative fix that we can do here instead is to unlink() the
temporary file when reaching on these error code paths, allowing
future accesses to work correctly. This was suggested as a second
solution, other than the O_TRUNC objected to. One thing is to make
sure that the unlinks are done while holding the lwlock for the IO in
progress. So, something like the attached should also solve your
problem.
I have been playing a bit more with that, and applied 912af1c7e9c9 to
do the unlink() calls down to v13. While on it, I have also played
with hard crashes timed while we are in the middle of SaveSlotToPath()
with state.tmp still around at restart (injection point wait just
before the rename(), for example), and double-checked that recovery is
able to do a correct cleanup job. I didn't fully recall this last
part, as it has been a couple of years since the last report.
There may be an argument for having an automated test, like:
- Physical slot creation.
- Use a bit the slot.
- Injection point wait before the rename() of SaveSlotToPath().
- Checkpoint, that would not be able to finish.
- Hard crash.
- Restart, check that the slot state is correct after recovery.
However, I am not sure that this would be worth the cycles spent on.
There are a lot more interesting scenarios for write failures in the
tree than this one.
--
Michael
Hi Michael,
Thanks for the context. I missed the earlier discussions about the same
issue. Using unlink in the error paths makes sense to me. There is an edge
case in my mind, in case unlink fails as well, and we end up in the same
condition; however, the chance of that occurring is sufficiently low.
It looks like this change was already committed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0adf424b490b701ae8eeeca3d9630773f18cd937.
Thanks for the quick response!
Kevin
On Fri, Oct 10, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
Show quoted text
On Thu, Oct 09, 2025 at 05:02:12PM +0900, Michael Paquier wrote:
An alternative fix that we can do here instead is to unlink() the
temporary file when reaching on these error code paths, allowing
future accesses to work correctly. This was suggested as a second
solution, other than the O_TRUNC objected to. One thing is to make
sure that the unlinks are done while holding the lwlock for the IO in
progress. So, something like the attached should also solve your
problem.I have been playing a bit more with that, and applied 912af1c7e9c9 to
do the unlink() calls down to v13. While on it, I have also played
with hard crashes timed while we are in the middle of SaveSlotToPath()
with state.tmp still around at restart (injection point wait just
before the rename(), for example), and double-checked that recovery is
able to do a correct cleanup job. I didn't fully recall this last
part, as it has been a couple of years since the last report.There may be an argument for having an automated test, like:
- Physical slot creation.
- Use a bit the slot.
- Injection point wait before the rename() of SaveSlotToPath().
- Checkpoint, that would not be able to finish.
- Hard crash.
- Restart, check that the slot state is correct after recovery.However, I am not sure that this would be worth the cycles spent on.
There are a lot more interesting scenarios for write failures in the
tree than this one.
--
Michael
On Wed, Oct 15, 2025 at 07:30:39PM +0530, Kevin K Biju wrote:
Thanks for the context. I missed the earlier discussions about the same
issue. Using unlink in the error paths makes sense to me. There is an edge
case in my mind, in case unlink fails as well, and we end up in the same
condition; however, the chance of that occurring is sufficiently low.
My suspicion is that an unlink() failure would link to much more
problems, including what could be crashes in critical sections. That
would mean a crash, where the state.tmp file would get cleaned up by
recovery. What we have now in the tree should be (I hope!) a good
balance.
That was 6 years ago, and it is very easy to miss, so no worries.
Even in my case, I've recalled the previous discussion after one night
of sleep, only because I've participated in it and because the
automated truncation you have suggested was itching me. Somewhat.
Thanks for the quick response!
No problem.
--
Michael