BUGFIX: standby disconnect can corrupt serialized reorder buffers

Started by Craig Ringerabout 8 years ago15 messages

craig@2ndquadrant.com

about 8 years ago

1 attachment(s)

Hi all

TL;DR: we should delete pg_replslot/$SLOTNAME/* at the start of each
decoding session or we can accidentally re-use stale reorder buffer
contents from prior runs and $BADNESS happens. StartupReorderBuffer() is
not sufficient.

Details:

Petr and I have found a bug in logical decoding where changes get appended
multiple times to serialized reorder buffers. This causes duplicate changes
sent to downstream (conflicts, ERRORs, etc).

Symptoms are:

* Oversized serialized reorder buffers in pg_replslot. In the case we found
this problem in, there was a 147GB reorder buffer that should only have
been 12GB.

* Downstream/receiver working hard but achieving nothing (pglogical/bdr
with conflict resolution enabled), or failing with unique violations and
other errors (built-in logical replication)

Debugging showed that logical decoding was calling the output plugin many
times with the same set of ReorderBufferChange records. It'd advance
normally, then go back to the start of a page address or similar and go
through them all all over again.

Log analysis showed that the downstream had been repeatedly connecting to
the upstream, then ERRORing out, so the upstream logs were full of

LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on standby connection

When the downstream error was fixed, the repeated changes were seen.

The cause appears to be that walsender.c's ProcessRepliesIfAny writes a LOG
for unexpected EOF then calls proc_exit(0). But serialized txn cleanup is
done by
ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which is
invoked from the PG_CATCH() in ReorderBufferCommit() and on various normal
exits. It won't get called if we proc_exit() without an ERROR, so we leave
stale data lying around.

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear to
have any guard to ensure that the segment files don't already exist when it
goes to serialize a snapshot. Adding one there would probably be expensive;
we don't know the last lsn of the txn yet, so to be really safe we'd have
to do a directory listing and scan for any txn-$OURXID-* entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to add
new global state to reorderbuffer.c so I've left that for now.

BTW, while this bug looks similar to
/messages/by-id/54e4e488-186b-a056-6628-50628e4e4ebc@lab.ntt.co.jp
"Failed to delete old ReorderBuffer spilled files" it's really quite a
separate issue.

Both this bugfix and the above need backpatching to 9.4.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Clean-up-reorder-buffer-files-when-starting-logic.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Clean-up-reorder-buffer-files-when-starting-logic.patchDownload

From 0a9b98a857e8de02394dc8aefe365a72e50a222e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 5 Dec 2017 13:32:45 +0800
Subject: [PATCH v1] Clean up reorder buffer files when starting logical
 decoding

We could fail to clean up reorder buffer files if the walsender exited due to a
client disconnect, because we'd skip both the normal exit and error paths.
---
 src/backend/replication/logical/reorderbuffer.c | 84 +++++++++++++++----------
 1 file changed, 50 insertions(+), 34 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fa95bab..bc562e2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -196,6 +196,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCleanSerializedTXNs(const char *slotname);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -214,7 +215,8 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 
 
 /*
- * Allocate a new ReorderBuffer
+ * Allocate a new ReorderBuffer and clean out any old serialized state from
+ * prior ReorderBuffer instances for the same slot.
  */
 ReorderBuffer *
 ReorderBufferAllocate(void)
@@ -223,6 +225,8 @@ ReorderBufferAllocate(void)
 	HASHCTL		hash_ctl;
 	MemoryContext new_ctx;
 
+	Assert(MyReplicationSlot != NULL);
+
 	/* allocate memory in own context, to have better accountability */
 	new_ctx = AllocSetContextCreate(CurrentMemoryContext,
 									"ReorderBuffer",
@@ -266,6 +270,9 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 
+	/* Ensure there's no stale data from prior uses of this slot */
+	ReorderBufferCleanSerializedTXNs(NameStr(MyReplicationSlot->data.name));
+
 	return buffer;
 }
 
@@ -2551,6 +2558,47 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Remove any leftover serialized reorder buffers from a slot directory after a
+ * prior crash or decoding session exit.
+ */
+static void
+ReorderBufferCleanSerializedTXNs(const char *slotname)
+{
+	DIR		   *spill_dir;
+	struct dirent *spill_de;
+	struct stat statbuf;
+	char		path[MAXPGPATH * 2 + 12];
+
+	sprintf(path, "pg_replslot/%s", slotname);
+
+	/* we're only handling directories here, skip if it's not our's */
+	if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+		return;
+
+	spill_dir = AllocateDir(path);
+	while ((spill_de = ReadDir(spill_dir, path)) != NULL)
+	{
+		if (strcmp(spill_de->d_name, ".") == 0 ||
+			strcmp(spill_de->d_name, "..") == 0)
+			continue;
+
+		/* only look at names that can be ours */
+		if (strncmp(spill_de->d_name, "xid", 3) == 0)
+		{
+			sprintf(path, "pg_replslot/%s/%s", slotname,
+					spill_de->d_name);
+
+			if (unlink(path) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not remove file \"%s\": %m",
+								path)));
+		}
+	}
+	FreeDir(spill_dir);
+}
+
+/*
  * Delete all data spilled to disk after we've restarted/crashed. It will be
  * recreated when the respective slots are reused.
  */
@@ -2560,15 +2608,9 @@ StartupReorderBuffer(void)
 	DIR		   *logical_dir;
 	struct dirent *logical_de;
 
-	DIR		   *spill_dir;
-	struct dirent *spill_de;
-
 	logical_dir = AllocateDir("pg_replslot");
 	while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
 	{
-		struct stat statbuf;
-		char		path[MAXPGPATH * 2 + 12];
-
 		if (strcmp(logical_de->d_name, ".") == 0 ||
 			strcmp(logical_de->d_name, "..") == 0)
 			continue;
@@ -2581,33 +2623,7 @@ StartupReorderBuffer(void)
 		 * ok, has to be a surviving logical slot, iterate and delete
 		 * everything starting with xid-*
 		 */
-		sprintf(path, "pg_replslot/%s", logical_de->d_name);
-
-		/* we're only creating directories here, skip if it's not our's */
-		if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
-			continue;
-
-		spill_dir = AllocateDir(path);
-		while ((spill_de = ReadDir(spill_dir, path)) != NULL)
-		{
-			if (strcmp(spill_de->d_name, ".") == 0 ||
-				strcmp(spill_de->d_name, "..") == 0)
-				continue;
-
-			/* only look at names that can be ours */
-			if (strncmp(spill_de->d_name, "xid", 3) == 0)
-			{
-				sprintf(path, "pg_replslot/%s/%s", logical_de->d_name,
-						spill_de->d_name);
-
-				if (unlink(path) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not remove file \"%s\": %m",
-									path)));
-			}
-		}
-		FreeDir(spill_dir);
+		ReorderBufferCleanSerializedTXNs(logical_de->d_name);
 	}
 	FreeDir(logical_dir);
 }
-- 
2.9.5

Petr Jelinek

petr.jelinek@2ndquadrant.com

about 8 years ago

In reply to: Craig Ringer (#1)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

Hi,

thanks for writing the patch.

On 05/12/17 06:58, Craig Ringer wrote:

Hi all

[...]

The cause appears to be that walsender.c's ProcessRepliesIfAny writes a

LOG for unexpected EOF then calls proc_exit(0). But serialized txn
cleanup is done by
ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which
is invoked from the PG_CATCH() in ReorderBufferCommit() and on various
normal exits. It won't get called if we proc_exit() without an ERROR, so
we leave stale data lying around.

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
to have any guard to ensure that the segment files don't already exist
when it goes to serialize a snapshot. Adding one there would probably be
expensive; we don't know the last lsn of the txn yet, so to be really
safe we'd have to do a directory listing and scan for any txn-$OURXID-*
entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to
add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we know
there about slot so no extra global variables are needed.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Petr Jelinek (#2)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

Hi,

thanks for writing the patch.

On 05/12/17 06:58, Craig Ringer wrote:

Hi all

[...]

The cause appears to be that walsender.c's ProcessRepliesIfAny writes a

LOG for unexpected EOF then calls proc_exit(0). But serialized txn
cleanup is done by
ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which
is invoked from the PG_CATCH() in ReorderBufferCommit() and on various
normal exits. It won't get called if we proc_exit() without an ERROR, so
we leave stale data lying around.

Thank you for the details. I could reproduce this bug. This bug also
happens if pq_flush_if_writable called by WalSndWriteData could not
write data (for example, the case where replicated data violate unique
constraint on the subscriber).

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
to have any guard to ensure that the segment files don't already exist
when it goes to serialize a snapshot. Adding one there would probably be
expensive; we don't know the last lsn of the txn yet, so to be really
safe we'd have to do a directory listing and scan for any txn-$OURXID-*
entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to
add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we know
there about slot so no extra global variables are needed.

I guess that ReplicationSlotRelease() currently might not get called
if walsender exits by proc_exit(). ReplicationSlotRelease() can is
called by some functions such as WalSndErrorCleanup(), but at least in
the case where wal sender exits due to failed to write data to socket,
ReplicationSlotRelease() didn't get called as far as I tested.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Petr Jelinek

petr.jelinek@2ndquadrant.com

about 8 years ago

In reply to: Masahiko Sawada (#3)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 26/12/17 11:13, Masahiko Sawada wrote:

On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
to have any guard to ensure that the segment files don't already exist
when it goes to serialize a snapshot. Adding one there would probably be
expensive; we don't know the last lsn of the txn yet, so to be really
safe we'd have to do a directory listing and scan for any txn-$OURXID-*
entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to
add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we know
there about slot so no extra global variables are needed.

I guess that ReplicationSlotRelease() currently might not get called
if walsender exits by proc_exit(). ReplicationSlotRelease() can is
called by some functions such as WalSndErrorCleanup(), but at least in
the case where wal sender exits due to failed to write data to socket,
ReplicationSlotRelease() didn't get called as far as I tested.

Are you sure about that testing? Did you test it with replication slot
active? ReplicationSlotRelease() is called from ProcKill() if the
process is using a slot and should be called for any kind of exit except
for outright crash (the kind that makes postgres kill all backends). If
it wasn't we'd never unlock the replication slot used by the exiting
walsender.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Petr Jelinek (#4)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 26/12/17 11:13, Masahiko Sawada wrote:

On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
to have any guard to ensure that the segment files don't already exist
when it goes to serialize a snapshot. Adding one there would probably be
expensive; we don't know the last lsn of the txn yet, so to be really
safe we'd have to do a directory listing and scan for any txn-$OURXID-*
entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to
add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we know
there about slot so no extra global variables are needed.

I guess that ReplicationSlotRelease() currently might not get called
if walsender exits by proc_exit(). ReplicationSlotRelease() can is
called by some functions such as WalSndErrorCleanup(), but at least in
the case where wal sender exits due to failed to write data to socket,
ReplicationSlotRelease() didn't get called as far as I tested.

Are you sure about that testing? Did you test it with replication slot
active? ReplicationSlotRelease() is called from ProcKill() if the
process is using a slot and should be called for any kind of exit except
for outright crash (the kind that makes postgres kill all backends). If
it wasn't we'd never unlock the replication slot used by the exiting
walsender.

Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Stephen Frost

sfrost@snowman.net

about 8 years ago

In reply to: Masahiko Sawada (#5)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

Greetings all,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:

On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 26/12/17 11:13, Masahiko Sawada wrote:

On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

It's not a problem on crash restart because StartupReorderBuffer already
does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
to have any guard to ensure that the segment files don't already exist
when it goes to serialize a snapshot. Adding one there would probably be
expensive; we don't know the last lsn of the txn yet, so to be really
safe we'd have to do a directory listing and scan for any txn-$OURXID-*
entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a new
decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have stale
buffers lying around until next decoding session, but I didn't want to
add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we know
there about slot so no extra global variables are needed.

I guess that ReplicationSlotRelease() currently might not get called
if walsender exits by proc_exit(). ReplicationSlotRelease() can is
called by some functions such as WalSndErrorCleanup(), but at least in
the case where wal sender exits due to failed to write data to socket,
ReplicationSlotRelease() didn't get called as far as I tested.

Are you sure about that testing? Did you test it with replication slot
active? ReplicationSlotRelease() is called from ProcKill() if the
process is using a slot and should be called for any kind of exit except
for outright crash (the kind that makes postgres kill all backends). If
it wasn't we'd never unlock the replication slot used by the exiting
walsender.

Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().

This patch is currently in 'Waiting on Author' state, but looks like it
should actually be in Needs Review (or perhaps Ready for Commmitter)..?

Craig, would you agree with that and, if so, please update the status
accordingly.

Thanks!

Stephen

Craig Ringer

craig@2ndquadrant.com

about 8 years ago

In reply to: Stephen Frost (#6)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 5 January 2018 at 12:16, Stephen Frost <sfrost@snowman.net> wrote:

Greetings all,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:

On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 26/12/17 11:13, Masahiko Sawada wrote:

On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

It's not a problem on crash restart because StartupReorderBuffer

already

does the required delete.

ReorderBufferSerializeTXN, which spills the txns to disk, doesn't

appear

to have any guard to ensure that the segment files don't already

exist

when it goes to serialize a snapshot. Adding one there would

probably be

expensive; we don't know the last lsn of the txn yet, so to be

really

safe we'd have to do a directory listing and scan for any

txn-$OURXID-*

entries.

So to fix, I suggest that we should do a
slot-specific StartupReorderBuffer-style deletion when we start a

new

decoding session on a slot, per attached patch.

It might be nice to also add a hook on proc exit, so we don't have

stale

buffers lying around until next decoding session, but I didn't want

to

add new global state to reorderbuffer.c so I've left that for now.

Hmm, can't we simply call the new cleanup function in
ReplicationSlotRelease()? That's called during process exit and we

know

there about slot so no extra global variables are needed.

I guess that ReplicationSlotRelease() currently might not get called
if walsender exits by proc_exit(). ReplicationSlotRelease() can is
called by some functions such as WalSndErrorCleanup(), but at least in
the case where wal sender exits due to failed to write data to socket,
ReplicationSlotRelease() didn't get called as far as I tested.

Are you sure about that testing? Did you test it with replication slot
active? ReplicationSlotRelease() is called from ProcKill() if the
process is using a slot and should be called for any kind of exit

except

for outright crash (the kind that makes postgres kill all backends). If
it wasn't we'd never unlock the replication slot used by the exiting
walsender.

Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().

This patch is currently in 'Waiting on Author' state, but looks like it
should actually be in Needs Review (or perhaps Ready for Commmitter)..?

Craig, would you agree with that and, if so, please update the status
accordingly.

WoA looks correct, Petr suggested a tweak to how and when the cleanup is
done that I need to adopt.

I'm just about out of time before a trip, but I'll get a colleague to
update it if I don't get the chance, so we can get it in and backpatched
for this CF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Alvaro Herrera

alvherre@alvh.no-ip.org

about 8 years ago

In reply to: Craig Ringer (#1)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

I think this should use ReadDirExtended with an elevel less than ERROR,
and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
comparison to "xid" prefix anyway. Looks like strcmp processing power waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read the
file name with sscanf(), since we know the precise format it has? Then
we don't need to bother with random crap left around. Maybe a good time
to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in your
replication slot dirs, you deserve a PANIC anyway, but I'm not sure.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Craig Ringer

craig@2ndquadrant.com

almost 8 years ago

In reply to: Alvaro Herrera (#8)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I think this should use ReadDirExtended with an elevel less than ERROR,
and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
comparison to "xid" prefix anyway. Looks like strcmp processing power
waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read the
file name with sscanf(), since we know the precise format it has? Then
we don't need to bother with random crap left around. Maybe a good time
to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in your
replication slot dirs, you deserve a PANIC anyway, but I'm not sure.

I'm happy to address those comments.

The PANIC probably made sense when it was only done on startup, but not now
it's at runtime.

The rest is mainly retained from the prior code, but it's a good chance to
make those changes.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

David Steele

david@pgmasters.net

almost 8 years ago

In reply to: Craig Ringer (#9)

Re: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

Hi Craig,

On 1/21/18 5:45 PM, Craig Ringer wrote:

On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
<mailto:alvherre@alvh.no-ip.org>> wrote:

I think this should use ReadDirExtended with an elevel less than ERROR,
and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
comparison to "xid" prefix anyway. Looks like strcmp processing
power waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read the
file name with sscanf(), since we know the precise format it has? Then
we don't need to bother with random crap left around. Maybe a good time
to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in your
replication slot dirs, you deserve a PANIC anyway, but I'm not sure.

I'm happy to address those comments.

The PANIC probably made sense when it was only done on startup, but not
now it's at runtime.

The rest is mainly retained from the prior code, but it's a good chance
to make those changes.

This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?

Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.

Thanks,
--
-David
david@pgmasters.net

#11

Craig Ringer

craig@2ndquadrant.com

almost 8 years ago

In reply to: David Steele (#10)

Re: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:

Hi Craig,

On 1/21/18 5:45 PM, Craig Ringer wrote:

On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
<mailto:alvherre@alvh.no-ip.org>> wrote:

I think this should use ReadDirExtended with an elevel less than

ERROR,

and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped by

the

comparison to "xid" prefix anyway. Looks like strcmp processing
power waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read the
file name with sscanf(), since we know the precise format it has?

Then

we don't need to bother with random crap left around. Maybe a good

time

to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in your
replication slot dirs, you deserve a PANIC anyway, but I'm not sure.

I'm happy to address those comments.

The PANIC probably made sense when it was only done on startup, but not
now it's at runtime.

The rest is mainly retained from the prior code, but it's a good chance
to make those changes.

This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?

Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.

Thanks for the reminder. I'll address it today if I can.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12

Craig Ringer

craig@2ndquadrant.com

almost 8 years ago

In reply to: Craig Ringer (#11)

1 attachment(s)

Re: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 6 March 2018 at 09:58, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:

Hi Craig,

On 1/21/18 5:45 PM, Craig Ringer wrote:

On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
<mailto:alvherre@alvh.no-ip.org>> wrote:

I think this should use ReadDirExtended with an elevel less than

ERROR,

and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped

by the

comparison to "xid" prefix anyway. Looks like strcmp processing
power waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read

the

file name with sscanf(), since we know the precise format it has?

Then

we don't need to bother with random crap left around. Maybe a good

time

to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in

your

replication slot dirs, you deserve a PANIC anyway, but I'm not sure.

I'm happy to address those comments.

The PANIC probably made sense when it was only done on startup, but not
now it's at runtime.

The rest is mainly retained from the prior code, but it's a good chance
to make those changes.

This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?

Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.

Thanks for the reminder. I'll address it today if I can.

Revised patch attached.

I have _not_ rewritten to use sscanf yet. I'll do that next, so you can
choose the fewer-changes patch for backpatching if desired.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Clean-up-reorder-buffer-files-when-starting-logic.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Clean-up-reorder-buffer-files-when-starting-logic.patchDownload

From 9125690b7216527a4a388de987de609a86224119 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 5 Dec 2017 13:32:45 +0800
Subject: [PATCH v2] Clean up reorder buffer files when starting logical
 decoding

We could fail to clean up reorder buffer files if the walsender exited due to a
client disconnect, because we'd skip both the normal exit and error paths.
---
 src/backend/replication/logical/reorderbuffer.c | 95 +++++++++++++++----------
 1 file changed, 58 insertions(+), 37 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c72a611a39..87522dd121 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -196,6 +196,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCleanSerializedTXNs(const char *slotname);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -212,9 +213,11 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 
+#define REORDERBUFFER_TXN_FILEPATH_FORMAT "pg_replslot/%s/xid-%u-lsn-%X-%X.snap"
 
 /*
- * Allocate a new ReorderBuffer
+ * Allocate a new ReorderBuffer and clean out any old serialized state from
+ * prior ReorderBuffer instances for the same slot.
  */
 ReorderBuffer *
 ReorderBufferAllocate(void)
@@ -223,6 +226,8 @@ ReorderBufferAllocate(void)
 	HASHCTL		hash_ctl;
 	MemoryContext new_ctx;
 
+	Assert(MyReplicationSlot != NULL);
+
 	/* allocate memory in own context, to have better accountability */
 	new_ctx = AllocSetContextCreate(CurrentMemoryContext,
 									"ReorderBuffer",
@@ -269,6 +274,13 @@ ReorderBufferAllocate(void)
 
 	dlist_init(&buffer->toplevel_by_lsn);
 
+	/*
+	 * Ensure there's no stale data from prior uses of this slot, in case some
+	 * prior exit avoided calling ReorderBufferFree. Failure to do this can
+	 * produce duplicated txns, and it's very cheap if there's nothing there.
+	 */
+	ReorderBufferCleanSerializedTXNs(NameStr(MyReplicationSlot->data.name));
+
 	return buffer;
 }
 
@@ -285,6 +297,9 @@ ReorderBufferFree(ReorderBuffer *rb)
 	 * memory context.
 	 */
 	MemoryContextDelete(context);
+
+	/* Free disk space used by unconsumed reorder buffers */
+	ReorderBufferCleanSerializedTXNs(NameStr(MyReplicationSlot->data.name));
 }
 
 /*
@@ -2070,7 +2085,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			 * No need to care about TLIs here, only used during a single run,
 			 * so each LSN only maps to a specific WAL record.
 			 */
-			sprintf(path, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
+			sprintf(path, REORDERBUFFER_TXN_FILEPATH_FORMAT,
 					NameStr(MyReplicationSlot->data.name), txn->xid,
 					(uint32) (recptr >> 32), (uint32) recptr);
 
@@ -2316,7 +2331,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * No need to care about TLIs here, only used during a single run,
 			 * so each LSN only maps to a specific WAL record.
 			 */
-			sprintf(path, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
+			sprintf(path, REORDERBUFFER_TXN_FILEPATH_FORMAT,
 					NameStr(MyReplicationSlot->data.name), txn->xid,
 					(uint32) (recptr >> 32), (uint32) recptr);
 
@@ -2558,7 +2573,7 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		XLogSegNoOffsetToRecPtr(cur, 0, recptr, wal_segment_size);
 
-		sprintf(path, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
+		sprintf(path, REORDERBUFFER_TXN_FILEPATH_FORMAT,
 				NameStr(MyReplicationSlot->data.name), txn->xid,
 				(uint32) (recptr >> 32), (uint32) recptr);
 		if (unlink(path) != 0 && errno != ENOENT)
@@ -2568,6 +2583,44 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+/*
+ * Remove any leftover serialized reorder buffers from a slot directory after a
+ * prior crash or decoding session exit.
+ */
+static void
+ReorderBufferCleanSerializedTXNs(const char *slotname)
+{
+	DIR		   *spill_dir;
+	struct dirent *spill_de;
+	struct stat statbuf;
+	char		path[MAXPGPATH * 2 + 12];
+
+	sprintf(path, "pg_replslot/%s", slotname);
+
+	/* we're only handling directories here, skip if it's not ours */
+	if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+		return;
+
+	spill_dir = AllocateDir(path);
+	while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
+	{
+		/* only look at names that can be ours */
+		if (strncmp(spill_de->d_name, "xid", 3) == 0)
+		{
+			snprintf(path, sizeof(path),
+					 "pg_replslot/%s/%s", slotname,
+					 spill_de->d_name);
+
+			if (unlink(path) != 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/*.xid: %m",
+								path, slotname)));
+		}
+	}
+	FreeDir(spill_dir);
+}
+
 /*
  * Delete all data spilled to disk after we've restarted/crashed. It will be
  * recreated when the respective slots are reused.
@@ -2578,15 +2631,9 @@ StartupReorderBuffer(void)
 	DIR		   *logical_dir;
 	struct dirent *logical_de;
 
-	DIR		   *spill_dir;
-	struct dirent *spill_de;
-
 	logical_dir = AllocateDir("pg_replslot");
 	while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
 	{
-		struct stat statbuf;
-		char		path[MAXPGPATH * 2 + 12];
-
 		if (strcmp(logical_de->d_name, ".") == 0 ||
 			strcmp(logical_de->d_name, "..") == 0)
 			continue;
@@ -2599,33 +2646,7 @@ StartupReorderBuffer(void)
 		 * ok, has to be a surviving logical slot, iterate and delete
 		 * everything starting with xid-*
 		 */
-		sprintf(path, "pg_replslot/%s", logical_de->d_name);
-
-		/* we're only creating directories here, skip if it's not our's */
-		if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
-			continue;
-
-		spill_dir = AllocateDir(path);
-		while ((spill_de = ReadDir(spill_dir, path)) != NULL)
-		{
-			if (strcmp(spill_de->d_name, ".") == 0 ||
-				strcmp(spill_de->d_name, "..") == 0)
-				continue;
-
-			/* only look at names that can be ours */
-			if (strncmp(spill_de->d_name, "xid", 3) == 0)
-			{
-				sprintf(path, "pg_replslot/%s/%s", logical_de->d_name,
-						spill_de->d_name);
-
-				if (unlink(path) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not remove file \"%s\": %m",
-									path)));
-			}
-		}
-		FreeDir(spill_dir);
+		ReorderBufferCleanSerializedTXNs(logical_de->d_name);
 	}
 	FreeDir(logical_dir);
 }
-- 
2.14.3

#13

Craig Ringer

craig@2ndquadrant.com

almost 8 years ago

In reply to: Craig Ringer (#12)

Re: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 6 March 2018 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote:

On 6 March 2018 at 09:58, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:

Hi Craig,

On 1/21/18 5:45 PM, Craig Ringer wrote:

On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
<mailto:alvherre@alvh.no-ip.org>> wrote:

I think this should use ReadDirExtended with an elevel less than

ERROR,

and do nothing.

Why have strcmp(.) and strcmp(..)? These are going to be skipped

by the

comparison to "xid" prefix anyway. Looks like strcmp processing
power waste.

Please don't use bare sprintf() -- upgrade to snprintf.

With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read

the

file name with sscanf(), since we know the precise format it has?

Then

we don't need to bother with random crap left around. Maybe a

good time

to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess

the

rationale is that if you let random people put "xidfoo" files in

your

replication slot dirs, you deserve a PANIC anyway, but I'm not

sure.

I'm happy to address those comments.

The PANIC probably made sense when it was only done on startup, but not
now it's at runtime.

The rest is mainly retained from the prior code, but it's a good chance
to make those changes.

This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?

Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.

Thanks for the reminder. I'll address it today if I can.

Revised patch attached.

I have _not_ rewritten to use sscanf yet. I'll do that next, so you can
choose the fewer-changes patch for backpatching if desired.

... and I'm not convinced it's really an improvement.

uint32 xid, lsn_high, lsn_low;

if (sscanf(spill_de->d_name, REORDERBUFFER_TXN_FILENAME_FORMAT,
xid, lsn_high, lsn_low) == 3)
{

since we don't use the scanned-out information.

I guess my answer to causing problems if you create a file named "xidfoo"
in a slot directory is yeah, don't do that.

Note that this patch changes the PANIC to ERROR. This will promote to FATAL
during the startup process, which I think is fine. Objections?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 8 years ago

In reply to: Craig Ringer (#13)

Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

On 06/03/18 09:37, Craig Ringer wrote:

Revised patch attached.

I have _not_ rewritten to use sscanf yet. I'll do that next, so you
can choose the fewer-changes patch for backpatching if desired.

... and I'm not convinced it's really an improvement.

uint32 xid, lsn_high, lsn_low;

if (sscanf(spill_de->d_name, REORDERBUFFER_TXN_FILENAME_FORMAT,
xid, lsn_high, lsn_low) == 3)
{

since we don't use the scanned-out information.

I guess my answer to causing problems if you create a file named
"xidfoo" in a slot directory is yeah, don't do that.

Note that this patch changes the PANIC to ERROR. This will promote to
FATAL during the startup process, which I think is fine. Objections?

You mean because PG_exception_stack is not initialized for startup
process? That deserves comment I think.

Other than that if I am very nitpicky, I'd call the new function
ReorderBufferCleanupSerializedTXNs.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Craig Ringer (#12)

Re: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

Craig Ringer wrote:

Revised patch attached.

I have _not_ rewritten to use sscanf yet. I'll do that next, so you can
choose the fewer-changes patch for backpatching if desired.

Pushed, with a further change: it seems more sensible to centralize the
whole operation of building the path, rather than just the format
string, so I created a new function to do that. The code looks cleaner
IMO this way IMO. All tests pass in all branches.

BTW the way the XLogSegNoOffsetToRecPtr() et al macros were modified to
accept wal_segment_size at the end of the argument list, *after* the
output argument, seems pretty bad style.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services