Possible bug in logical replication.
We got the following assertion failure on our buildfarm, on the master branch
of Postgres, in the contrib/test_decoding regression tests:
2018-05-07 19:50:07.241 MSK [5af083bf.54ae:49] DETAIL: Streaming transactions committing after 0/2A00000, reading WAL from 0/29FFF1C.
2018-05-07 19:50:07.241 MSK [5af083bf.54ae:50] STATEMENT: SELECT end_lsn FROM pg_replication_slot_advance('regression_slot1', '0/2A00174')
TRAP: FailedAssertion("!(((RecPtr) % 8192 >= (((uintptr_t) ((sizeof(XLogPageHeaderData))) + ((4) - 1)) & ~((uintptr_t) ((4) - 1)))))", File: "xlogreader.c", Line: 241)
Stack trace is the following:
$ gdb -x ./gdbcmd --batch pgsql.build/tmp_install/home/buildfarm/build-farm/CORE-353-stable-func/inst/bin/postgres pgsql.build/contrib/test_decoding/tmp_check/data/core
[New LWP 21678]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: buildfarm regression [local] SELECT '.
Program terminated with signal SIGABRT, Aborted.
#0 0xf7722c89 in __kernel_vsyscall ()
#0 0xf7722c89 in __kernel_vsyscall ()
#1 0xf6b5ddd0 in __libc_signal_restore_set (set=0xffaf2240) at ../sysdeps/unix/sysv/linux/nptl-signals.h:79
#2 __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3 0xf6b5f297 in __GI_abort () at abort.c:89
#4 0x56b3931a in ExceptionalCondition (conditionName=0x56bd0c38 "!(((RecPtr) % 8192 >= (((uintptr_t) ((sizeof(XLogPageHeaderData))) + ((4) - 1)) & ~((uintptr_t) ((4) - 1)))))", errorType=0x56b8bf61 "FailedAssertion", fileName=0x56bd0df0 "xlogreader.c", lineNumber=241) at assert.c:54
#5 0x5678c573 in XLogReadRecord (state=0x57628c84, RecPtr=44040192, errormsg=0xffaf2560) at xlogreader.c:241
#6 0x569c3191 in pg_logical_replication_slot_advance (startlsn=<optimized out>, moveto=44040564) at slotfuncs.c:370
#7 0x569c3c8e in pg_replication_slot_advance (fcinfo=0xffaf2708) at slotfuncs.c:487
#8 0x568a69c0 in ExecMakeTableFunctionResult (setexpr=0x57626e30, econtext=0x57626d88, argContext=0x57620b48, expectedDesc=0x57627e44, randomAccess=false) at execSRF.c:231
#9 0x568b41d3 in FunctionNext (node=0x57626cfc) at nodeFunctionscan.c:94
#10 0x568a5ce2 in ExecScanFetch (recheckMtd=0x568b3ec0 <FunctionRecheck>, accessMtd=0x568b3ed0 <FunctionNext>, node=0x57626cfc) at execScan.c:95
#11 ExecScan (node=0x57626cfc, accessMtd=0x568b3ed0 <FunctionNext>, recheckMtd=0x568b3ec0 <FunctionRecheck>) at execScan.c:162
#12 0x568b4243 in ExecFunctionScan (pstate=0x57626cfc) at nodeFunctionscan.c:270
#13 0x5689caba in ExecProcNode (node=0x57626cfc) at ../../../src/include/executor/executor.h:238
#14 ExecutePlan (execute_once=<optimized out>, dest=0x0, direction=NoMovementScanDirection, numberTuples=<optimized out>, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x57626cfc, estate=0x57626bf0) at execMain.c:1731
#15 standard_ExecutorRun (queryDesc=0x5760f248, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:368
#16 0x56a132cd in PortalRunSelect (portal=portal@entry=0x575c4ea8, forward=forward@entry=true, count=0, count@entry=2147483647, dest=0x576232d4) at pquery.c:932
#17 0x56a14b00 in PortalRun (portal=0x575c4ea8, count=2147483647, isTopLevel=true, run_once=true, dest=0x576232d4, altdest=0x576232d4, completionTag=0xffaf2c40 "") at pquery.c:773
#18 0x56a0fbb7 in exec_simple_query (query_string=query_string@entry=0x57579070 "SELECT end_lsn FROM pg_replication_slot_advance('regression_slot1', '0/2A00174') ") at postgres.c:1122
#19 0x56a11a8e in PostgresMain (argc=1, argv=0x575a0b8c, dbname=<optimized out>, username=0x575a09f0 "buildfarm") at postgres.c:4153
#20 0x566cd9cb in BackendRun (port=0x5759a358) at postmaster.c:4361
#21 BackendStartup (port=0x5759a358) at postmaster.c:4033
#22 ServerLoop () at postmaster.c:1706
#23 0x5698a608 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1379
#24 0x566cf642 in main (argc=<optimized out>, argv=<optimized out>) at main.c:228
As you can see, the assertion failure happens because the specified startlsn (0x2A00000) is not considered valid.
This LSN is taken from slot's confirmed flush LSN in pg_replication_slot_advance:
startlsn = MyReplicationSlot->data.confirmed_flush;
Unfortunately I was not able to reproduce the problem even after repeating these regression tests 1000 times, so it seems to be a very hard-to-reproduce non-deterministic bug.
I wonder if there is a guarantee that confirmed_flush always points to the start of a record.
Inspecting the xlogreader.c code shows that confirmed_flush is, for example, assigned EndRecPtr in DecodingContextFindStartpoint.
EndRecPtr is updated in XLogReadRecord, and if the record doesn't cross a page boundary, the following formula is used:
state->EndRecPtr = RecPtr + MAXALIGN(total_len);
So if a record ends exactly at a page boundary, it looks like EndRecPtr can point to the page boundary.
This is confirmed by the following comment in the XLogReadRecord function:
/*
* RecPtr is pointing to end+1 of the previous WAL record. If we're
* at a page boundary, no more records can fit on the current page. We
* must skip over the page header, but we can't do that until we've
* read in the page, since the header size is variable.
*/
But this means that confirmed_flush can also point to a page boundary, which will cause this assertion failure in XLogReadRecord:
/*
* Caller supplied a position to start at.
*
* In this case, the passed-in record pointer should already be
* pointing to a valid record starting position.
*/
Assert(XRecOffIsValid(RecPtr));
Maybe this bug is so difficult to reproduce because it takes place only when the WAL record preceding the specified start point ends exactly at a page boundary. The probability of that seems to be quite small...
I am not so familiar with the WAL iteration code and logical decoding, so I will be pleased to receive confirmation or refutation of my analysis of the problem.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 17.05.2018 10:45, Konstantin Knizhnik wrote:
Looks like it is not the first time this problem was encountered:
/messages/by-id/34d66f63-40a9-4c3e-c9a1-248d1e393d29@enterprisedb.com
I wonder who is now maintaining logical replication stuff and whether
this bug is going to be fixed?
I think that using restart_lsn instead of confirmed_flush is not the right
approach.
If restart_lsn is not available and confirmed_flush is pointing to a page
boundary, then we should somehow handle this case and adjust
startlsn to point to a valid record position (by just adding the page
header size?).
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
I think that using restart_lsn instead of confirmed_flush is not the right approach.
If restart_lsn is not available and confirmed_flush is pointing to a page
boundary, then we should somehow handle this case and adjust
startlsn to point to a valid record position (by just adding the page header
size?).
Well, restart_lsn is always available on a live slot: it is initially set
in ReplicationSlotReserveWal during slot creation.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, May 17, 2018 at 12:31:09PM +0300, Konstantin Knizhnik wrote:
I wonder who is now maintaining logical replication stuff and whether this
bug is going to be fixed?
I haven't looked at your problem in detail, so I cannot give a sure
conclusion, but when it comes to pg_replication_slot_advance, which is
new as of v11, you need to look at this commit:
commit: 9c7d06d60680c7f00d931233873dee81fdb311c6
author: Simon Riggs <simon@2ndQuadrant.com>
date: Wed, 17 Jan 2018 11:38:34 +0000
Ability to advance replication slots
Ability to advance both physical and logical replication slots using a
new user function pg_replication_slot_advance().
For logical advance that means records are consumed as fast as possible
and changes are not given to output plugin for sending. Makes 2nd phase
(after we reached SNAPBUILD_FULL_SNAPSHOT) of replication slot creation
faster, especially when there are big transactions as the reorder buffer
does not have to deal with data changes and does not have to spill to
disk.
Author: Petr Jelinek
Reviewed-by: Simon Riggs
I am adding an open item.
--
Michael
At Thu, 17 May 2018 13:54:07 +0300, Arseny Sher <a.sher@postgrespro.ru> wrote in <87in7md034.fsf@ars-thinkpad>
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
I think that using restart_lsn instead of confirmed_flush is not the right approach.
If restart_lsn is not available and confirmed_flush is pointing to a page
boundary, then we should somehow handle this case and adjust
startlsn to point to a valid record position (by just adding the page header
size?).

Well, restart_lsn is always available on a live slot: it is initially set
in ReplicationSlotReserveWal during slot creation.
restart_lsn stays at the beginning of a transaction until the
transaction ends, so just using restart_lsn allows repeated
decoding of a transaction; in short, rewinding occurs. The
function works only for an inactive slot, so the current code works
fine on this point. In addition, restart_lsn can also be on a
page boundary.
We can see the problem easily:
1. Create a logical replication slot at the current LSN:
select pg_create_logical_replication_slot('s1', 'pgoutput');
2. Advance the LSN by two or three pages by doing anything.
3. Advance the slot to a page boundary, e.g.:
select pg_replication_slot_advance('s1', '0/9624000');
4. Advance the slot further; the server crashes.
So directly setting ctx->reader->EndRecPtr to startlsn fixes the
problem, but I found another problem here.
The function accepts any LSN even if it is not at the beginning
of a record. We will see errors, crashes, infinite waiting, or
maybe any other kind of trouble with such values. The moved-to LSN must
always be at the "end of a record" (that is, at the start of the
next record). The attached patch also fixes this.
The documentation doesn't seem to require a fix.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix_pg_logical_replication_slot_advance.patch (text/x-patch)
On Fri, May 18, 2018 at 2:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
So directly setting ctx->reader->EndRecPtr to startlsn fixes the
problem, but I found another problem here.
I confirmed that the attached patch fixes this problem as well as the
same problem reported on another thread.
I'm not sure it's a good approach to change the state of the xlogreader
directly in the replication slot code, because it also means that we
have to care about this code whenever the xlogreader code is changed.
Another possible way might be to make XLogFindNextRecord available in
backend code and move startlsn to the first valid record with an
LSN >= startlsn by using that function. Please find the attached patch.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
find_next_valid_record.patch (application/octet-stream)
Hello.
At Wed, 23 May 2018 15:56:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoA+5Tz0Z8zHOmD=sU5F=cygoEjHs7WvbBDL07fH9ayVaw@mail.gmail.com>
So directly setting ctx->reader->EndRecPtr to startlsn fixes the
problem, but I found another problem here.

I confirmed that the attached patch fixes this problem as well as the
same problem reported on another thread.

I'm not sure it's a good approach to change the state of the xlogreader
directly in the replication slot code, because it also means that we
have to care about this code whenever the xlogreader code is changed.
XLogReadRecord checks whether state->ReadRecPtr is invalid in that case
and, if so, works the same as the explicit-LSN case. That suggests this
usage is intended (I found no actual use case, though). It also seems
somewhat uneasy to me, though...
Another possible way might be to make XLogFindNextRecord available in
backend code and move startlsn to the first valid record with an
LSN >= startlsn by using that function. Please find the attached patch.
Another reason for the code is the fact that confirmed_lsn stores
EndRecPtr after the last XLogReadRecord call. That is, by definition,
confirmed_lsn must be at the start of a record or at a page boundary,
and we should error out if not. For that reason, calling
XLogFindNextRecord would not be the right thing to do here. We should
just skip the page header when we are on a boundary, but that is
already done in XLogReadRecord.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, May 24, 2018 at 10:14:01AM +0900, Kyotaro HORIGUCHI wrote:
At Wed, 23 May 2018 15:56:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoA+5Tz0Z8zHOmD=sU5F=cygoEjHs7WvbBDL07fH9ayVaw@mail.gmail.com>
Another possible way might be to make XLogFindNextRecord available in
backend code and move startlsn to the first valid record with an
LSN >= startlsn by using that function. Please find the attached patch.
Another reason for the code is the fact that confirmed_lsn stores
EndRecPtr after the last XLogReadRecord call. That is, by definition,
confirmed_lsn must be at the start of a record or at a page boundary,
and we should error out if not. For that reason, calling
XLogFindNextRecord would not be the right thing to do here. We should
just skip the page header when we are on a boundary, but that is
already done in XLogReadRecord.
Maybe I am being too naive, but wouldn't it be just enough to update the
confirmed flush LSN to ctx->reader->ReadRecPtr? This way, the slot
advances up to the beginning of the last record where the user wants to
advance, and not to the beginning of the next record:
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -395,7 +395,7 @@ pg_logical_replication_slot_advance(XLogRecPtr startlsn, XLogRecPtr moveto)
if (ctx->reader->EndRecPtr != InvalidXLogRecPtr)
{
- LogicalConfirmReceivedLocation(moveto);
+ LogicalConfirmReceivedLocation(ctx->reader->ReadRecPtr);
/*
* If only the confirmed_flush_lsn has changed the slot won't get
I agree with the point made above about not manually touching the
XLogReader context outside of xlogreader.c.
--
Michael
Hello,
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
restart_lsn stays at the beginning of a transaction until the
transaction ends, so just using restart_lsn allows repeated
decoding of a transaction; in short, rewinding occurs. The
function works only for an inactive slot, so the current code works
fine on this point.
Sorry, I do not follow. restart_lsn is advanced whenever there is a
consistent snapshot dumped (in xl_running_xacts) which is old enough to
wholly decode all xacts not yet confirmed by the client. Could you
please elaborate, what's wrong with that?
In addition, restart_lsn can also be on a
page boundary.
Do you have an example of that? restart_lsn is set initially to the WAL
insert position in ReplicationSlotReserveWal, and later it always points
to an xl_running_xacts record with a consistent snapshot dumped.
So directly setting ctx->reader->EndRecPtr to startlsn fixes the
problem, but I found another problem here.
There is a minor issue with the patch. Now slot advancement hangs
polling for new WAL on my example from [1]; most probably because we
must exit the loop when ctx->reader->EndRecPtr == moveto.
The function accepts any LSN even if it is not at the beginning
of a record. We will see errors, crashes, infinite waiting, or
maybe any other kind of trouble with such values. The moved-to LSN must
always be at the "end of a record" (that is, at the start of the
next record). The attached patch also fixes this.
Indeed, but we have these problems only if we are trying to read WAL
since confirmed_flush.
[1]: /messages/by-id/873720e4hf.fsf@ars-thinkpad
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Michael Paquier <michael@paquier.xyz> writes:
Maybe I am being too naive, but wouldn't it be just enough to update the
confirmed flush LSN to ctx->reader->ReadRecPtr? This way, the slot
advances up to the beginning of the last record where the user wants to
advance, and not to the beginning of the next record:
The same problem would have to be handled in
pg_logical_slot_get_changes_guts and in apply-worker feedback; also,
there is a convention that all commits since confirmed_flush must be
decoded, so we risk decoding such a boundary commit twice.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
This thread seems to have died down without any fix being proposed.
Simon, you own this open item.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6 June 2018 at 17:22, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
This thread seems to have died down without any fix being proposed.
Simon, you own this open item.
Thanks, will look.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 07, 2018 at 08:32:10AM +0100, Simon Riggs wrote:
On 6 June 2018 at 17:22, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
This thread seems to have died down without any fix being proposed.
Simon, you own this open item.

Thanks, will look.
Petr and I have found a couple of issues about the slot advance stuff on
this thread:
/messages/by-id/20180525052805.GA15634@paquier.xyz
The result is f731cfa which, per my tests, takes care of this issue:
first advancing a slot to a WAL page boundary and then advancing it to
the latest available LSN no longer triggers any assertions. It would be
nice to have a double-check though, so I am leaving this thread on the
list of open items for now.
--
Michael
Hello
Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
Michaël's commit fixes the reported bug?
Thanks,
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 15, 2018 at 5:06 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Hello
Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
Michaël's commit fixes the reported bug?
I haven't reviewed that commit deeply yet, but I have confirmed that the
reported bug is fixed, using the following test case which previously
led to an assertion failure. This test case is the same as in the
previous report [1].
postgres(1:128004)=# select pg_create_logical_replication_slot('s1',
'pgoutput');
2018-06-15 18:04:31.301 JST [128004] LOG: logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.301 JST [128004] DETAIL: There are no running transactions.
2018-06-15 18:04:31.301 JST [128004] STATEMENT: select
pg_create_logical_replication_slot('s1', 'pgoutput');
pg_create_logical_replication_slot
------------------------------------
(s1,0/16453C0)
(1 row)
postgres(1:128004)=# select pg_switch_wal();
pg_switch_wal
---------------
0/16453D8
(1 row)
postgres(1:128004)=# select pg_switch_wal();
pg_switch_wal
---------------
0/2000000
(1 row)
postgres(1:128004)=# select pg_replication_slot_advance('s1', '0/2000000');
2018-06-15 18:04:31.338 JST [128004] LOG: starting logical decoding
for slot "s1"
2018-06-15 18:04:31.338 JST [128004] DETAIL: Streaming transactions
committing after 0/16453C0, reading WAL from 0/1645388.
2018-06-15 18:04:31.338 JST [128004] STATEMENT: select
pg_replication_slot_advance('s1', '0/2000000');
2018-06-15 18:04:31.339 JST [128004] LOG: logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.339 JST [128004] DETAIL: There are no running transactions.
2018-06-15 18:04:31.339 JST [128004] STATEMENT: select
pg_replication_slot_advance('s1', '0/2000000');
pg_replication_slot_advance
-----------------------------
(s1,0/2000000)
(1 row)
postgres(1:128004)=# create table a (c int);
CREATE TABLE
postgres(1:128004)=# select pg_replication_slot_advance('s1',
pg_current_wal_lsn());
2018-06-15 18:04:31.401 JST [128004] LOG: starting logical decoding
for slot "s1"
2018-06-15 18:04:31.401 JST [128004] DETAIL: Streaming transactions
committing after 0/2000000, reading WAL from 0/1645388.
2018-06-15 18:04:31.401 JST [128004] STATEMENT: select
pg_replication_slot_advance('s1', pg_current_wal_lsn());
2018-06-15 18:04:31.402 JST [128004] LOG: logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.402 JST [128004] DETAIL: There are no running transactions.
2018-06-15 18:04:31.402 JST [128004] STATEMENT: select
pg_replication_slot_advance('s1', pg_current_wal_lsn());
pg_replication_slot_advance
-----------------------------
(s1,0/2017828)
(1 row)
[1]: /messages/by-id/34d66f63-40a9-4c3e-c9a1-248d1e393d29@enterprisedb.com
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
Michaël's commit fixes the reported bug?
I confirm that starting to read WAL from restart_lsn, as implemented in
f731cfa, fixes this issue, as well as the second issue tushar mentioned
at [1]. I think the code can still be improved a bit though --
consider the attached patch:
* The pg_logical_replication_slot_advance comment was not very informative
and even a bit misleading: it said that we use confirmed_flush_lsn and
restart_lsn, but didn't explain why.
* An excessive check in its main loop.
* A copy-paste comment fix.
[1]: /messages/by-id/5f85bf41-098e-c4e1-7332-9171fef57a0a@enterprisedb.com
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001-Cosmetic-review-for-f731cfa.patch (text/x-diff)
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
I confirm that starting reading WAL since restart_lsn as implemented in
f731cfa fixes this issue, as well as the second issue tushar mentioned
at [1].
Thanks!
+ /*
+ * Start reading WAL at restart_lsn, which certainly points to the valid
+ * record.
+ */
XLogRecPtr startlsn = MyReplicationSlot->data.restart_lsn;
XLogRecPtr retlsn = MyReplicationSlot->data.confirmed_flush;
What does this one actually bring?
PG_TRY();
{
- /* restart at slot's confirmed_flush */
+ /* start_lsn doesn't matter here, we don't replay xacts at all */
ctx = CreateDecodingContext(InvalidXLogRecPtr,
NIL,
true,
Okay for this one.
- /*
- * The {begin_txn,change,commit_txn}_wrapper callbacks above will
- * store the description into our tuplestore.
- */
+ /* Changes are not actually produced in fast_forward mode. */
This one is a good idea. Now CreateDecodingContext is missing the
description of what fast_forward actually does, aka no changes are
produced. Could you update your patch to reflect that? That would be
useful for future callers of CreateDecodingContext as well.
- /* Stop once the moving point wanted by caller has been reached */
- if (moveto <= ctx->reader->EndRecPtr)
- break;
-
CHECK_FOR_INTERRUPTS();
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?
--
Michael
On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?
I have been chewing a bit more on the proposed patch, finishing with the
attached to close the loop. Thoughts?
--
Michael
Attachments:
slot-advance-comments.patch (text/x-diff)
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?

I have been chewing a bit more on the proposed patch, finishing with the
attached to close the loop. Thoughts?
Sorry for being pedantic, but it seems to me worthwhile to mention *why*
we need decoding machinery at all -- like I wrote:
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).
Also,
* The slot's restart_lsn is used as start point for reading records,
This is clearly seen from the code, I propose to remove this.
* while confirmed_lsn is used as base point for the decoding context.
And as I wrote, this doesn't matter as changes are not produced.
 * The LSN position to move to is checked by doing a per-record scan and
 * logical decoding which makes sure that confirmed_lsn is updated to a
 * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * changes. As decoding is done with fast_forward mode, no changes are
+ * actually generated.
confirmed_lsn is always updated to `moveto` unless we run out of WAL
earlier (or unless we try to move the slot backwards, which is obviously
forbidden) -- consistent changes are practically irrelevant here.
Moreover, we could directly set confirmed_lsn and still have consistent
changes afterwards, as restart_lsn and xmin of the slot are not touched.
What we actually do here is try to advance *restart_lsn and xmin* as far
as we can, but only up to the point which ensures that decoding can
assemble a consistent snapshot allowing to fully decode all COMMITs
since the updated `confirmed_flush_lsn`. All this happens in
SnapBuildProcessRunningXacts.
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?
Probably, though I am not sure what the point of this is. OK, I have
kept this check in the updated patch (with your comments incorporated)
and am CC'ing Petr.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment: lslot_advance_comments.patch (text/x-diff, +9/-12)
On 20/06/18 09:59, Arseny Sher wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?
I have been chewing a bit more on the proposed patch, finishing with the
attached to close the loop. Thoughts?
Sorry for being pedantic, but it seems to me worthwhile to mention *why*
we need decoding machinery at all -- like I wrote:
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).
+1
Also,
* The slot's restart_lsn is used as start point for reading records,
This is clearly seen from the code, I propose to remove this.
Given that a bug fix was just done for this, I guess the extra clarity
there does not hurt.
 * The LSN position to move to is checked by doing a per-record scan and
 * logical decoding which makes sure that confirmed_lsn is updated to a
 * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * changes. As decoding is done with fast_forward mode, no changes are
+ * actually generated.
confirmed_lsn is always updated to `moveto` unless we run out of WAL
earlier (and unless we try to move slot backwards, which is obviously
forbidden) -- consistent changes are practically irrelevant
here. Moreover, we can directly set confirmed_lsn and still have
consistent changes further as restart_lsn and xmin of the slot are not
touched. What we actually do here is trying to advance *restart_lsn and
xmin* as far as we can but up to the point which ensures that decoding
can assemble a consistent snapshot allowing to fully decode all COMMITs
since updated `confirmed_flush_lsn`. All this happens in
SnapBuildProcessRunningXacts.
Those are two different things. One is the consistent snapshot for
logical decoding, without which we can't decode at all; that's handled
by restart_lsn and xmin. The consistent stream of data for the consumer,
on the other hand, is handled by confirmed_lsn (and the comment says
that). You have to take into account that the next use of the slot may
consume data (i.e. it may be done via one of the other interfaces and
not by a move). So I think what Michael has here is correct.
It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted. Petr, isn't that the intention
here?
Well, it seems wasteful to just exit there if we have already finished
all the requested work; it also gives some consistency with the coding
of get/peek_changes. Not very important for the functionality either
way.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services