Reorderbuffer crash during recovery

Started by vignesh C · about 6 years ago · 22 messages
#1 vignesh C
vignesh21@gmail.com

Hi,

I found a couple of crashes in the reorderbuffer while reviewing/testing
logical_work_mem and logical streaming of large in-progress
transactions. Stack traces are given below:
Issue 1:
#0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6
#1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332
#6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
record=0x3342770) at decode.c:121
#7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
#8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
<XLogSendLogical>) at walsender.c:2199
#9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
walsender.c:1128
#10 0x0000000000b6ce83 in exec_replication_command
(cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
at walsender.c:1545
#11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
postgres.c:4256
#12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
#13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
#14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
#15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
postmaster.c:1400
#16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

From initial analysis it looks like:
Issue 1 seems to occur if all of the reorderbuffer changes have been
flushed to disk and then the server restarts.
Issue 2 seems to occur if there are many subtransactions present and
then the server restarts. The subtransaction's final_lsn is not being
set, and when ReorderBufferRestoreCleanup is called the assert fails.
Perhaps we have to set the subtransaction's final_lsn before the
cleanup (not sure).
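
For reference, the code in ReorderBufferAbortOld() that frames #3 and #4
of the first stack point at is roughly the following (an excerpt, not the
whole function):

	ReorderBufferChange *last =
		dlist_tail_element(ReorderBufferChange, node, &txn->changes);

	txn->final_lsn = last->lsn;

dlist_tail_element() asserts !dlist_is_empty(head), which is exactly the
assertion that fails in issue 1 when txn->changes is empty.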

I could not reproduce this issue consistently with a test case, but it
looks like a real problem from code review.

For issue 1, I could reproduce it with the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it always flushes (see
the sketch below).
2) Have many open transactions with subtransactions.
3) Attach to one of the transactions from gdb and call abort().
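
Step 1 can be done with a quick testing hack along these lines (purely
illustrative, not proposed code):

	static void
	ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
	{
		/* testing hack: spill the transaction to disk unconditionally,
		 * instead of only after enough changes have accumulated */
		ReorderBufferSerializeTXN(rb, txn);
	}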

I'm not sure of the fix for this. If I get time I will dig further to
find a fix.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#1)
Re: Reorderbuffer crash during recovery

On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

I found a couple of crashes in the reorderbuffer while reviewing/testing
logical_work_mem and logical streaming of large in-progress
transactions. Stack traces are given below:

...

For issue 1, I could reproduce it with the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it always flushes.
2) Have many open transactions with subtransactions.
3) Attach to one of the transactions from gdb and call abort().

Do you need subtransactions for issue 1? It appears that after the
restart, if the changes list is empty, it will hit the assert. Am I
missing something?

I'm not sure of the fix for this. If I get time I will dig further to
find a fix.
Thoughts?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3 vignesh C
vignesh21@gmail.com
In reply to: Dilip Kumar (#2)
Re: Reorderbuffer crash during recovery

On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:

...

Do you need subtransactions for issue 1? It appears that after the
restart, if the changes list is empty, it will hit the assert. Am I
missing something?

When I reported this issue I could reproduce it with subtransactions.
I have now tried without subtransactions and could still reproduce it.
You are right: issue 1 appears in both cases, with and without
subtransactions.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#4 Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#3)
Re: Reorderbuffer crash during recovery

On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:

...

When I reported this issue I could reproduce it with subtransactions.
I have now tried without subtransactions and could still reproduce it.
You are right: issue 1 appears in both cases, with and without
subtransactions.

Okay, thanks for the confirmation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#4)
Re: Reorderbuffer crash during recovery

On Thu, Nov 07, 2019 at 11:01:17AM +0530, Dilip Kumar wrote:

On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

...


I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too? If only with the patches, which
version exactly?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#5)
Re: Reorderbuffer crash during recovery

On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too?

This occurs with the clean master. This is a base code problem
revealed while doing stress testing of logical_work_mem patches.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7 Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#6)
Re: Reorderbuffer crash during recovery

Hi,

On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:

On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too?

This occurs with the clean master. This is a base code problem
revealed while doing stress testing of logical_work_mem patches.

As far as I can tell there are no repro steps included? Any chance to
get those?

Greetings,

Andres Freund

#8 vignesh C
vignesh21@gmail.com
In reply to: Andres Freund (#7)
Re: Reorderbuffer crash during recovery

On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:

On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too?

This occurs with the clean master. This is a base code problem
revealed while doing stress testing of logical_work_mem patches.

As far as I can tell there are no repro steps included? Any chance to
get those?

This problem does not occur consistently. I found the crashes when I
was reviewing and testing "logical streaming for large in-progress
transactions" [1].

This issue does not occur directly: it occurs only after the postgres
process has crashed for some other reason (not in the reorderbuffer, but
due to some other issue). After that original, non-reorderbuffer crash,
this reorderbuffer crash appears.

To simplify reproducing the reorderbuffer crash, I used the following steps:
1) Make a replication setup with publisher/subscriber for some table.
2) Prepare an sql file with the below:
begin;
4096 insert statements;
select pg_sleep(120);
3) Execute the above script.
4) Attach gdb to the postgres backend while pg_sleep is in progress.
5) Call abort() from the attached gdb.
6) After some time there will be many core files in the publisher's data
directory.

[1]: https://commitfest.postgresql.org/25/1927/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#8)
Re: Reorderbuffer crash during recovery

On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:

On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too?

This occurs with the clean master. This is a base code problem
revealed while doing stress testing of logical_work_mem patches.

As far as I can tell there are no repro steps included? Any chance to
get those?

I think it will be a bit tricky to write a test case, but the call
stack indicates the problem. For example, for issue 1, see the below
part of the call stack:

#2 0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332

The assertion failure indicates that the changes list is empty when we
are trying to get the 'lsn' of the last change. I think this is possible
when the server restarts immediately after the transaction's changes
have been serialized (say it crashed, or somebody switched it off and
restarted it). The reason is that after serializing the changes, the
changes list will be empty and the txn's serialized flag will be set. I
am not sure there is any reason to set final_lsn if the changes list is
empty.
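
For illustration, the spill loop in ReorderBufferSerializeTXN() unlinks
each change from the in-memory list as it is written out, so after a full
spill txn->changes is empty (a sketch with the file handling elided):

	dlist_mutable_iter change_i;

	/* serialize changestream */
	dlist_foreach_modify(change_i, &txn->changes)
	{
		ReorderBufferChange *change =
			dlist_container(ReorderBufferChange, node, change_i.cur);

		/* ... open or switch the spill file for change->lsn ... */
		ReorderBufferSerializeChange(rb, txn, fd, change);
		dlist_delete(&change->node);
		ReorderBufferReturnChange(rb, change);
	}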

Similarly, if we look at the call stack of issue 2, the problem is clear.

#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973

I think this has a symptom similar to the prior issue, but for
subtransactions. ReorderBufferAbortOld() tries to set the final_lsn of
the top-level transaction, but not of its subtransactions; later in that
path, ReorderBufferRestoreCleanup() expects it to be set even for a
subtransaction. Is that Assert in ReorderBufferRestoreCleanup() really
required, given that immediately after the assert we set the value of
final_lsn anyway?
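
The failing check, per the second backtrace, is this assertion near the
top of the restore-cleanup path:

	/* in ReorderBufferRestoreCleanup(), reorderbuffer.c:3052 in the trace */
	Assert(txn->final_lsn != InvalidXLogRecPtr);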

...

To simplify reproducing the reorderbuffer crash, I used the following steps:
1) Make a replication setup with publisher/subscriber for some table.
2) Prepare an sql file with the below:
begin;
4096 insert statements;
select pg_sleep(120);
3) Execute the above script.
4) Attach gdb to the postgres backend while pg_sleep is in progress.
5) Call abort() from the attached gdb.

Isn't it important to call this abort() immediately after the changes
are serialized, as explained above?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#9)
Re: Reorderbuffer crash during recovery

On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

...

Thanks Amit for the further analysis of the issues.

...

Isn't it important to call this abort() immediately after the changes
are serialized, as explained above?

As we are performing exactly 4096 insert statements on a fresh
replication setup, the transaction gets serialized immediately after the
4096th insert. The variable max_changes_in_memory is initialized to 4096
(in reorderbuffer.c), and txn->nentries_mem is incremented for every
insert statement in the ReorderBufferQueueChange function. As we have
executed 4096 insert statements, txn->nentries_mem becomes 4096, which
triggers serialization from the ReorderBufferCheckSerializeTXN function.
Attaching gdb and calling abort() at that point helped in reproducing
the issue consistently.
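
A sketch of that check, simplified from reorderbuffer.c as described
above:

	static void
	ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
	{
		/* spill the transaction to disk once it has accumulated
		 * max_changes_in_memory (4096) changes in memory */
		if (txn->nentries_mem >= max_changes_in_memory)
			ReorderBufferSerializeTXN(rb, txn);
	}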

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#11 vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#1)
1 attachment(s)
Re: Reorderbuffer crash during recovery

...

I have further analyzed the issue and found that:
After abort() is called and the process restarts, it cleans up the
reorderbuffer state of the aborted transactions in the
ReorderBufferAbortOld function. It crashes in the code below, because
no changes are currently present in the list; all of the changes have
been serialized:
.......
	if (txn->serialized && txn->final_lsn == 0)
	{
		ReorderBufferChange *last =
			dlist_tail_element(ReorderBufferChange, node, &txn->changes);

		txn->final_lsn = last->lsn;
	}
.......

It sets final_lsn here so that ReorderBufferRestoreCleanup can iterate
from start_lsn to final_lsn and clean up the serialized files. One
solution we were considering is to store the lsn of the last serialized
change while serializing, and to set final_lsn from it in the case that
currently crashes, like the code below:
......
	if (txn->serialized && txn->final_lsn == 0 &&
		!dlist_is_empty(&txn->changes))
	{
		ReorderBufferChange *last =
			dlist_tail_element(ReorderBufferChange, node, &txn->changes);

		txn->final_lsn = last->lsn;
	}
	else
	{
		/*
		 * There are no changes present, as all of the changes were
		 * serialized; use the last lsn that was serialized.
		 */
		txn->final_lsn = txn->last_serialized_lsn;
	}
......

I have tested the same scenario and verified that it works. The patch
is attached. I could not add a test case for this, as it involves
attaching gdb and calling abort().

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Reorder-buffer-crash-while-aborting-old-transactions.patch (text/x-patch)
From 6dc886922ddd12ca4bc2b40bb340ffc6a188ffad Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Wed, 11 Dec 2019 10:37:40 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no
reorderbuffer changes were present in the list: all of the reorderbuffer
changes had been serialized. Fix it by storing the last change's lsn
while serializing the reorderbuffer changes. This lsn is used as
final_lsn in scenarios where the whole reorderbuffer change list has
been serialized.
---
 src/backend/replication/logical/reorderbuffer.c | 19 ++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |  8 ++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..0122bdb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1963,13 +1963,22 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (txn->serialized && txn->final_lsn == 0 &&
+				!dlist_is_empty(&txn->changes))
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
 
 				txn->final_lsn = last->lsn;
 			}
+			else
+			{
+				/*
+				 * There are no changes present as all of the changes were
+				 * serialized, use the last lsn that was serialized.
+				 */
+				txn->final_lsn = txn->last_serialized_lsn;
+			}
 
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
@@ -2433,6 +2442,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferSerializeTXN(rb, subtxn);
 	}
 
+	if (!dlist_is_empty(&txn->changes))
+	{
+		ReorderBufferChange *lastChange =
+		dlist_tail_element(ReorderBufferChange,	node, &txn->changes);
+
+		txn->last_serialized_lsn = lastChange->lsn;
+	}
+
 	/* serialize changestream */
 	dlist_foreach_modify(change_i, &txn->changes)
 	{
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..dbf3c7d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,14 @@ typedef struct ReorderBufferTXN
 	bool		serialized;
 
 	/*
+	 * The last lsn that is serialized, this is required to generate the
+	 * serialized file and removing them while aborting old transactions when
+	 * there are no changes present in memory as all of the changes have been
+	 * serialized.
+	 */
+	XLogRecPtr	last_serialized_lsn;
+
+	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#11)
Re: Reorderbuffer crash during recovery

On Wed, Dec 11, 2019 at 11:13 AM vignesh C <vignesh21@gmail.com> wrote:

It sets final_lsn here so that ReorderBufferRestoreCleanup can iterate
from start_lsn to final_lsn and clean up the serialized files. One
solution we were considering is to store the lsn of the last serialized
change while serializing, and to set final_lsn from it in the case that
currently crashes, like the code below:

Sure, we can do something along the lines of what you are suggesting,
but why can't we update final_lsn at the time of serializing the
changes? If we do that, then we don't even need to compute it separately
during ReorderBufferAbortOld.

Let me try to explain the problem and the proposed solutions.
Currently, after serializing the changes, we remove them from the
ReorderBufferTXN's 'changes' list. Now, if the system crashes for any
reason after that, we won't be able to compute final_lsn after the
restart. And that leads to the invalid access in ReorderBufferAbortOld,
which tries to read the changes list from the ReorderBufferTXN to
compute final_lsn.

We could fix it either by tracking 'last_serialized_change' as proposed
by Vignesh, or by updating final_lsn while we serialize the changes.
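
In code terms, the second option is essentially a one-line addition in
the serialization path (a sketch; the exact placement is discussed
below):

	ReorderBufferSerializeChange(rb, txn, fd, change);
	txn->final_lsn = change->lsn;	/* remember the last serialized lsn */
	dlist_delete(&change->node);
	ReorderBufferReturnChange(rb, change);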

IIUC, commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e [1] tried to fix
a related issue, which led to this other problem. Alvaro, Andres, do
you have any suggestions?

[1]:
commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri Jan 5 12:17:10 2018 -0300

    Fix failure to delete spill files of aborted transactions

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#13 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#12)
1 attachment(s)
Re: Reorderbuffer crash during recovery

On Tue, Dec 17, 2019 at 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

...

Sure, we can do something along the lines of what you are suggesting,
but why can't we update final_lsn at the time of serializing the
changes? If we do that, then we don't even need to compute it separately
during ReorderBufferAbortOld.

...

I felt Amit's solution also solves the problem. The attached patch has
the fix based on the proposed solution.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Reorder-buffer-crash-while-aborting-old-transactions.patch (text/x-patch)
From 0a25c2f93f9aff267e529b3b0a7842b3e043d782 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Fri, 27 Dec 2019 13:17:04 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no
reorderbuffer changes were present in the list: all of the reorderbuffer
changes had been serialized. Fix it by storing the last change's lsn
while serializing the reorderbuffer changes. This lsn is used as
final_lsn, which helps in cleaning up spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 16 +---------------
 src/include/replication/reorderbuffer.h         |  3 +--
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..7fa9163 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1956,21 +1956,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (txn->serialized && txn->final_lsn == 0)
-			{
-				ReorderBufferChange *last =
-				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-				txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		}
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
+		txn->final_lsn = change->lsn;
 		dlist_delete(&change->node);
 		ReorderBufferReturnChange(rb, change);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..139846e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -188,8 +188,7 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * last serialized lsn
 	 * ----
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: vignesh C (#13)
Re: Reorderbuffer crash during recovery

On 2019-Dec-27, vignesh C wrote:

I felt Amit's solution also solves the problem. The attached patch has
the fix based on the proposed solution.
Thoughts?

This seems a sensible fix to me, though I didn't try to reproduce the
failure.

@@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}

ReorderBufferSerializeChange(rb, txn, fd, change);
+ txn->final_lsn = change->lsn;
dlist_delete(&change->node);
ReorderBufferReturnChange(rb, change);

Should this be done inside ReorderBufferSerializeChange itself, instead
of in its caller? Also, would it be sane to verify that the TXN
doesn't already have a newer final_lsn? Maybe as an Assert.

@@ -188,8 +188,7 @@ typedef struct ReorderBufferTXN
* * plain abort record
* * prepared transaction abort
* * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * last serialized lsn
* ----

I propose "for a transaction with serialized changes, the LSN of the
latest serialized one, unless one of the above cases."

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#14)
Re: Reorderbuffer crash during recovery

On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Dec-27, vignesh C wrote:

I felt Amit's solution also solves the problem. The attached patch has
the fix based on the proposed solution.
Thoughts?

This seems a sensible fix to me, though I didn't try to reproduce the
failure.

@@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}

ReorderBufferSerializeChange(rb, txn, fd, change);
+ txn->final_lsn = change->lsn;
dlist_delete(&change->node);
ReorderBufferReturnChange(rb, change);

Should this be done inside ReorderBufferSerializeChange itself, instead
of in its caller?

Makes sense. But I think we should add a comment specifying the
reason why it is important to set final_lsn while serializing the
change.

Also, would it be sane to verify that the TXN
doesn't already have a newer final_lsn? Maybe as an Assert.

I don't think this is a good idea, because we update final_lsn with
commit_lsn in ReorderBufferCommit, after which we can still try to
serialize the remaining changes. Instead, we should update it only if
the change's lsn is greater than final_lsn.
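
In code terms, something like:

	/* never move final_lsn backwards, e.g. past an already-set commit_lsn */
	if (txn->final_lsn < change->lsn)
		txn->final_lsn = change->lsn;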

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#16 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#15)
2 attachment(s)
Re: Reorderbuffer crash during recovery

On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

...

@@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}

ReorderBufferSerializeChange(rb, txn, fd, change);
+ txn->final_lsn = change->lsn;
dlist_delete(&change->node);
ReorderBufferReturnChange(rb, change);

Should this be done inside ReorderBufferSerializeChange itself, instead
of in its caller?

Makes sense. But I think we should add a comment specifying the
reason why it is important to set final_lsn while serializing the
change.

Fixed

Also, would it be sane to verify that the TXN
doesn't already have a newer final_lsn? Maybe as an Assert.

I don't think this is a good idea, because we update final_lsn with
commit_lsn in ReorderBufferCommit, after which we can still try to
serialize the remaining changes. Instead, we should update it only if
the change's lsn is greater than final_lsn.

Fixed.
Thanks Alvaro and Amit for your suggestions. I have made the changes
based on them. Please find the updated patch attached. I have also
verified the patch on the back branches. A separate patch was required
for the REL_10 branch; it is attached as
0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Reorder-buffer-crash-while-aborting-old-transactions.patch (text/x-patch)
From dcd6c495c32d1dfd8ca3e723a0d45584ab5d10e7 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Mon, 30 Dec 2019 22:59:05 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no
reorderbuffer changes were present in the list: all of the reorderbuffer
changes had been serialized. Fix it by storing the last change's lsn
while serializing the reorderbuffer changes. This lsn is used as
final_lsn, which helps in cleaning up spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++++++++++--------------
 src/include/replication/reorderbuffer.h         |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..a807a75 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1956,21 +1956,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (txn->serialized && txn->final_lsn == 0)
-			{
-				ReorderBufferChange *last =
-				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-				txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2679,6 +2664,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing. Final_lsn would have been set with commit_lsn earlier when we
+	 * decode it commit, no need to update in that case
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..b58c19a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -188,8 +188,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized one, unless one of the above cases.
 	 * ----
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1

0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch (text/x-patch)
From f38ee9810f21397faa1e0fbb699360ee129119df Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Mon, 30 Dec 2019 22:52:28 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no
reorderbuffer changes were present in the list: all of the reorderbuffer
changes had been serialized. Fix it by storing the last change's lsn
while serializing the reorderbuffer changes. This lsn is used as
final_lsn, which helps in cleaning up spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++++++++++--------------
 src/include/replication/reorderbuffer.h         |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index be441c7..bc26ba1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1867,21 +1867,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (txn->serialized && txn->final_lsn == 0)
-			{
-				ReorderBufferChange *last =
-					dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-				txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2457,6 +2442,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing. Final_lsn would have been set with commit_lsn earlier when we
+	 * decode it commit, no need to update in that case
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e5cbe25..cf2d4e7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -167,8 +167,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized one, unless one of the above cases.
 	 * ----
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#16)
Re: Reorderbuffer crash during recovery

On Tue, Dec 31, 2019 at 11:35 AM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Dec-27, vignesh C wrote:

I felt Amit's solution also solves the problem. The attached patch has the
fix based on the proposed solution.
Thoughts?

This seems a sensible fix to me, though I didn't try to reproduce the
failure.

@@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}

ReorderBufferSerializeChange(rb, txn, fd, change);
+ txn->final_lsn = change->lsn;
dlist_delete(&change->node);
ReorderBufferReturnChange(rb, change);

Should this be done inside ReorderBufferSerializeChange itself, instead
of in its caller?

Makes sense. But I think we should add a comment specifying why it is
important to set final_lsn while serializing the change.

Fixed.

Also, would it be sane to verify that the TXN
doesn't already have a newer final_lsn? Maybe as an Assert.

I don't think this is a good idea, because we update final_lsn with
commit_lsn in ReorderBufferCommit, after which we can still try to
serialize the remaining changes. Instead, we should update it only if
the change's LSN is greater than final_lsn.

Fixed.
Thanks, Alvaro and Amit, for your suggestions. I have made the changes
accordingly; please find the updated patch attached. I have also
verified the patch on the back branches. A separate patch was required
for the REL_10 branch; it is attached as
0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch.
Thoughts?

One minor comment.  Otherwise, the patch looks fine to me.
+ /*
+ * We set final_lsn on a transaction when we decode its commit or abort
+ * record, but we never see those records for crashed transactions.  To
+ * ensure cleanup of these transactions, set final_lsn to that of their
+ * last change; this causes ReorderBufferRestoreCleanup to do the right
+ * thing. Final_lsn would have been set with commit_lsn earlier when we
+ * decode it commit, no need to update in that case
+ */
+ if (txn->final_lsn < change->lsn)
+ txn->final_lsn = change->lsn;

/decode it commit,/decode its commit,

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
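
For reference, the guarded update under discussion reduces to the sketch
below; the comments are editorial, condensing the reasoning from this
thread, and the rest of ReorderBufferSerializeChange is elided:

	/*
	 * Editorial sketch: advance final_lsn as each change is serialized,
	 * but never move it backwards.  If the transaction's commit record
	 * has already been decoded, final_lsn holds commit_lsn and must not
	 * be overwritten by an older change LSN.  For a transaction whose
	 * commit/abort record is never decoded (e.g. it was in progress
	 * when the server crashed), this leaves final_lsn at the last
	 * serialized change, which is exactly what
	 * ReorderBufferRestoreCleanup needs to remove the spilled files.
	 */
	if (txn->final_lsn < change->lsn)
		txn->final_lsn = change->lsn;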

#18vignesh C
vignesh21@gmail.com
In reply to: Dilip Kumar (#17)
2 attachment(s)
Re: Reorderbuffer crash during recovery

On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One minor comment.  Otherwise, the patch looks fine to me.
+ /*
+ * We set final_lsn on a transaction when we decode its commit or abort
+ * record, but we never see those records for crashed transactions.  To
+ * ensure cleanup of these transactions, set final_lsn to that of their
+ * last change; this causes ReorderBufferRestoreCleanup to do the right
+ * thing. Final_lsn would have been set with commit_lsn earlier when we
+ * decode it commit, no need to update in that case
+ */
+ if (txn->final_lsn < change->lsn)
+ txn->final_lsn = change->lsn;

/decode it commit,/decode its commit,

Thanks, Dilip, for reviewing.
I have fixed the comment as you suggested.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Reorder-buffer-crash-while-aborting-old-transactions.patch (text/x-patch)
From 6c5de8db4b6ac4da9d4ada0e9fa56133af0ed008 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Fri, 17 Jan 2020 07:34:31 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old, already-aborted transactions, the server crashed because
no reorderbuffer changes were present in the in-memory list: all of them had
been serialized to disk. Fix this by recording the last change's LSN while
serializing the reorderbuffer changes; that LSN is used as final_lsn, which
enables cleanup of the spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++++++++++--------------
 src/include/replication/reorderbuffer.h         |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbd908a..fa8a6b0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1974,21 +1974,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
-			{
-				ReorderBufferChange *last =
-				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-				txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2697,6 +2682,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing. Final_lsn would have been set with commit_lsn earlier when we
+	 * decode its commit; no need to update it in that case.
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4f6c65d..5b67457 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -208,8 +208,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized one, unless one of the above cases.
 	 * ----
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1

0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch (text/x-patch)
From e9a9983ac250f2e795bfdfe78cf50e9c6c7dc5a0 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Fri, 17 Jan 2020 07:00:14 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old, already-aborted transactions, the server crashed because
no reorderbuffer changes were present in the in-memory list: all of them had
been serialized to disk. Fix this by recording the last change's LSN while
serializing the reorderbuffer changes; that LSN is used as final_lsn, which
enables cleanup of the spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++++++++++--------------
 src/include/replication/reorderbuffer.h         |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6dcd973..fb2bf70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1877,21 +1877,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (txn->serialized && txn->final_lsn == 0)
-			{
-				ReorderBufferChange *last =
-					dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-				txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2467,6 +2452,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing. Final_lsn would have been set with commit_lsn earlier when we
+	 * decode its commit; no need to update it in that case.
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e5cbe25..cf2d4e7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -167,8 +167,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized one, unless one of the above cases.
 	 * ----
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#18)
Re: Reorderbuffer crash during recovery

On Fri, Jan 17, 2020 at 7:42 AM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One minor comment.  Otherwise, the patch looks fine to me.
+ /*
+ * We set final_lsn on a transaction when we decode its commit or abort
+ * record, but we never see those records for crashed transactions.  To
+ * ensure cleanup of these transactions, set final_lsn to that of their
+ * last change; this causes ReorderBufferRestoreCleanup to do the right
+ * thing. Final_lsn would have been set with commit_lsn earlier when we
+ * decode it commit, no need to update in that case
+ */
+ if (txn->final_lsn < change->lsn)
+ txn->final_lsn = change->lsn;

/decode it commit,/decode its commit,

Thanks, Dilip, for reviewing.
I have fixed the comment as you suggested.

Thanks for the updated patch. It looks fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: vignesh C (#18)
Re: Reorderbuffer crash during recovery

On 2020-Jan-17, vignesh C wrote:

Thanks, Dilip, for reviewing.
I have fixed the comment as you suggested.

I ended up rewording that comment completely; I thought the original was
not explaining things well enough.

I also changed the comment for final_lsn in reorderbuffer.h: not only did
I remove the line that I added in df9f682c7bf8 (the previous bugfix), but
I also removed the line that says "error during decoding", which came in
with the very first logical decoding commit (b89e151054a); I couldn't
find any evidence that final_lsn is being set on errors of any kind
(other than transaction abort, which doesn't seem to be an "error" in
that sense).

Please give these comments a read; maybe I have misunderstood something
and my comment is wrong.

Pushed now to all branches. Thanks, Vignesh, Amit and Dilip.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
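
For context, the reason final_lsn must be valid by cleanup time is visible
in ReorderBufferRestoreCleanup, which derives the range of spilled files to
unlink from first_lsn and final_lsn. Below is a simplified sketch
paraphrasing that function; the exact path construction and the
XLByteToSeg signature differ across branches:

	/* Paraphrased sketch; details vary across branches. */
	static void
	ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
	{
		XLogSegNo	first;
		XLogSegNo	cur;
		XLogSegNo	last;

		/* both endpoints must be known, or the file range is undefined */
		Assert(txn->first_lsn != InvalidXLogRecPtr);
		Assert(txn->final_lsn != InvalidXLogRecPtr);

		XLByteToSeg(txn->first_lsn, first, wal_segment_size);
		XLByteToSeg(txn->final_lsn, last, wal_segment_size);

		/* iterate over all possible file names, and delete them */
		for (cur = first; cur <= last; cur++)
		{
			char		path[MAXPGPATH];

			ReorderBufferSerializedPath(path, MyReplicationSlot,
										txn->xid, cur);
			if (unlink(path) != 0 && errno != ENOENT)
				ereport(ERROR,
						(errcode_for_file_access(),
						 errmsg("could not remove file \"%s\": %m",
								path)));
		}
	}

If final_lsn were left at InvalidXLogRecPtr, the second assertion above
would fire (the crash discussed in this thread); if it lagged behind the
newest serialized change, stale spill files could be left behind in
pg_replslot.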

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#20)
Re: Reorderbuffer crash during recovery

On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Jan-17, vignesh C wrote:

Thanks, Dilip, for reviewing.
I have fixed the comment as you suggested.

I ended up rewording that comment completely; I thought the original was
not explaining things well enough.

I also changed the comment for final_lsn in reorderbuffer.h: not only did
I remove the line that I added in df9f682c7bf8 (the previous bugfix), but
I also removed the line that says "error during decoding", which came in
with the very first logical decoding commit (b89e151054a); I couldn't
find any evidence that final_lsn is being set on errors of any kind
(other than transaction abort, which doesn't seem to be an "error" in
that sense).

Please give these comments a read; maybe I have misunderstood something
and my comment is wrong.

The comments you added look correct and good to me. Thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#22vignesh C
vignesh21@gmail.com
In reply to: Alvaro Herrera (#20)
Re: Reorderbuffer crash during recovery

On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Jan-17, vignesh C wrote:

Thanks, Dilip, for reviewing.
I have fixed the comment as you suggested.

I ended up rewording that comment completely; I thought the original was
not explaining things well enough.

I also changed the comment for final_lsn in reorderbuffer.h: not only did
I remove the line that I added in df9f682c7bf8 (the previous bugfix), but
I also removed the line that says "error during decoding", which came in
with the very first logical decoding commit (b89e151054a); I couldn't
find any evidence that final_lsn is being set on errors of any kind
(other than transaction abort, which doesn't seem to be an "error" in
that sense).

Please give these comments a read; maybe I have misunderstood something
and my comment is wrong.

Pushed now to all branches. Thanks, Vignesh, Amit and Dilip.

Thanks, Alvaro, for pushing this patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com