Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

Started by Alex Richmanover 3 years ago6 messagesbugs
Jump to latest
#1Alex Richman
alexrichman@onesignal.com

Hi,

We've noticed a segfault bug while creating logical replication slots.
We've seen this on 14.6-1 and 15.1-1 (both installed from postgres
apt-archive repos to debian 11 hosts). We're confident that this behaviour
is not present in 14.5-1.

We can only reproduce this if the server is under load (reads/writes from
clients, perhaps also pg_repacks, reindexes etc). If the server is idle
then we cannot reproduce it.

Dmesg logs:
211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp
00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]

Postgres logs:
2022-12-14 22:33:58 UTC 2297089STATEMENT: SELECT
pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput',
false);
2022-12-14 22:33:58 UTC 778310LOG: server process (PID 2297089) was
terminated by signal 11: Segmentation fault
2022-12-14 22:33:58 UTC 778310DETAIL: Failed process was running: SELECT
pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput',
false);
2022-12-14 22:33:58 UTC 778310LOG: terminating any other active server
processes

Journal logs:
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel:
postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068
error 4 in postgres[559079e4d000+51a000]
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code:
48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75 b8 e9
71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >

Let me know if you need any more info.

Thanks,
- Alex.

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Alex Richman (#1)
Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman@onesignal.com> wrote:

We've noticed a segfault bug while creating logical replication slots. We've seen this on 14.6-1 and 15.1-1 (both installed from postgres apt-archive repos to debian 11 hosts). We're confident that this behaviour is not present in 14.5-1.

We can only reproduce this if the server is under load (reads/writes from clients, perhaps also pg_repacks, reindexes etc). If the server is idle then we cannot reproduce it.

Dmesg logs:
211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]

Postgres logs:
2022-12-14 22:33:58 UTC 2297089STATEMENT: SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
2022-12-14 22:33:58 UTC 778310LOG: server process (PID 2297089) was terminated by signal 11: Segmentation fault
2022-12-14 22:33:58 UTC 778310DETAIL: Failed process was running: SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
2022-12-14 22:33:58 UTC 778310LOG: terminating any other active server processes

Journal logs:
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75 b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >

Let me know if you need any more info.

It is difficult to diagnose this without more information because
there is no clear call stack or a reproducible scenario. Will it be
possible for you to get a reproducible test case or at least call
stack to proceed?

--
With Regards,
Amit Kapila.

#3Alex Richman
alexrichman@onesignal.com
In reply to: Amit Kapila (#2)
Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

Apologies for the delay (and happy christmas/new years).

Please find included a full backtrace[1]https://gist.github.com/alex-richman-onesignal/0ba626ca4de00d24c331b4b224f0b399 of a sample of this crash,
replicated on postgres 15.1-1 in the same environment described in my
original email. Included as a gist due to the length but lmk if it should
be pasted in full for posterity. I've also added the python script[2]https://gist.github.com/alex-richman-onesignal/d1e51f2d4e0ae9da80f00457655b2eb8 used
to replicate, if that's relevant.

Unfortunately we have not been able to reproduce this in a clean room
environment, however we can note a few additional things:
- This has occurred over multiple distinct servers with different data
sets, though similar write loads. Suggesting it's not a specific server
with data corruption.
- Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the
bug can still occur
- Running the same script on a read-only logical replica does not hit the
bug
- As above, if the server is idle (no write traffic), then it does not hit
the bug
- The bug occurs roughly 1 in every 10 executions of the create replication
slot, so is not 100% consistent.
- We're fairly confident that this did not occur pre 14.5-1, and started
occurring in 14.6-1 & 15.1-1.
So we would assume that there is some concurrent write traffic from our web
tier that sometimes causes a segfault in the logical replication slot
creation.

Please let me know if you need any more information.

Thanks,
- Alex.

[1]: https://gist.github.com/alex-richman-onesignal/0ba626ca4de00d24c331b4b224f0b399
https://gist.github.com/alex-richman-onesignal/0ba626ca4de00d24c331b4b224f0b399
[2]: https://gist.github.com/alex-richman-onesignal/d1e51f2d4e0ae9da80f00457655b2eb8
https://gist.github.com/alex-richman-onesignal/d1e51f2d4e0ae9da80f00457655b2eb8

On Thu, 15 Dec 2022 at 03:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

Show quoted text

On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman@onesignal.com>
wrote:

We've noticed a segfault bug while creating logical replication slots.

We've seen this on 14.6-1 and 15.1-1 (both installed from postgres
apt-archive repos to debian 11 hosts). We're confident that this behaviour
is not present in 14.5-1.

We can only reproduce this if the server is under load (reads/writes

from clients, perhaps also pg_repacks, reindexes etc). If the server is
idle then we cannot reproduce it.

Dmesg logs:
211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp

00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]

Postgres logs:
2022-12-14 22:33:58 UTC 2297089STATEMENT: SELECT

pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput',
false);

2022-12-14 22:33:58 UTC 778310LOG: server process (PID 2297089) was

terminated by signal 11: Segmentation fault

2022-12-14 22:33:58 UTC 778310DETAIL: Failed process was running:

SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28',
'pgoutput', false);

2022-12-14 22:33:58 UTC 778310LOG: terminating any other active server

processes

Journal logs:
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel:

postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068
error 4 in postgres[559079e4d000+51a000]

Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel:

Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75
b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >

Let me know if you need any more info.

It is difficult to diagnose this without more information because
there is no clear call stack or a reproducible scenario. Will it be
possible for you to get a reproducible test case or at least call
stack to proceed?

--
With Regards,
Amit Kapila.

#4Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alex Richman (#3)
Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

Hi,

On Tue, Jan 3, 2023 at 9:57 PM Alex Richman <alexrichman@onesignal.com> wrote:

Apologies for the delay (and happy christmas/new years).

Please find included a full backtrace[1] of a sample of this crash, replicated on postgres 15.1-1 in the same environment described in my original email. Included as a gist due to the length but lmk if it should be pasted in full for posterity. I've also added the python script[2] used to replicate, if that's relevant.

Unfortunately we have not been able to reproduce this in a clean room environment, however we can note a few additional things:
- This has occurred over multiple distinct servers with different data sets, though similar write loads. Suggesting it's not a specific server with data corruption.
- Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the bug can still occur
- Running the same script on a read-only logical replica does not hit the bug
- As above, if the server is idle (no write traffic), then it does not hit the bug
- The bug occurs roughly 1 in every 10 executions of the create replication slot, so is not 100% consistent.
- We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in the logical replication slot creation.

Please let me know if you need any more information.

Thank you for providing more information.

One possibility is that you encountered the bug in snapbuild.c that is
already fixed by commit 898ef41bf6f4 and will be included in 14.7 and
15.2. I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_pg15.patchapplication/octet-stream; name=fix_pg15.patchDownload+7-0
fix_pg14.patchapplication/octet-stream; name=fix_pg14.patchDownload+7-0
#5Alex Richman
alexrichman@onesignal.com
In reply to: Masahiko Sawada (#4)
Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

I applied your pg15 patch to the 15.1 source and it seems to be fixed,
thanks!

Look forward to the 15.2 release so we can deploy this more broadly.

Thanks,
- Alex.

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alex Richman (#5)
Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

On Sat, Jan 7, 2023 at 12:20 AM Alex Richman <alexrichman@onesignal.com> wrote:

I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

I applied your pg15 patch to the 15.1 source and it seems to be fixed, thanks!

Look forward to the 15.2 release so we can deploy this more broadly.

Thank you for your confirmation!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com