Re: crash with synchronized_standby_slots

Started by Alvaro Herreraabout 1 year ago2 messages

alvherre@alvh.no-ip.org

about 1 year ago

On 2024-Nov-29, Amit Kapila wrote:

I tried it on my Windows machine and noticed that ReplicationSlotCtl
is NULL for syslogger, so the problem doesn't occur. The reason is
that we don't attach to shared memory in syslogger, so ideally
ReplicationSlotCtl should be NULL. Because we inherit everything
through the fork for Linux systems and then later for processes that
don't want to attach to shared memory, we call PGSharedMemoryDetach()
from postmaster_child_launch(). The PGSharedMemoryDetach won't
reinitialize the memory pointed to by ReplicationSlotCtl, so, it would
be an invalid memory.

Heh, interesting. I'm not sure if we should try to do something about
invalid pointers being left around after shmem initialization. Also, is
this the first GUC check_hook that needs to take an LWLock?

Anyway, I have pushed this.

BTW it occurs to me that there might well be some sort of thundering
herd problem if every process needs to run the check_hook when a SIGHUP
is broadcast, and they'll all be waiting on that particular lwlock and
run the same validation locally again and again. I bet if you have a
few thousand backends (hi Jakub! [1]/messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com) it's problematic. Maybe we need a
different way to validate the GUC, but I don't know what that might be;
but doing the validation once and storing the result in shmem might be
better.

On 2024-Nov-29, Zhijie Hou (Fujitsu) wrote:

I can also reproduce this bug and confirmed that the bug is fixed
after applying the patch. In addition to the regression tests, I also
manually tested the behavior of the postmaster, walsender, and user
backend after reloading the configuration, and they all work as
expected.

Many thanks for testing!

[1]: /messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"I apologize for the confusion in my previous responses.
There appears to be an error." (ChatGPT)

Import Notes

Reply to msg id not found: OS0PR01MB571695C993F2B1738C6358FA942A2@OS0PR01MB5716.jpnprd01.prod.outlook.comCAA4eK1JmXd8VsKkH2dpnw5un-GUHd3PjaGwZZW73TDVKrHvNQ@mail.gmail.com

Amit Kapila

amit.kapila16@gmail.com

about 1 year ago

In reply to: Alvaro Herrera (#1)

On Tue, Dec 3, 2024 at 10:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2024-Nov-29, Amit Kapila wrote:

BTW it occurs to me that there might well be some sort of thundering
herd problem if every process needs to run the check_hook when a SIGHUP
is broadcast, and they'll all be waiting on that particular lwlock and
run the same validation locally again and again. I bet if you have a
few thousand backends (hi Jakub! [1]) it's problematic.

The lock is taken in shared mode, so, ideally, it shouldn't create a
problem but if it ever creates a problem, we can even skip that check
during validation. The validation will anyway happen later during
replication in StandbySlotsHaveCaughtup(). This check is mostly to
detect the error in GUC early.

Maybe we need a
different way to validate the GUC, but I don't know what that might be;
but doing the validation once and storing the result in shmem might be
better.

What if that particular GUC changes again? We may need some additional
invalidation mechanism.

--
With Regards,
Amit Kapila.