Postmaster crashed during start
Hi,
When we kill the postmaster with kill -9 and start it again immediately, it crashes with:
FATAL: pre-existing shared memory block (key 2495405, ID 360501) is still in use
HINT: Terminate any old server processes associated with data directory
*We can reproduce this with:*
kill -9 $(head -n 1 $PGDATA/postmaster.pid) & ./pg_ctl -D $PGDATA -l $PGDATA/logfile start
*Reason for the crash:*
When the postmaster is killed with signal 9, its cleanup does not run, so it
never detaches from the shared memory segment itself. The kernel does the
detach once the process dies, after which shm_nattch drops to 0. But if we
start the postmaster again before the kernel has gotten around to that, the
new postmaster, while creating postmaster.pid in CreateDataDirLockFile(),
checks whether the previous shared memory segment is still in use. That check
depends on
shmStat.shm_nattch == 0 ? SHMSTATE_UNATTACHED : SHMSTATE_ATTACHED;
Because the kernel has not caught up yet, shm_nattch is still 1, so the new
postmaster thinks the segment is in use and crashes.
Should we consider this a bug at all, or leave it as is, given that it depends
on how busy the kernel is and whether it has had time to clean up after the
dead postmaster, that is, to detach it and decrement shm_nattch?
Thoughts?
Thanks and Regards
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com
Srinath Reddy <srinath2133@gmail.com> writes:
when we kill postmaster using kill -9 and start immediately it crashes with
FATAL: pre-existing shared memory block (key 2495405, ID 360501) is still
in use
"Doctor, it hurts when I do this!"
"So don't do that!"
This is not a supported way of shutting down the postmaster, and it
never will be. Use SIGINT, or SIGQUIT if you are in a desperate
hurry and are willing to have the next startup take longer.
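
For anyone scripting shutdowns, the signals named above match PostgreSQL's documented shutdown modes (SIGTERM for smart, SIGINT for fast, SIGQUIT for immediate). A minimal Python sketch, assuming the postmaster PID is the first line of postmaster.pid, as the reproduction command earlier in the thread relies on. The function names are hypothetical and not part of PostgreSQL:

```python
import os
import signal
from pathlib import Path

# Documented shutdown modes of the postgres server process.
SHUTDOWN_SIGNALS = {
    "smart": signal.SIGTERM,      # wait for sessions to finish
    "fast": signal.SIGINT,        # abort sessions, then shut down cleanly
    "immediate": signal.SIGQUIT,  # no clean shutdown; recovery runs on next start
}


def shutdown_signal(mode: str) -> signal.Signals:
    """Return the signal for a shutdown mode; SIGKILL is deliberately absent."""
    try:
        return SHUTDOWN_SIGNALS[mode]
    except KeyError:
        raise ValueError(f"unsupported shutdown mode: {mode!r}") from None


def read_postmaster_pid(pgdata: str) -> int:
    """Read the postmaster PID from the first line of postmaster.pid."""
    first_line = Path(pgdata, "postmaster.pid").read_text().splitlines()[0]
    return int(first_line.strip())


def stop_postmaster(pgdata: str, mode: str = "fast") -> None:
    """Send the signal for `mode` to the postmaster (does not wait for exit)."""
    os.kill(read_postmaster_pid(pgdata), shutdown_signal(mode))
```

In practice `pg_ctl stop -m fast` sends the same signal and also waits for the shutdown to complete, so the sketch is mainly useful for seeing which signal each mode maps to.
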
I think the specific reason you are seeing this is that it takes
nonzero time for the postmaster's orphaned child processes to
notice that the postmaster is gone and terminate. As long as
any of those children remain, the shared memory block will have
a nonzero reference count. The new postmaster sees that and
refuses to start, for the very sound reason that it risks
data corruption if it brings up a new set of worker processes
while any of the old ones are still running.
regards, tom lane
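
The reference count discussed here can be observed from outside the server. A minimal sketch, assuming Linux, where /proc/sysvipc/shm lists System V segments as a whitespace-separated table whose header row includes `shmid` and `nattch` columns; the helper names are hypothetical. Per shmat(2), a child created by fork() inherits its parent's attached segments, which is consistent with the count staying above zero while orphaned children are still running.

```python
from typing import Optional


def nattch_for_shmid(table_text: str, shmid: int) -> Optional[int]:
    """Return the attach count for `shmid`, or None if no such segment is listed.

    `table_text` is expected to look like the contents of /proc/sysvipc/shm:
    a header row naming the columns, then one row per segment.
    """
    lines = table_text.strip().splitlines()
    if not lines:
        return None
    header = lines[0].split()
    id_col = header.index("shmid")
    nattch_col = header.index("nattch")
    for row in lines[1:]:
        fields = row.split()
        if len(fields) <= max(id_col, nattch_col):
            continue  # skip malformed or truncated rows
        if int(fields[id_col]) == shmid:
            return int(fields[nattch_col])
    return None


def segment_in_use(table_text: str, shmid: int) -> bool:
    """Mirror the check quoted in the thread: a nonzero attach count means in use."""
    count = nattch_for_shmid(table_text, shmid)
    return count is not None and count > 0


# Example (Linux only), using the segment ID from the error message:
#   with open("/proc/sysvipc/shm") as f:
#       print(segment_in_use(f.read(), 360501))
```

Polling this until it reports the segment as unused (or gone) is one way for a test to wait out the window described above before restarting.
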
On Wed, Feb 26, 2025 at 9:50 AM Srinath Reddy <srinath2133@gmail.com> wrote:
On Wed, Feb 26, 2025 at 9:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Srinath Reddy <srinath2133@gmail.com> writes:
when we kill postmaster using kill -9 and start immediately it crashes
with
FATAL: pre-existing shared memory block (key 2495405, ID 360501) is
still
in use
"Doctor, it hurts when I do this!"
"So don't do that!"
This is not a supported way of shutting down the postmaster, and it
never will be. Use SIGINT, or SIGQUIT if you are in a desperate
hurry and are willing to have the next startup take longer.

I was actually trying to recreate a power-outage scenario using
node->kill9() followed by node->start() in a custom TAP test; that is how I found this crash.

I think the specific reason you are seeing this is that it takes
nonzero time for the postmaster's orphaned child processes to
notice that the postmaster is gone and terminate. As long as
any of those children remain, the shared memory block will have
a nonzero reference count. The new postmaster sees that and
refuses to start, for the very sound reason that it risks
data corruption if it brings up a new set of worker processes
while any of the old ones are still running.
regards, tom lane
I am guessing that by "reference count" of the shared memory block you mean
shm_nattch, right? As I understand it, shm_nattch is incremented by 1 when a
process attaches to the segment with shmat(). In Postgres it is the
postmaster that attaches, when it creates the segment, and it detaches in its
on_shmem_exit callback, which runs only if it exits properly rather than
dying suddenly (as with kill -9). Only a detach decrements shm_nattch by 1.
AFAIK the child processes get to use the segment but never attach or detach
themselves, so they do not affect shm_nattch. Since shm_nattch is not 0,
PGSharedMemoryAttach thinks the segment is still attached and in use.
On Wed, Feb 26, 2025 at 12:31 AM Srinath Reddy <srinath2133@gmail.com>
wrote:
i was actually trying to recreate power outage scenario using
node->kill9(),node->start() in a custom tap test,then i found this crash.
LOL, that's not a power outage test, that's a kill -9 postgres test. A true
power outage would take care of any shared memory problems as well.
Carefully clear the shared memory as part of the test (you can find the key
in postmaster.pid), or do a proper test with something like:
echo b > /proc/sysrq-trigger
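
To clear the segment as part of a test, the key and ID first have to be read from postmaster.pid. A minimal sketch, assuming the lock-file layout used by recent PostgreSQL versions (line 1 is the postmaster PID, line 7 holds the shared memory key and ID separated by whitespace); check this against your server version. The type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class LockFileInfo(NamedTuple):
    pid: int
    shmem_key: Optional[int]
    shmem_id: Optional[int]


def parse_postmaster_pid(text: str) -> LockFileInfo:
    """Extract the PID and, if present, the shared memory key and ID.

    Assumed layout: line 1 = postmaster PID, line 7 = "<key> <id>".
    A file that is too short, or whose line 7 is not two fields, yields None
    for both the key and the ID.
    """
    lines = text.splitlines()
    pid = int(lines[0].strip())
    key = shm_id = None
    if len(lines) >= 7:
        parts = lines[6].split()
        if len(parts) == 2:
            key, shm_id = int(parts[0]), int(parts[1])
    return LockFileInfo(pid, key, shm_id)
```

With the ID in hand, `ipcrm -m <id>` removes the segment. Only do that once no process from the old cluster is still running, for the data-corruption reason given earlier in the thread.
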
i am guessing you mean "reference count to shared memory block" means
shmem_nattach right? i think this will be incremented by 1 when a process
attached to the shmem segment using shmat() in postgres case its the
postmaster who attaches during creation of shmem segment and detaches
during postmaster's on_shmem_exit is called during if it exits properly or
not dies suddenly (as the case with kill -9) ,during detaching only the
shmem_nattach will be decremented by 1 ,AFAIK the child processes will get
to use the shmem segment but never attaches or detaches so they are not
effecting the shmem_nattach.so as the shmem_nattach is not
0 PGSharedMemoryAttach thinks the shmem state is still attached and in use.
You might be overthinking this. A server crash is much more likely than a
random postgres crash. Test the former, by all means. The latter is
expected to have some potential manual cleanup, for safety reasons as
explained above.
--
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support