[BUG] Re-entering malloc problem when use --enable-nls build postgresql

Started by 158306855almost 8 years ago15 messagesbugs
Jump to latest
#1158306855
anderson2013@qq.com

HI pg team:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.
5. This is the deadlock process stack:

Attachments:

802FF513@D9CBCD09.6F04F15A.jpgimage/jpeg; name="802FF513@D9CBCD09.6F04F15A.jpg"Download
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: 158306855 (#1)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I can't get excited about this. quickdie's attempt to report that it's
killing the process is necessarily a "best effort" undertaking, because
we cannot be sure that the process is in a good state. In this situation,
it isn't. --enable-nls might make the odds of that a bit worse, but we
could get such a failure regardless.

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work. That's one reason why
the postmaster has a timeout-and-then-hard-SIGKILL behavior.

regards, tom lane

#3Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.

I can't get excited about this. quickdie's attempt to report that it's
killing the process is necessarily a "best effort" undertaking, because
we cannot be sure that the process is in a good state. In this situation,
it isn't. --enable-nls might make the odds of that a bit worse, but we
could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

Andres Freund <andres@anarazel.de> writes:

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out.

To stderr, maybe. Across an SSL-encrypted client connection? You're
dreaming.

regards, tom lane

#5158306855
anderson2013@qq.com
In reply to: Andres Freund (#3)
Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

I agree with this solution when build postgres with enable-nls.

------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Tue, May 8, 2018 01:57 PM
To: "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc: "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.

I can't get excited about this. quickdie's attempt to report that it's
killing the process is necessarily a "best effort" undertaking, because
we cannot be sure that the process is in a good state. In this situation,
it isn't. --enable-nls might make the odds of that a bit worse, but we
could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund

#6Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On 2018-05-08 02:07:08 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out.

To stderr, maybe. Across an SSL-encrypted client connection? You're
dreaming.

libpq invents an equivalent message when the server closes the
connection anyway, IIRC. So that'd not necessarily be too bad.

Greetings,

Andres Freund

#7Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On 2018-05-08 01:36:31 -0700, Andres Freund wrote:

On 2018-05-08 02:07:08 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out.

To stderr, maybe. Across an SSL-encrypted client connection? You're
dreaming.

libpq invents an equivalent message when the server closes the
connection anyway, IIRC. So that'd not necessarily be too bad.

Oh, also: It looks like it'd actually be relatively easy to give openssl
its own memory allocator + pool:
Create a global 'openssl' memory context with preallocation, use
CRYPTO_set_mem_functions() to make openssl allocations go through small
wrapper functions around palloc/repalloc/pfree.

It's still not entirely kosher to call into openssl from a signal
handler because we could be inside openssl - but the window for that is
a lot smaller than being inside *any* memory allocation.

Greetings,

Andres Freund

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#7)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote:

On 2018-05-08 01:36:31 -0700, Andres Freund wrote:

On 2018-05-08 02:07:08 -0400, Tom Lane wrote:

To stderr, maybe. Across an SSL-encrypted client connection? You're
dreaming.

libpq invents an equivalent message when the server closes the
connection anyway, IIRC. So that'd not necessarily be too bad.

Oh, also: It looks like it'd actually be relatively easy to give openssl
its own memory allocator + pool:
Create a global 'openssl' memory context with preallocation, use
CRYPTO_set_mem_functions() to make openssl allocations go through small
wrapper functions around palloc/repalloc/pfree.

It's still not entirely kosher to call into openssl from a signal
handler because we could be inside openssl - but the window for that is
a lot smaller than being inside *any* memory allocation.

Can't we use a more traditional signal handling style to defer
execution here? 1. Set an in_OpenSSL flag whenever you're about to
enter OpenSSL. 2. In the SIGQUIT handler, if you find that flag is
set, then just set got_SIGQUIT and return. 3. After you leave the
OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly.

--
Thomas Munro
http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#8)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On 2018-05-09 09:30:58 +1200, Thomas Munro wrote:

On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote:

On 2018-05-08 01:36:31 -0700, Andres Freund wrote:

On 2018-05-08 02:07:08 -0400, Tom Lane wrote:

To stderr, maybe. Across an SSL-encrypted client connection? You're
dreaming.

libpq invents an equivalent message when the server closes the
connection anyway, IIRC. So that'd not necessarily be too bad.

Oh, also: It looks like it'd actually be relatively easy to give openssl
its own memory allocator + pool:
Create a global 'openssl' memory context with preallocation, use
CRYPTO_set_mem_functions() to make openssl allocations go through small
wrapper functions around palloc/repalloc/pfree.

It's still not entirely kosher to call into openssl from a signal
handler because we could be inside openssl - but the window for that is
a lot smaller than being inside *any* memory allocation.

Can't we use a more traditional signal handling style to defer
execution here? 1. Set an in_OpenSSL flag whenever you're about to
enter OpenSSL. 2. In the SIGQUIT handler, if you find that flag is
set, then just set got_SIGQUIT and return. 3. After you leave the
OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly.

Well, the question is if that'd ever have us defer killing the process
for longer. quickdie is intended to actually die quickly. If there's
rekeying, a large COPY incoming, or something like that it could take a
bit. Aren't there also still places where we do blocking network IO?

Greetings,

Andres Freund

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#9)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

Andres Freund <andres@anarazel.de> writes:

Well, the question is if that'd ever have us defer killing the process
for longer. quickdie is intended to actually die quickly.

Yeah. Though now that we have the postmaster mechanism to wait-five-
seconds-then-SIGKILL, maybe we could rethink that requirement? If we
reimplemented SIGQUIT to work more like SIGTERM, it would surely be
a lot safer. There would be cases where a stuck backend wouldn't
respond and it'd eventually get SIGKILL'd, but in return we'd get rid
of problems like this one.

On balance I'm not sure whether that solution is more or less user-
friendly than the current setup, but for sure it'd be a lot easier
to reason about.

regards, tom lane

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

On 2018-05-08 18:04:07 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Well, the question is if that'd ever have us defer killing the process
for longer. quickdie is intended to actually die quickly.

Yeah. Though now that we have the postmaster mechanism to wait-five-
seconds-then-SIGKILL, maybe we could rethink that requirement? If we
reimplemented SIGQUIT to work more like SIGTERM, it would surely be
a lot safer. There would be cases where a stuck backend wouldn't
respond and it'd eventually get SIGKILL'd, but in return we'd get rid
of problems like this one.

Right now we intentionally do not accept interrupts in a couple places
where we want to die quickly because we're making persistent changes. I
don't think it'd be good to continue e.g. committing any longer than
possible after one process segfaulted. One counter-argument to that is
that the timing right now is far from synchronous either.

Greetings,

Andres Freund

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

Andres Freund <andres@anarazel.de> writes:

On 2018-05-08 18:04:07 -0400, Tom Lane wrote:

Yeah. Though now that we have the postmaster mechanism to wait-five-
seconds-then-SIGKILL, maybe we could rethink that requirement? If we
reimplemented SIGQUIT to work more like SIGTERM, it would surely be
a lot safer. There would be cases where a stuck backend wouldn't
respond and it'd eventually get SIGKILL'd, but in return we'd get rid
of problems like this one.

Right now we intentionally do not accept interrupts in a couple places
where we want to die quickly because we're making persistent changes. I
don't think it'd be good to continue e.g. committing any longer than
possible after one process segfaulted.

Well, that's an implementation question. I was sort of envisioning that
we could allow CHECK_FOR_INTERRUPTS to respond to a SIGQUIT interrupt
even when regular interrupts are disabled, reasoning that if we're at
a CHECK_FOR_INTERRUPTS at all, then we should be someplace where it's
more or less safe to trigger the exit. If we had it like that, then
for example the commit sequence could put a CHECK_FOR_INTERRUPTS at
the last possible moment before committing, knowing that regular
interrupts are held off --- but if there's a SIGQUIT pending we'd
take it.

regards, tom lane

#13158306855
anderson2013@qq.com
In reply to: Andres Freund (#3)
Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

I wrote a patch to improve this problem.

zeng wenjing

------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Tue, May 8, 2018 01:57 PM
To: "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc: "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:

"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.

I can't get excited about this. quickdie's attempt to report that it's
killing the process is necessarily a "best effort" undertaking, because
we cannot be sure that the process is in a good state. In this situation,
it isn't. --enable-nls might make the odds of that a bit worse, but we
could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.

There are not any better alternatives. We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag. We could exit without attempting to report anything,
but nobody would find that user-friendly. So we try to report, in the
full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund

Attachments:

ereport_message_not_translateit_in_quickdie.patchapplication/octet-stream; charset=ISO-8859-1; name=ereport_message_not_translateit_in_quickdie.patchDownload+66-3
#14Andres Freund
andres@anarazel.de
In reply to: 158306855 (#13)
Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

On 2018-05-21 10:49:53 +0800, 158306855 wrote:

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

I wrote a patch to improve this problem.

+extern int	errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+extern int	errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+extern int	errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+
/*
* errcontext() is typically called in error context callback functions, not
* within an ereport() invocation. The callback function can be in a different

Can't we just reuse errmsg_internal etc?

Greetings,

Andres Freund

#15158306855
anderson2013@qq.com
In reply to: Andres Freund (#14)
Re: [BUG] Re-entering malloc problem when use --enable-nlsbuildpostgresql

Can't we just reuse errmsg_internal etc?

Yes, the *_internal function can be reused. This is my modified patch.
Please take a look.

Zeng Wenjing

------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Wed, May 23, 2018 05:11 AM
To: "158306855"<anderson2013@qq.com>;
Cc: "Tom Lane"<tgl@sss.pgh.pa.us>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nlsbuildpostgresql

On 2018-05-21 10:49:53 +0800, 158306855 wrote:

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.

I wrote a patch to improve this problem.

+extern int	errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+extern int	errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+extern int	errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
+
/*
* errcontext() is typically called in error context callback functions, not
* within an ereport() invocation. The callback function can be in a different

Can't we just reuse errmsg_internal etc?

Greetings,

Andres Freund

Attachments:

ereport_message_not_translateit_in_quickdie_v2.patchapplication/octet-stream; charset=ISO-8859-1; name=ereport_message_not_translateit_in_quickdie_v2.patchDownload+25-3