Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

Started by Liam Bowenover 4 years ago4 messagesbugs

liambowen@gmail.com

over 4 years ago

This is to expand on this thread that Clemens started (Re: BUG #17299: Exit
code 3 when open connections concurrently (PQisthreadsafe() == 1))
/messages/by-id/45a5a8c0-da4c-31f7-0bf9-23a622bc44e6@sussol.net

This bug affects me and I would like to help to resolve it. Unfortunately,
the only way that I can reproduce this is by using Rust (with Diesel and
r2d2), but I think that has to do with multithreading because I can only
reproduce the crash around 90% of the time by invoking the executable the
same way. Also, reducing the number of connections makes the problem go
away. When my postmortem debugger is launched after a crash, however, there
is only one thread that is running.

I used the EnterpriseDB installer for several versions, and narrowed down
the bug as being introduced between 13.5 and 14.1. Then I used git bisect
to narrow down which revision actually introduced the bug. Each time, I
would build libpq and copy the DLL into the same directory as my executable
and verify that my build of libpq was being loaded. Eventually my bisection
pointed to 52a1022.

Here is a debugging session:
https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18. I really
couldn't glean too much that was new from this, but I am confident that the
bug was not present before revision 52a1022. One thing that I found a bit
strange is that in libpq_binddomain, ldir = "/share/locale" which looks
like a Unix path and this bug only happens on Windows. Is this relevant? I
have no idea. This frame seems to have the values I would expect:
https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18#check-out-frame-9
-- displayed_host, displayed_port, and host_addr all seem fine. And
conn->errorMessage is empty, which seems right too. I was trying to find
values that would create memory corruption, like a buffer overflow or
something, but haven't found any yet.

It is true that the immediate crash is in libintl-9.dll -- however, I'm
confident that almost everyone who's using Postgres on Windows is using the
EnterpriseDB distribution, and I verified that in all of the recent
versions (including 12.* and 13.*), the libintl-9.dll is exactly the same
as in 14.*. I can't find a way to build libintl-9.dll in the exact same way
as EnterpriseDB, and the instructions for obtaining it in the documentation
haven't worked for a long time (I reported that on pgsql-docs). This really
hampers my debugging; I don't know what revision is being used to build
libintl-9.dll or various other details that would make the build
reproducible so I could at least get relevant symbols and links to source.

So, at least I found a problematic revision. I'm a bit stuck at this point,
but I'm happy to provide any more information that I can. Perhaps a dump
would be useful; I can send one to whomever wants it. Thank you all for
your time.

--
Liam Bowen

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Liam Bowen (#1)

Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

Liam Bowen <liambowen@gmail.com> writes:

I used the EnterpriseDB installer for several versions, and narrowed down
the bug as being introduced between 13.5 and 14.1. Then I used git bisect
to narrow down which revision actually introduced the bug. Each time, I
would build libpq and copy the DLL into the same directory as my executable
and verify that my build of libpq was being loaded. Eventually my bisection
pointed to 52a1022.

Hmmm ....

Here is a debugging session:
https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18.

So as with the original bug #17299 report, this is showing an abort()
failure inside libintl, although yours is in a slightly different
place. It would be interesting to look into the libintl code and
try to understand what it's seeing as wrong. Can you get debug
data out of those stack frames?

My gut feeling right now is that the reason 52a1022 tickles this
is that it causes us to invoke gettext() in the normal, successful
PQconnect code paths. I'm too tired to go look, but I suspect that
before that patch, a non-error connection would never perform any
message translations (or at least would not do so in the most basic
cases). So that would result in us hitting gettext() much more
heavily than before.

Given that you're only seeing this when performing concurrent
connections (or at least that's how I understood what you said),
I'm eyeing libpq_binddomain with a bit of suspicion. Does it
help to move down the update of already_bound, like this?

const char *ldir;

- already_bound = true;
/* No relocatable lookup here because the binary could be anywhere */
ldir = getenv("PGLOCALEDIR");
if (!ldir)
ldir = LOCALEDIR;
bindtextdomain(PG_TEXTDOMAIN("libpq"), ldir);
+ already_bound = true;
#ifdef WIN32
SetLastError(save_errno);
#else
errno = save_errno;
#endif

My thought here is that if two threads went through this code
at about the same time, it'd be possible for one of them to
decide that the bindtextdomain call had already happened when
it had not. That could lead to that thread calling gettext
and then the other one calling bindtextdomain later than that;
and maybe libintl's response to that is as unfriendly as abort().

The above quick-hack change would result in both threads calling
bindtextdomain, which I'm guessing libintl is prepared to cope
with (at least, its docs claim bindtextdomain is "MT-Safe").
If it isn't really, then we might have to additionally add a
critical section here.

regards, tom lane

Liam Bowen

liambowen@gmail.com

over 4 years ago

In reply to: Tom Lane (#2)

Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

I will try your fix tomorrow. That's a good idea. You asked about debug
information for the libintl stack frame. I wish this were easier, but it
looks like the Enterprise DB folks don't include a PDB file for that DLL,
although they do include them for almost every other module. In addition, I
have no way of knowing what revision or config switches they used to build
it. Any idea how I could contact someone at Enterprise DB that could help
us out here? I would really rather just get the symbols and source revision
from them, as building it myself on Windows is quite a process and may not
contain the same bug.

Thanks for a quick reply, Tom!

On Thu, Jan 20, 2022, 22:56 Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Liam Bowen <liambowen@gmail.com> writes:

I used the EnterpriseDB installer for several versions, and narrowed down
the bug as being introduced between 13.5 and 14.1. Then I used git bisect
to narrow down which revision actually introduced the bug. Each time, I
would build libpq and copy the DLL into the same directory as my

executable

and verify that my build of libpq was being loaded. Eventually my

bisection

pointed to 52a1022.

Hmmm ....

Here is a debugging session:
https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18.

So as with the original bug #17299 report, this is showing an abort()
failure inside libintl, although yours is in a slightly different
place. It would be interesting to look into the libintl code and
try to understand what it's seeing as wrong. Can you get debug
data out of those stack frames?

My gut feeling right now is that the reason 52a1022 tickles this
is that it causes us to invoke gettext() in the normal, successful
PQconnect code paths. I'm too tired to go look, but I suspect that
before that patch, a non-error connection would never perform any
message translations (or at least would not do so in the most basic
cases). So that would result in us hitting gettext() much more
heavily than before.

Given that you're only seeing this when performing concurrent
connections (or at least that's how I understood what you said),
I'm eyeing libpq_binddomain with a bit of suspicion. Does it
help to move down the update of already_bound, like this?

const char *ldir;

- already_bound = true;
/* No relocatable lookup here because the binary could be
anywhere */
ldir = getenv("PGLOCALEDIR");
if (!ldir)
ldir = LOCALEDIR;
bindtextdomain(PG_TEXTDOMAIN("libpq"), ldir);
+ already_bound = true;
#ifdef WIN32
SetLastError(save_errno);
#else
errno = save_errno;
#endif

My thought here is that if two threads went through this code
at about the same time, it'd be possible for one of them to
decide that the bindtextdomain call had already happened when
it had not. That could lead to that thread calling gettext
and then the other one calling bindtextdomain later than that;
and maybe libintl's response to that is as unfriendly as abort().

The above quick-hack change would result in both threads calling
bindtextdomain, which I'm guessing libintl is prepared to cope
with (at least, its docs claim bindtextdomain is "MT-Safe").
If it isn't really, then we might have to additionally add a
critical section here.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Liam Bowen (#1)

Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

Liam Bowen <liambowen@gmail.com> writes:

Hey Tom, this quick patch unfortunately didn't work, but now I'm looking
into other options.

Yeah. I've concluded that that *is* a bug (and just committed the fix)
but it doesn't explain what you're seeing.

regards, tom lane

Import Notes

Reply to msg id not found: CAE7q7EjrO+Sm2XkzLY2OwuDRQT13gkpaG1TFmW-Pgh8bP0Unog@mail.gmail.com