logical replication launcher crash on buildfarm

Started by Andres Freund about 9 years ago, 41 messages, hackers
#1 Andres Freund
andres@anarazel.de

Hi,

I just unstuck a bunch of my buildfarm animals. That triggered some
spurious failures (on piculet, calliphoridae, mylodon), but also one
that doesn't really look like that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03

with the pertinent point being:

================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
[New LWP 1894]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: logical replication launcher '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055e265bff5e3 in ?? ()
#0 0x000055e265bff5e3 in ?? ()
#1 0x000055d3ccabed0d in StartBackgroundWorker () at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
#2 0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
#3 0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205

it's possible that me killing things and upgrading caused this, but
given this is a backend running EXEC_BACKEND, I'm a bit suspicious that
it's more than that. The machine is a bit backed up at the moment, so
it'll probably be a while till it's at that animal/branch again,
otherwise I'd not have mentioned this.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: logical replication launcher crash on buildfarm

On 2017-03-15 20:28:33 -0700, Andres Freund wrote:

it's possible that me killing things and upgrading caused this, but
given this is a backend running EXEC_BACKEND, I'm a bit suspicious that
it's more than that. The machine is a bit backed up at the moment, so
it'll probably be a while till it's at that animal/branch again,
otherwise I'd not have mentioned this.

For some reason it ran again pretty soon. And I'm afraid it's indeed an
issue:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02

- Andres

#3 Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: logical replication launcher crash on buildfarm

On 16/03/17 04:42, Andres Freund wrote:

For some reason it ran again pretty soon. And I'm afraid it's indeed an
issue:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02

Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
seems to work fine on my two machines. I don't see anything else
different on culicidae though. Sadly the backtrace is not that
informative either. I'll try to investigate more but it will take time...

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4 Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#3)
Re: logical replication launcher crash on buildfarm

On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:

Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
seems to work fine on my two machines. I don't see anything else
different on culicidae though. Sadly the backtrace is not that
informative either. I'll try to investigate more but it will take time...

I can give you a login to that machine, it doesn't do anything but run
buildfarm animals... Will have to be my tomorrow however.

(Also need to fix config for older branches that don't work with
the upgraded ssl. This is a really bad situation :()

Greetings,

Andres Freund

#5 Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#4)
Re: logical replication launcher crash on buildfarm

On 16/03/17 09:44, Andres Freund wrote:

I can give you a login to that machine, it doesn't do anything but run
buildfarm animals... Will have to be my tomorrow however.

That would be helpful, thanks.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#3)
Re: logical replication launcher crash on buildfarm

On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:

Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
seems to work fine on my two machines. I don't see anything else
different on culicidae though. Sadly the backtrace is not that
informative either. I'll try to investigate more but it will take time...

Worthwhile additional failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01

Same animal, also EXEC_BACKEND, but 9.6.

A quick look at the relevant line:
    /*
     * If bgw_main is set, we use that value as the initial entrypoint.
     * However, if the library containing the entrypoint wasn't loaded at
     * postmaster startup time, passing it as a direct function pointer is not
     * possible. To work around that, we allow callers for whom a function
     * pointer is not available to pass a library name (which will be loaded,
     * if necessary) and a function name (which will be looked up in the named
     * library).
     */
    if (worker->bgw_main != NULL)
        entrypt = worker->bgw_main;

makes the issue clear - we appear to be assuming that bgw_main is
meaningful across processes. Which it isn't in the EXEC_BACKEND case
when ASLR is in use...
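
To make the failure mode concrete, here is a minimal standalone sketch (not PostgreSQL code; demo_worker_main and demo_entry_address are made-up names for illustration). It only exposes the runtime address of a function, which is the kind of value a bgw_main-style pointer would carry from the postmaster into a freshly exec'ed child:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical worker entry point; stands in for a builtin bgworker main. */
static void
demo_worker_main(void)
{
}

/*
 * Return the address of the entry point in *this* process image.
 * The value is only meaningful inside the process that computed it.
 */
uintptr_t
demo_entry_address(void)
{
    /* Assumption: function pointers fit in uintptr_t (true on common platforms). */
    assert(sizeof(uintptr_t) >= sizeof(void (*) (void)));
    return (uintptr_t) demo_worker_main;
}
```

Built as a position-independent executable on a system with ASLR enabled, two separate runs printing demo_entry_address() would typically report different values, so an address captured in one process image is not something a re-exec'ed process can safely jump to.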

This kinda sounds familiar, but a quick google search doesn't find
anything relevant.

Greetings,

Andres Freund

#7 Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#6)
Re: logical replication launcher crash on buildfarm

On 16/03/17 09:53, Andres Freund wrote:

This kinda sounds familiar, but a quick google search doesn't find
anything relevant.

Hmm now that you mention it, I remember discussing something similar
with you last year in Dallas in regards to parallel query. IIRC Windows
should not have this problem but other systems with EXEC_BACKEND do.
Don't remember the details though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#7)
Re: logical replication launcher crash on buildfarm

On Thu, Mar 16, 2017 at 5:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

Hmm now that you mention it, I remember discussing something similar
with you last year in Dallas in regards to parallel query. IIRC Windows
should not have this problem but other systems with EXEC_BACKEND do.
Don't remember the details though.

Generally, extension code can't use bgw_main safely, and must use
bgw_library_name and bgw_function_name. But bgw_main is supposedly
safe for core code. If it's not even safe there, then I guess we
should remove it entirely as a useless foot-gun.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#8)
Re: logical replication launcher crash on buildfarm

On 2017-03-16 09:27:59 -0400, Robert Haas wrote:

On Thu, Mar 16, 2017 at 5:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

Hmm now that you mention it, I remember discussing something similar
with you last year in Dallas in regards to parallel query. IIRC Windows
should not have this problem but other systems with EXEC_BACKEND do.
Don't remember the details though.

Generally, extension code can't use bgw_main safely, and must use
bgw_library_name and bgw_function_name. But bgw_main is supposedly
safe for core code.

I indeed think it's not safe, and it's going to get less and less safe
on Windows (or EXEC_BACKEND). I don't think we can afford to disable
ASLR in the long run (I indeed suspect that'll just be disallowed at some
point), and that's the only thing making it safe-ish in combination with
EXEC_BACKEND.

If it's not even safe there, then I guess we should remove it entirely
as a useless foot-gun.

I indeed think that's the right consequence. One question is what to
replace it with exactly - are we guaranteed we can dynamically look up
symbols by name in the main binary on every platform? Alternatively we
can just hardcode a bunch of bgw_function_name values that are matched
to specific functions if bgw_library_name is NULL - I suspect that'd be
the easiest / least worrisome portability-wise.
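
A rough sketch of the hardcoded-name variant, assuming made-up names throughout (demo_internal_workers, demo_lookup_internal_worker, and the two demo_*_main functions are hypothetical, and the real entry-point signature differs). The idea is that only the name crosses the process boundary, and each process resolves it against its own copy of the table:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the real worker entry-point type. */
typedef void (*demo_worker_main_type) (void);

/* Hypothetical builtin workers. */
static void demo_launcher_main(void) { }
static void demo_parallel_main(void) { }

/* Name -> function table, compiled into every process image. */
static const struct
{
    const char *fn_name;
    demo_worker_main_type fn_addr;
} demo_internal_workers[] = {
    {"demo_launcher_main", demo_launcher_main},
    {"demo_parallel_main", demo_parallel_main},
};

/*
 * Resolve an entry point by name. A NULL library name means "look in the
 * builtin table"; anything else would go through the normal dynamic loader,
 * which this sketch does not model, so it returns NULL in that case.
 */
demo_worker_main_type
demo_lookup_internal_worker(const char *library_name, const char *function_name)
{
    size_t i;

    assert(function_name != NULL);
    if (library_name != NULL)
        return NULL;

    for (i = 0; i < sizeof(demo_internal_workers) / sizeof(demo_internal_workers[0]); i++)
    {
        if (strcmp(demo_internal_workers[i].fn_name, function_name) == 0)
            return demo_internal_workers[i].fn_addr;
    }
    return NULL;                /* unknown name */
}
```

Since the lookup runs in the process that will call the function, the result is valid no matter where that process image was loaded.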

Greetings,

Andres Freund

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: logical replication launcher crash on buildfarm

On Thu, Mar 16, 2017 at 2:55 PM, Andres Freund <andres@anarazel.de> wrote:

I indeed think it's not safe, and it's going to get less and less safe
on Windows (or EXEC_BACKEND). I don't think we can afford to disable
ASLR in the long run (I indeed suspect that'll just be disallowed at some
point), and that's the only thing making it safe-ish in combination with
EXEC_BACKEND.

Ugh.

If it's not even safe there, then I guess we should remove it entirely
as a useless foot-gun.

I indeed think that's the right consequence. One question is what to
replace it with exactly - are we guaranteed we can dynamically look up
symbols by name in the main binary on every platform?

I don't know the answer to that question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#9)
Re: logical replication launcher crash on buildfarm

On 3/16/17 14:55, Andres Freund wrote:

I indeed think that's the right consequence. One question is what to
replace it with exactly - are we guaranteed we can dynamically look up
symbols by name in the main binary on every platform?

I think there is probably a way to do this on all platforms. But it
seems that at least the Windows port of pg_dlopen would need to be
updated to support this.

Alternatively we
can just hardcode a bunch of bgw_function_name values that are matched
to specific functions if bgw_library_name is NULL - I suspect that'd be
the easiest / least worrisome portability-wise.

Basically a variant of fmgrtab, which addresses the same sort of problem.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12 Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#7)
Re: logical replication launcher crash on buildfarm

On 2017-03-16 10:13:37 +0100, Petr Jelinek wrote:

On 16/03/17 09:53, Andres Freund wrote:

This kinda sounds familiar, but a quick google search doesn't find
anything relevant.

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Hmm now that you mention it, I remember discussing something similar
with you last year in Dallas in regards to parallel query. IIRC Windows
should not have this problem but other systems with EXEC_BACKEND do.
Don't remember the details though.

Don't think that's reliable, only works as long as the binary is
compiled without position independent code.

Greetings,

Andres Freund

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: logical replication launcher crash on buildfarm

On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Oh, I didn't realize anybody was looking to me to fix this. I sort of
thought that it was fallout from the logical replication patch and
that Petr or Peter would deal with it. If that's not the case, I'm
not totally unwilling to take a whack at it, but I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#13)
Re: logical replication launcher crash on buildfarm

On 2017-03-27 13:01:11 -0400, Robert Haas wrote:

On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Oh, I didn't realize anybody was looking to me to fix this.

Well, it's borked in 9.6. I'm starting to get annoyed by culicidae's
failures ;)

I sort of thought that it was fallout from the logical replication
patch and that Petr or Peter would deal with it. If that's not the
case, I'm not totally unwilling to take a whack at it, but I don't
have much personal enthusiasm for trying to figure out how to make
dynamic loading on the postgres binary itself work everywhere, so if
it falls to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

I'm all for that approach - there seems very little upside in the
dynamic loading approach. Just defining a bgw_entry_points[enum
BuiltinBGWorkerType] -> bgworker_main_type array seems to be simple
enough - it's not like we're going to add new types of builtin bgworkers
at runtime.
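
Roughly what I have in mind, as a sketch only (the Demo* / demo_* names and the enum members are invented for illustration, and the real bgworker_main_type takes an argument that is left out here). The registration would carry the enum value, and each process indexes its own copy of the array:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for bgworker_main_type. */
typedef void (*demo_bgworker_main_type) (void);

/* Hypothetical set of builtin worker kinds, fixed at compile time. */
typedef enum DemoBuiltinBGWorkerType
{
    DEMO_BGW_LOGICAL_LAUNCHER = 0,
    DEMO_BGW_PARALLEL_WORKER,
    DEMO_BGW_NUM_TYPES
} DemoBuiltinBGWorkerType;

/* Records which worker ran last, so the sketch has an observable effect. */
static int demo_last_started = -1;

static void demo_launcher_main(void) { demo_last_started = DEMO_BGW_LOGICAL_LAUNCHER; }
static void demo_parallel_main(void) { demo_last_started = DEMO_BGW_PARALLEL_WORKER; }

/* enum -> entry point; resolved per process, so load addresses don't matter. */
static const demo_bgworker_main_type demo_bgw_entry_points[DEMO_BGW_NUM_TYPES] = {
    [DEMO_BGW_LOGICAL_LAUNCHER] = demo_launcher_main,
    [DEMO_BGW_PARALLEL_WORKER] = demo_parallel_main,
};

/* Run the builtin worker of the given type; returns -1 for an unknown type. */
int
demo_start_builtin_worker(int type)
{
    if (type < 0 || type >= DEMO_BGW_NUM_TYPES)
        return -1;
    assert(demo_bgw_entry_points[type] != NULL);
    demo_bgw_entry_points[type] ();
    return 0;
}
```

Only the small integer has to survive the trip through shared memory or the exec, which is the whole point compared to passing the pointer itself.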

- Andres

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: logical replication launcher crash on buildfarm

Robert Haas <robertmhaas@gmail.com> writes:

... I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

+1. This seems like no time to be buying into brand new portability
requirements without a very pressing need to do so; and this patch
doesn't appear to create one.

regards, tom lane

#16 Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#13)
Re: logical replication launcher crash on buildfarm

On 27/03/17 19:01, Robert Haas wrote:

On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Oh, I didn't realize anybody was looking to me to fix this. I sort of
thought that it was fallout from the logical replication patch and
that Petr or Peter would deal with it. If that's not the case, I'm
not totally unwilling to take a whack at it, but I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

It affects parallel workers the same way. I'll write a patch for HEAD soon,
and for 9.6 probably with some delay. I'll expand the InternalBgWorkers thing
that was added with logical replication to handle this in a fashion similar
to how we do fmgrtab.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#16)
Re: logical replication launcher crash on buildfarm

On 28/03/17 03:31, Petr Jelinek wrote:

On 27/03/17 19:01, Robert Haas wrote:

On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Oh, I didn't realize anybody was looking to me to fix this. I sort of
thought that it was fallout from the logical replication patch and
that Petr or Peter would deal with it. If that's not the case, I'm
not totally unwilling to take a whack at it, but I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

It affects parallel workers the same way. I'll write a patch for HEAD soon,
9.6 probably with some delay. I'll expand the InternalBgWorkers thing
that was added with logical replication to handle this in a similar
fashion to how we do fmgrtab.

Btw, now that I look at the code, I guess we'll want to get rid of
bgw_main completely in HEAD, given that we can't guarantee it will be
valid even for shared_preload_libraries libraries. For older branches I
would leave things as they are in this regard, as there doesn't seem to be
any immediate issue for standard binaries.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#17)
Re: logical replication launcher crash on buildfarm

On 2017-03-28 03:47:50 +0200, Petr Jelinek wrote:

On 28/03/17 03:31, Petr Jelinek wrote:

On 27/03/17 19:01, Robert Haas wrote:

On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thread)?

Oh, I didn't realize anybody was looking to me to fix this. I sort of
thought that it was fallout from the logical replication patch and
that Petr or Peter would deal with it. If that's not the case, I'm
not totally unwilling to take a whack at it, but I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

It affects parallel workers the same way. I'll write a patch for HEAD soon,
9.6 probably with some delay. I'll expand the InternalBgWorkers thing
that was added with logical replication to handle this in a similar
fashion to how we do fmgrtab.

Btw, now that I look at the code, I guess we'll want to get rid of
bgw_main completely in HEAD, given that we can't guarantee it will be
valid even for shared_preload_libraries libraries. For older branches I
would leave things as they are in this regard, as there doesn't seem to be
any immediate issue for standard binaries.

As long as you fix it so culicidae is happy (in 9.6) ;). I think it's
fine to just introduce bgw_builtin_id or such, and leave the bgw_main
code in place in < HEAD.

Greetings,

Andres Freund


#19Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#18)
Re: logical replication launcher crash on buildfarm

On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:

Btw, now that I look at the code, I guess we'll want to get rid of
bgw_main completely in HEAD, given that we can't guarantee it will be
valid even for shared_preload_libraries libraries. For older branches I
would leave things as they are in this regard, as there doesn't seem to be
any immediate issue for standard binaries.

As long as you fix it so culicidae is happy (in 9.6) ;). I think it's
fine to just introduce bgw_builtin_id or such, and leave the bgw_main
code in place in < HEAD.

I wasn't thinking of introducing bgw_builtin_id. My idea was just
along the lines of

    if (bgw_library_name == NULL && bgw_function_name != NULL)
    {
        if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
            ParallelQueryMain(blah);
        else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
            LogicalReplicationMain(blah);
    }

I think something like that is certainly better for the back-branches,
because it doesn't cause an ABI break. But I think it would also be
fine for master.
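
To make the idea concrete, here is a self-contained sketch of that kind of hard-coded dispatch. It is only an approximation under stated assumptions: DemoWorker is a hypothetical, reduced stand-in for the real worker struct (which holds these names differently), dispatch_builtin and last_run are invented for the example, the stubs take a plain int rather than the real argument type, and a NULL library name is assumed to mean "built into the postgres binary". Because it only reinterprets fields that already exist, nothing in the struct layout has to change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the worker registration struct. */
typedef struct DemoWorker
{
	const char *bgw_library_name;	/* NULL is assumed to mean "built in" */
	const char *bgw_function_name;
} DemoWorker;

/* Records which stub ran, purely so the sketch can be checked. */
static const char *last_run = NULL;

/* Placeholder stubs standing in for the built-in entry points. */
static void ParallelQueryMain(int arg) { (void) arg; last_run = "ParallelQueryMain"; }
static void LogicalReplicationMain(int arg) { (void) arg; last_run = "LogicalReplicationMain"; }

/*
 * Dispatch to a built-in entry point by comparing the stored name.
 * Returns 1 if a built-in was run, 0 if the caller must resolve the
 * function some other way (for example, from an external library).
 */
static int
dispatch_builtin(const DemoWorker *w, int arg)
{
	assert(w != NULL);

	if (w->bgw_library_name == NULL && w->bgw_function_name != NULL)
	{
		if (strcmp(w->bgw_function_name, "ParallelQueryMain") == 0)
		{
			ParallelQueryMain(arg);
			return 1;
		}
		else if (strcmp(w->bgw_function_name, "LogicalReplicationMain") == 0)
		{
			LogicalReplicationMain(arg);
			return 1;
		}
	}
	return 0;
}
```

In this sketch a worker with a library name set, or with an unrecognized function name, is simply reported as not handled, so existing external workers would keep going through whatever path they use today.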

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#19)
Re: logical replication launcher crash on buildfarm

Robert Haas <robertmhaas@gmail.com> writes:

I wasn't thinking of introducing bgw_builtin_id. My idea was just
along the lines of

    if (bgw_library_name == NULL && bgw_function_name != NULL)
    {
        if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
            ParallelQueryMain(blah);
        else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
            LogicalReplicationMain(blah);
    }

I think something like that is certainly better for the back-branches,
because it doesn't cause an ABI break. But I think it would also be
fine for master.

That seems perfectly reasonable from here: surely the cost of a couple
of strcmp's is trivial in comparison to a process launch.

We can redesign the API whenever this way starts getting unwieldy,
but that's likely to be quite some time away.

regards, tom lane


#21Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#21)
#23Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#22)
#24Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#23)
#25Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#25)
#27Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#27)
#29Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#29)
#32Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#30)
#35Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#34)
#36Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#34)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#36)
#39Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)