BUG #15367: Crash in pg_fe_scram_free when using foreign tables
The following bug has been logged on the website:
Bug reference: 15367
Logged by: Jeremy Evans
Email address: code@jeremyevans.net
PostgreSQL version: 10.5
Operating system: OpenBSD 6.3 amd64
Description:
One of the postgres processes in our production environment crashed with the
following backtrace:
(gdb) bt
#0 thrkill () at -:3
#1 0x000004a14ca47fbe in _libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:51
#2 0x000004a14c9fd029 in wrterror (d=Variable "d" is not available.
) at /usr/src/lib/libc/stdlib/malloc.c:288
#3 0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not
available.
) at /usr/src/lib/libc/stdlib/malloc.c:1298
#4 0x000004a14c9fd109 in free (ptr=0x4a115aec398) at
/usr/src/lib/libc/stdlib/malloc.c:1416
#5 0x000004a09a7301e8 in pg_fe_scram_free () from
/usr/local/lib/libpq.so.6.10
#6 0x000004a09a73561f in pqPacketSend () from
/usr/local/lib/libpq.so.6.10
#7 0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10
#8 0x000004a074dc5549 in GetConnection () from
/usr/local/lib/postgresql/postgres_fdw.so
#9 0x000004a074dbc02a in postgresBeginForeignScan () from
/usr/local/lib/postgresql/postgres_fdw.so
#10 0x0000049e62eab412 in ExecInitForeignScan () from
/usr/local/bin/postgres
#11 0x0000049e62e89ace in ExecInitNode () from /usr/local/bin/postgres
#12 0x0000049e62e9aae8 in ExecInitHashJoin () from /usr/local/bin/postgres
#13 0x0000049e62e89b3e in ExecInitNode () from /usr/local/bin/postgres
#14 0x0000049e62ea7a06 in ExecInitSort () from /usr/local/bin/postgres
#15 0x0000049e62e89aee in ExecInitNode () from /usr/local/bin/postgres
#16 0x0000049e62e85379 in standard_ExecutorStart () from
/usr/local/bin/postgres
#17 0x0000049e62fc4ffb in PortalStart () from /usr/local/bin/postgres
#18 0x0000049e62fc43d5 in exec_simple_query () from
/usr/local/bin/postgres
#19 0x0000049e62fc250b in PostgresMain () from /usr/local/bin/postgres
#20 0x0000049e62f473c5 in PostmasterMain () from /usr/local/bin/postgres
#21 0x0000049e62ec986b in main () from /usr/local/bin/postgres
The PostgreSQL logfile showed:
postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was
terminated by signal 6: Abort trap
We are using foreign tables to other databases in the same cluster, and all
databases are set to require SCRAM authentication. We've been using SCRAM
authentication for about 2 months, and foreign tables for a few years
without problems.
PostgreSQL version: PostgreSQL 10.5 on x86_64-unknown-openbsd6.3, compiled
by OpenBSD clang version 5.0.1 (tags/RELEASE_501/final) (based on LLVM
5.0.1), 64-bit
PostgreSQL was installed via OpenBSD ports, using the 6.3-stable branch.
PostgreSQL configuration file changes:
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
lc_messages | C | configuration file
lc_monetary | C | configuration file
lc_numeric | C | configuration file
lc_time | C | configuration file
listen_addresses | * | configuration file
log_timezone | US/Pacific | configuration file
max_connections | 100 | configuration file
max_stack_depth | 2MB | environment variable
password_encryption | scram-sha-256 | configuration file
shared_buffers | 128MB | configuration file
ssl | on | configuration file
TimeZone | US/Pacific | configuration file
Operating system: OpenBSD 6.3 amd64, with OpenBSD syspatches up to 016:
OpenBSD bethpage.bsa.ca.gov 6.3 GENERIC.MP#8 amd64
As the backtrace shows the error was during the closing of a connection,
this appears to be unrelated to any specific query. The backtrace also
shows withis is due to the use of foreign tables, and thus likely
independent of what client program was used to connect. In case it does
matter, it was using libpq via the ruby-pg library, and no connection pool
or load balancer was being used.
Nothing unusual in the PostgreSQL logs other than the the crash and the
typical crash recovery.
Hardware is a HP DL380p Gen8 server with 256GB of ECC memory and 2 800GB
SSDs in hardware RAID-1, with 2 Xeon E5-2670 CPUs. Due to the crash, this is
unlikely to be disk/RAID related. General memory corruption is also
unlikely as the memory is ECC.
If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again. However,
considering that the crash is rare in my environment, it's unlikely I will
be able to produce a better backtrace for the error quickly.
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
One of the postgres processes in our production environment crashed with the
following backtrace:
(gdb) bt
#0 thrkill () at -:3
#1 0x000004a14ca47fbe in _libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:51
#2 0x000004a14c9fd029 in wrterror (d=Variable "d" is not available.
) at /usr/src/lib/libc/stdlib/malloc.c:288
#3 0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not
available.
) at /usr/src/lib/libc/stdlib/malloc.c:1298
#4 0x000004a14c9fd109 in free (ptr=0x4a115aec398) at
/usr/src/lib/libc/stdlib/malloc.c:1416
#5 0x000004a09a7301e8 in pg_fe_scram_free () from
/usr/local/lib/libpq.so.6.10
#6 0x000004a09a73561f in pqPacketSend () from
/usr/local/lib/libpq.so.6.10
#7 0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10
#8 0x000004a074dc5549 in GetConnection () from
/usr/local/lib/postgresql/postgres_fdw.so
Hmm. Unfortunately, I don't think I believe this backtrace. For one
thing, there's no direct pathway from pqPacketSend to pg_fe_scram_free.
Also, if we ignore that and figure that somehow pg_fe_scram_free is
doing a duplicate free, it's still really hard to see how that would
happen.
Given that you didn't have debug symbols installed, it's not very
surprising for the trace to be misleading. Probably, what we're seeing
here is the nearest exported functions' names rather than where control
really went.
If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again.
Please. I don't doubt that there's an issue here, but we're going to
need a higher-quality fix on where it is before we can do much.
regards, tom lane
On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again. However,
considering that the crash is rare in my environment, it's unlikely I will
be able to produce a better backtrace for the error quickly.
That would be nice. From what I can see this would be a race condition,
which is not obvious by looking at the code. Testing with a two-node
deployment where the first node has a foreign table which connects to a
second node, using SCRAM authentication, holding the physical table,
then doing many foreign scans across many clients don't show any
problem. Did libpq complain at some point in the session where the
crash happened about any error?
--
Michael
On 09/06 02:58, Michael Paquier wrote:
On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again. However,
considering that the crash is rare in my environment, it's unlikely I will
be able to produce a better backtrace for the error quickly.That would be nice. From what I can see this would be a race condition,
which is not obvious by looking at the code. Testing with a two-node
deployment where the first node has a foreign table which connects to a
second node, using SCRAM authentication, holding the physical table,
then doing many foreign scans across many clients don't show any
problem. Did libpq complain at some point in the session where the
crash happened about any error?
The PostgreSQL logfile only shows:
postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was terminated by signal 6: Abort trap
2018-09-06 12:01:52.202 PDT [45953] DETAIL: Failed process was running: ...
2018-09-06 12:01:52.202 PDT [45953] LOG: terminating any other active server processes
If there is another place I should look, please let me know. The log
files of the client process don't show anything during the crash,
probably because the client libpq connection was just dropped when the
server process crashed. After the crash, other client libpq connections
show the following, which is probably expected:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
I'll try to install a version with debug symbols on September 14, and
if it crashes again I'll respond with a more complete and accurate
backtrace.
Thanks,
Jeremy
On 09/06 03:28, Jeremy Evans wrote:
On 09/06 02:58, Michael Paquier wrote:
On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again. However,
considering that the crash is rare in my environment, it's unlikely I will
be able to produce a better backtrace for the error quickly.That would be nice. From what I can see this would be a race condition,
which is not obvious by looking at the code. Testing with a two-node
deployment where the first node has a foreign table which connects to a
second node, using SCRAM authentication, holding the physical table,
then doing many foreign scans across many clients don't show any
problem. Did libpq complain at some point in the session where the
crash happened about any error?The PostgreSQL logfile only shows:
postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was terminated by signal 6: Abort trap
2018-09-06 12:01:52.202 PDT [45953] DETAIL: Failed process was running: ...
2018-09-06 12:01:52.202 PDT [45953] LOG: terminating any other active server processesIf there is another place I should look, please let me know. The log
files of the client process don't show anything during the crash,
probably because the client libpq connection was just dropped when the
server process crashed. After the crash, other client libpq connections
show the following, which is probably expected:WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.I'll try to install a version with debug symbols on September 14, and
if it crashes again I'll respond with a more complete and accurate
backtrace.
We experienced an almost identical crash this morning. The query was
different, but the backtrace was almost the same, and the query was
using a foreign table with SCRAM authentication, just like the one
yesterday.
One thing that was similar between the two crashes is that shortly
before both crashes, we were testing database changes on different
databases in the same cluster, different from both the postgres
process that crashed (the client of the foreign table scan) and the
postgres process executing the foreign table scan.
The database changes mostly consisted of the following statement types:
DROP TABLE
DROP SCHEMA
DROP FUNCTION
CREATE FUNCTION
CREATE SCHEMA
CREATE TABLE
CREATE INDEX
CREATE TRIGGER
INSERT
GRANT
We only recently started testing these database changes in this cluster
yesterday. Based on the timing, I'm guessing this issue only occurs
when system table changes are being made.
I hadn't yet had time to install debug symbols on the production server,
but since I think I have a better idea on how to recreate this issue, I
will try recreating this on a test cluster with debug symbols.
Thanks,
Jeremy
On 09/07 09:51, Jeremy Evans wrote:
I hadn't yet had time to install debug symbols on the production server,
but since I think I have a better idea on how to recreate this issue, I
will try recreating this on a test cluster with debug symbols.
Thankfully, I was able to recreate this on a test server with debug
symbols, by using \watch in psql to run the foreign query while running
the database changes on a separate database in a loop.
Here's the backtrace:
#1 0x000017c254721a8e in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51
#2 0x000017c2547303b9 in wrterror (d=Variable "d" is not available.) at /usr/src/lib/libc/stdlib/malloc.c:288
#3 0x000017c2547306df in ofree (argpool=Variable "argpool" is not available.) at /usr/src/lib/libc/stdlib/malloc.c:1298
#4 0x000017c254730499 in free (ptr=0x17c1ee153398) at /usr/src/lib/libc/stdlib/malloc.c:1416
#5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
#6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
#7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710
#8 0x000017c20a6ba230 in PQfinish (conn=0x17c1dac61800) at fe-connect.c:3731
#9 0x000017c2c276a838 in disconnect_pg_server (entry=0x17c25b118658) at connection.c:312
#10 0x000017c2c2769dba in GetConnection (user=0x17c2c2285c50, will_prep_stmt=0 '\0') at connection.c:169
#11 0x000017c2c275c597 in postgresBeginForeignScan (node=0x17c2d8b18028, eflags=16) at postgres_fdw.c:1325
#12 0x000017bfd91aaa24 in ExecInitForeignScan (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at nodeForeignscan.c:231
#13 0x000017bfd9177f9a in ExecInitNode (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:277
#14 0x000017bfd9190f84 in ExecInitHashJoin (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at nodeHashjoin.c:433
#15 0x000017bfd917800a in ExecInitNode (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:300
#16 0x000017bfd91a4ae0 in ExecInitSort (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at nodeSort.c:207
#17 0x000017bfd9178042 in ExecInitNode (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:313
#18 0x000017bfd917114b in InitPlan (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:1045
#19 0x000017bfd9170920 in standard_ExecutorStart (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:264
#20 0x000017bfd91706ee in ExecutorStart (queryDesc=0x17c1e8fce038, eflags=0) at execMain.c:152
#21 0x000017bfd935a604 in PortalStart (portal=0x17c1dac5f038, params=0x0, eflags=0, snapshot=0x0) at pquery.c:520
#22 0x000017bfd935682d in exec_simple_query ( query_string=0x17c1dbf45038 "SELECT "...) at postgres.c:1060
#23 0x000017bfd9355d13 in PostgresMain (argc=1, argv=0x17c2460a1180, dbname=0x17c2528b2ff8 "bsawww", username=0x17c2528b2fc8 "recommendations_public") at postgres.c:4088
#24 0x000017bfd92adb29 in BackendRun (port=0x17c2bf214800) at postmaster.c:4405
#25 0x000017bfd92aceb3 in BackendStartup (port=0x17c2bf214800) at postmaster.c:4077
#26 0x000017bfd92abf3a in ServerLoop () at postmaster.c:1755
#27 0x000017bfd92a9b46 in PostmasterMain (argc=3, argv=0x7f7ffffc1c58) at postmaster.c:1363
#28 0x000017bfd91d6f57 in main (argc=3, argv=0x7f7ffffc1c58) at main.c:228
Hopefully this helps. If you need more information, please let me know.
Thanks,
Jeremy
On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote:
Here's the backtrace:
#5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
#6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
#7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710Hopefully this helps. If you need more information, please let me know.
Thanks. This fails when freeing the password field in fe_scram_state.
I have a question: does your password use non-ASCII characters which
could make SASLprep to be run on the password string? The result string
from pg_saslprep allocates a new string for any results returned by
reading the code, I am wondering if we could be missing something..
--
Michael
On 09/07 11:15, Michael Paquier wrote:
On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote:
Here's the backtrace:
#5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
#6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
#7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710Hopefully this helps. If you need more information, please let me know.
Thanks. This fails when freeing the password field in fe_scram_state.
I have a question: does your password use non-ASCII characters which
could make SASLprep to be run on the password string? The result string
from pg_saslprep allocates a new string for any results returned by
reading the code, I am wondering if we could be missing something..
No, the user mapping password is comprised of hex-ascii characters: [0-9a-f]{16}
In case it helps, here's the content of opaq from gdb (with the specific
strings cleared).
print *((fe_scram_state *) opaq)
{
state = FE_SCRAM_FINISHED,
username = 0x17c267526380 "...",
password = 0x17c1ee153398 <Address 0x17c1ee153398 out of bounds>,
SaltedPassword = 0x17c223712018 "...",
client_nonce = 0x17c267529f60 "...",
client_first_message_bare = 0x17c223711da0 "...",
client_final_message_without_proof = 0x17c271ca19c0 "...",
server_first_message = 0x17c1ea48a400 "...",
salt = 0x17c2675261c0 "...",
saltlen = 16,
iterations = 4096,
nonce = 0x17c2179e9880 "...",
server_final_message = 0x17c271ca1a40 "...",
ServerSignature = 0x17c223712078 "..."
}
Thanks,
Jeremy
Michael Paquier <michael@paquier.xyz> writes:
Thanks. This fails when freeing the password field in fe_scram_state.
I had been studying that when the report first arrived. I'm not really
happy with the coding in pg_saslprep(): it leaves garbage in its output
parameter in some code paths, and I think it leaks memory in others.
But I couldn't see a path that would lead to the observed failure.
regards, tom lane
On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote:
I had been studying that when the report first arrived. I'm not really
happy with the coding in pg_saslprep(): it leaves garbage in its output
parameter in some code paths, and I think it leaks memory in others.
Agreed. Do you want to give it a go? I can see some of those code
paths? Likely you spotted more.
But I couldn't see a path that would lead to the observed failure.
Agreed.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote:
I had been studying that when the report first arrived. I'm not really
happy with the coding in pg_saslprep(): it leaves garbage in its output
parameter in some code paths, and I think it leaks memory in others.
Agreed. Do you want to give it a go? I can see some of those code
paths? Likely you spotted more.
Looking closer, there are no leaks, though I think we could use a comment
about that. And it still makes me itch that pg_saslprep (mostly) doesn't
worry about clearing *output on failure while only three of its four
callers bother to initialize their variables to null. That's a recipe
for future bugs. So I propose the attached cleanup.
We're still no closer to an explanation of Jeremy's failure, though
I'm now pretty sure that pg_saslprep itself isn't the issue.
regards, tom lane
Attachments:
make-pg-saslprep-more-bulletproof.patchtext/x-diff; charset=us-ascii; name=make-pg-saslprep-more-bulletproof.patchDownload+11-6
I wrote:
We're still no closer to an explanation of Jeremy's failure, though
I'm now pretty sure that pg_saslprep itself isn't the issue.
I had an idea about that --- it's probably all wet, but the code as
written seems bulletproof enough that I'm forced to postulate something
very strange is happening.
Observe that there are two copies of pg_saslprep() in play: there is one
in the backend, which is compiled to allocate its result with palloc,
and there is one in libpq, which is compiled to allocate its result
with malloc. Could it be that somehow, when libpq is loaded into the
backend address space as it is here, libpq winds up calling the backend's
copy of pg_saslprep rather than its own? That would work just fine,
until libpq tried to free the returned string using free(), and then
we'd get exactly the reported error.
The main weakness in this theory is that it suggests that Jeremy's
postgres_fdw connections ought to be falling over more easily than
they are. However, I think that postgres_fdw will never explicitly
close a PGconn unless it's forced to; during a normal backend session
exit, the process just dies without going through PQfinish, so that
the problem would be masked. The only way to get to the PQfinish
call shown in the backtrace is for pgfdw_inval_callback to mark the
connection invalid, which'd require either an update on the relevant
foreign server object, an update on the user mapping in use, or a
SI cache reset. That explains how heavy DDL activity in an apparently
unrelated database can trigger the problem: it eventually results in
a SI message queue overrun and ensuing cache reset. In the absence
of SI cache resets, maybe indeed the problem is rare even if the
pg_saslprep result is misallocated every time.
Not sure about a good way to test this theory. Need more caffeine.
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Could it be that somehow, when libpq is loaded into the backend
Tom> address space as it is here, libpq winds up calling the backend's
Tom> copy of pg_saslprep rather than its own?
Yes, that is a thing that happens with ELF-style shared libs when you
dynamically load shared libraries that weren't linked with -Bsymbolic.
Is this potentially also a problem for libpqwalreceiver, which also
loads libpq?
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Could it be that somehow, when libpq is loaded into the backend
Tom> address space as it is here, libpq winds up calling the backend's
Tom> copy of pg_saslprep rather than its own?
Yes, that is a thing that happens with ELF-style shared libs when you
dynamically load shared libraries that weren't linked with -Bsymbolic.
If true, that's potentially pretty disastrous, because *every* src/port
or src/common function that's used by libpq would be similarly subject
to replacement by its backend version, and every behavioral difference
between the frontend and backend coding would be a source of trouble.
However, the info I can find about this suggests that it's only relevant
for exported functions, not ones that are hidden by means of a
version-script. Since libpq doesn't export pg_saslprep, this shouldn't
be happening?
But wait, this problem is being reported on OpenBSD. I wonder whether
their linker behaves anything like Linux's ... hm, it looks like it
*is* the same linker.
I tried setting up a scram-auth-based postgres_fdw connection on
my RHEL6 box and forcing it through the questionable code path
by issuing a dummy ALTER USER MAPPING command from another session.
It worked fine, so the problem doesn't seem to be present on Linux.
Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom!
It fails exactly as Jeremy describes. Furthermore, I can confirm
that the attached patch fixes it.
I wonder which other platforms we have that behave like this?
And how come we've not seen symptoms before? It's hard to believe
that src/common/saslprep.c is the only thing libpq imports
that has FRONTEND-vs-not-FRONTEND behavior differences.
regards, tom lane
Attachments:
openbsd-link.patchtext/x-diff; charset=us-ascii; name=openbsd-link.patchDownload+5-0
I wrote:
I tried setting up a scram-auth-based postgres_fdw connection on
my RHEL6 box and forcing it through the questionable code path
by issuing a dummy ALTER USER MAPPING command from another session.
It worked fine, so the problem doesn't seem to be present on Linux.
Oh, I meant to show the test I used, for the benefit of anyone who
wants to test on other platforms. Do this (with adjustments to
taste) to set up:
set password_encryption = "scram-sha-256" ;
create user u1 password 'foobar';
create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw
options (host 'localhost', dbname 'postgres');
create user mapping for u1 server loopback
options (user 'u1', password 'foobar');
grant usage on foreign server loopback to u1;
\c - u1
create table remote (f1 int);
insert into remote values(1),(2);
create foreign table localt (f1 int) server loopback
options (table_name 'remote');
At this point you should be able to do
select * from localt;
as u1; if not, adjust pg_hba.conf to require a password on localhost
connections. Once the select works, *leave that session open*, and
in another session do any no-op ALTER on the user mapping, such as
alter user mapping for u1 server loopback options (set user 'u1');
Now, in the first session, repeat
select * from localt;
It'll dump core with a backtrace similar to what Jeremy showed,
if the bug is present.
regards, tom lane
On Sat, Sep 08, 2018 at 07:56:23AM -0400, Tom Lane wrote:
Looking closer, there are no leaks, though I think we could use a comment
about that. And it still makes me itch that pg_saslprep (mostly) doesn't
worry about clearing *output on failure while only three of its four
callers bother to initialize their variables to null. That's a recipe
for future bugs. So I propose the attached cleanup.
Okay, that looks acceptable to me. Thanks for looking at that.
--
Michael
On Sat, Sep 08, 2018 at 03:02:00PM -0400, Tom Lane wrote:
I tried setting up a scram-auth-based postgres_fdw connection on
my RHEL6 box and forcing it through the questionable code path
by issuing a dummy ALTER USER MAPPING command from another session.
It worked fine, so the problem doesn't seem to be present on Linux.Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom!
It fails exactly as Jeremy describes. Furthermore, I can confirm
that the attached patch fixes it.
That's not good! I have tried the report on Linux only, with pgbench -C
doing many foreign scans through postgres_fdw which connects to the
remote using SCRAM authentication, and this was stable. Anything newer
than ec136d19 is impacted. I am surprised this has not been seen
earlier to be honest.
I wonder which other platforms we have that behave like this?
And how come we've not seen symptoms before? It's hard to believe
that src/common/saslprep.c is the only thing libpq imports
that has FRONTEND-vs-not-FRONTEND behavior differences.
libpq is quite particular in the fact that it compiles directly with
stuff in src/common, src/port, etc. I think that the same problem
exists for pg_md5_encrypt(). So I like the solution proposed. I guess
you have made sure that postgres_fdw becomes stable this way? Have you
tried to stress it more with pgbench -C doing only foreign scans and
dozens of clients running in parallel?
--
Michael
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> However, the info I can find about this suggests that it's only
Tom> relevant for exported functions, not ones that are hidden by means
Tom> of a version-script. Since libpq doesn't export pg_saslprep, this
Tom> shouldn't be happening?
Tom> But wait, this problem is being reported on OpenBSD. I wonder
Tom> whether their linker behaves anything like Linux's ... hm, it
Tom> looks like it *is* the same linker.
But the makefile is not doing the same thing. SHLIB_EXPORTS looks like
it's being ignored on every platform except linux, macos and windows in
Makefile.shlib.
--
Andrew (irc:RhodiumToad)
I wrote:
I wonder which other platforms we have that behave like this?
And how come we've not seen symptoms before? It's hard to believe
that src/common/saslprep.c is the only thing libpq imports
that has FRONTEND-vs-not-FRONTEND behavior differences.
After groveling through all the FRONTEND conditionals in those
directories, it seems like indeed this is the first really obvious problem
that would ensue from libpq invoking the backend version of some common
function. There are some minor pecadilloes:
* "could not determine encoding" warning in chklocale.c might get printed
via ereport, not fprintf(stderr) as expected; but likely nobody would even
realize that was the wrong thing, and it's a very unlikely message anyway.
* getaddrinfo.c might use non-reentrant gethostbyname, but in the backend
environment that's hardly a problem due to lack of threading.
* thread.c might use non-reentrant strerror, getpwuid, gethostbyname,
but ditto.
* saslprep.c might ereport about password-too-long, which would be bad
(leading to leakage of a partially-built PGconn) but it's probably not
a case anybody would exercise in the field.
* scram_build_verifier might palloc instead of malloc, as might
unicode_norm.c. In some code paths I think that'd be OK because
the caller would also be backend code and would pfree not free,
but I think there are other paths where this is also a crash risk.
Still, these also are issues that are new with SCRAM support.
So the bottom line is that this theory does fit the observed facts.
Hence we need to make sure that version-script filtering happens on every
ELF-based platform, not just the most popular ones. I see that the
FreeBSD and NetBSD stanzas in Makefile.shlib also have "ifdef ELF_SYSTEM"
portions, so probably we need to put the version-script stuff into those
too. Anybody have thoughts about other possibly-affected platforms?
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Is this potentially also a problem for libpqwalreceiver, which also
loads libpq?
The proposed fix would make libpq OK in that context, but I'm now
wondering if libpqwalreceiver itself needs a version-script.
regards, tom lane
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> But wait, this problem is being reported on OpenBSD. I wonder
Tom> whether their linker behaves anything like Linux's ... hm, it
Tom> looks like it *is* the same linker.
But the makefile is not doing the same thing. SHLIB_EXPORTS looks like
it's being ignored on every platform except linux, macos and windows in
Makefile.shlib.
Exactly. See proposed patch.
regards, tom lane