BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Started by PG Bug reporting formover 6 years ago16 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16160
Logged by: Dmitry Uspenskiy
Email address: duspensky@ya.ru
PostgreSQL version: 12.1
Operating system: Any
Description:

LeakSanitizer shows memory leak at the following place:

[ts-1] ==17346==ERROR: LeakSanitizer: detected memory leaks
[ts-1]
[ts-1] Direct leak of 144 byte(s) in 1 object(s) allocated from:
[ts-1] #0 0x563b7f in __interceptor_malloc
/home/duspensky/code/yugabyte-db/thirdparty/build/common/llvm-7.0.1.src/../../../src/llvm-7.0.1.src/projects/comp
iler-rt/lib/asan/asan_malloc_linux.cc:146
[ts-1] #1 0x7f697e6a7ee7 in CRYPTO_malloc
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x71ee7)
[ts-1] #2 0x7f697e746b2e in DH_new_method
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x110b2e)
[ts-1] #3 0x7f697e74609c in dh_cb
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x11009c)
[ts-1] #4 0x7f697e78a1fc in ASN1_item_ex_new
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x1541fc)
[ts-1] #5 0x7f697e78f09a in ASN1_item_ex_d2i
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x15909a)
[ts-1] #6 0x7f697e78f7ba in ASN1_item_d2i
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x1597ba)
[ts-1] #7 0x7f697e7a0ad3 in PEM_read_bio_DHparams
(/home/duspensky/.linuxbrew-yb-build/linuxbrew-20181203T161736/lib/libcrypto.so.1.0.0+0x16aad3)
[ts-1] #8 0xcd75bd in load_dh_buffer
/home/duspensky/code/yugabyte-db/src/postgres/src/backend/libpq/../../../../../../src/postgres/src/backend/libpq/be-secure-openssl.c:848:7
[ts-1] #9 0xcd4d07 in initialize_dh
/home/duspensky/code/yugabyte-db/src/postgres/src/backend/libpq/../../../../../../src/postgres/src/backend/libpq/be-secure-openssl.c:976:8
[ts-1] #10 0xcd437c in be_tls_init
/home/duspensky/code/yugabyte-db/src/postgres/src/backend/libpq/../../../../../../src/postgres/src/backend/libpq/be-secure-openssl.c:198:7
[ts-1] #11 0xeecd86 in PostmasterMain
/home/duspensky/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:981:10
[ts-1] #12 0xcd79e3 in PostgresServerProcessMain
/home/duspensky/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:3
[ts-1] #13 0xcd8081 in main
(/home/duspensky/code/yugabyte-db/build/asan-clang-dynamic-ninja/postgres/bin/postgres+0xcd8081)

According to the following information

https://wiki.openssl.org/index.php/Diffie-Hellman_parameters

DH_free function must be called after SSL_CTX_set_tmp_dh

#2Michael Paquier
michael@paquier.xyz
In reply to: PG Bug reporting form (#1)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

On Wed, Dec 11, 2019 at 04:22:03PM +0000, PG Bug reporting form wrote:

According to the following information

https://wiki.openssl.org/index.php/Diffie-Hellman_parameters

DH_free function must be called after SSL_CTX_set_tmp_dh

That's not directly mentioned on their docs actually:
https://www.openssl.org/docs/man1.0.2/man3/SSL_CTX_set_tmp_dh.html

But it seems to me that you are right. If I look at the OpenSSL code,
ssl3_ctrl() does a DH_free() on error when going though
SSL_CTRL_SET_TMP_DH, and the code copies the DH directly using
DHparams_dup() so we don't need to keep any reference to it in our
code.

One more disturbing issue is that we can would accumulate garbage if
we keep reloading the SSL context in the postmaster. For this reason,
it could justify a backpatch down to the point where SSL parameters
are reloadable. On the other hand, the leak is small, so my take
is actually to just fix HEAD and call it a day.

Attached is a patch, I'll go commit that if there are no objections.
The DH handling does not really change regarding the way it gets
free'd or not down to 0.9.8.
--
Michael

Attachments:

bessl-dh-free.patchtext/x-diff; charset=us-asciiDownload+2-0
#3Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#2)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

On Fri, Dec 13, 2019 at 03:39:15PM +0900, Michael Paquier wrote:

Attached is a patch, I'll go commit that if there are no objections.
The DH handling does not really change regarding the way it gets
free'd or not down to 0.9.8.

And committed. Dmitry has pointed out offline that we need to do the
same with the error code path, and he is right as OpenSSL does not
touch the passed-in DH information for 0.9.8~.
--
Michael

#4Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Michael Paquier (#3)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

We ran into this memory leak on PG11 in production. The lea was determined
to be the root cause of OOM errors we were seeing. There was a combination
of a things that caused this leak to become serious enough for these OOM
errors to happen:

1. Very frequent SIGHUPs (every minute). Which causes this memory leak to
cumulatively leak significant amounts of memory over the course of a few
months (MBs instead of KBs)
2. A semi high number of connections that the workload had open (~150
connections). Each of these connections would start with the cumulative
memory leaked as copy-on-write memory. This multiplied the memory leak to
cause multiple GBs of copy-on-write memory.
3. We run Linux with vm.overcommit_memory=2. This causes copy-on-write
memory that isn't changed to effectively count towards allocated memory.

To clarify the context a bit more if you're not familiar with the details
of vm.overcommit_memory: There's "used" memory and "commited_as" memory.
The copy-on-write memory in all backends is counted towards "commited_as"
memory. "used" memory does not increase for every backend, because it's
copy-on-write and none of the backends write to this memory (since it's
leaked so there's no live pointer to it).

Linux puts a hard limit on commited_as, because we use
vm.overcommit_memory=2 (which means memory overcommitting is disabled). If
we had memory overcommiting enabled, then this memory leak wouldn't be a
real problem. The amount of "used" memory is pretty much negligable. It
only becomes a problem, because it's commited_as is multiplied for every
process and we care about commited_as because of disabled overcommiting.

It would be great if this could be backpatched to all currently supported
PG versions. The patch is very small, so it should be very little effort I
think. I'd be happy to help with that if that's useful or needed.

On Tue, 16 Mar 2021 at 16:09, Michael Paquier <michael@paquier.xyz> wrote:

Show quoted text

On Fri, Dec 13, 2019 at 03:39:15PM +0900, Michael Paquier wrote:

Attached is a patch, I'll go commit that if there are no objections.
The DH handling does not really change regarding the way it gets
free'd or not down to 0.9.8.

And committed. Dmitry has pointed out offline that we need to do the
same with the error code path, and he is right as OpenSSL does not
touch the passed-in DH information for 0.9.8~.
--
Michael

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#4)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Jelte Fennema <postgres@jeltef.nl> writes:

We ran into this memory leak on PG11 in production. The lea was determined
to be the root cause of OOM errors we were seeing. There was a combination
of a things that caused this leak to become serious enough for these OOM
errors to happen:

I follow the argument that a leak in the postmaster's SIGHUP processing
could accumulate enough to be a problem. However, how sure are you
really that this specific bug accounts for all of the leakage you saw?

I'm wondering about that because I see some other stuff in be_tls_init()
that looks like it might get leaked, notably the root_cert_list read
from the ssl_ca_file. This code was originally meant to be run exactly
once at postmaster start, so it's not too surprising that it's not as
careful as it now needs to be.

regards, tom lane

#6Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#5)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

I'm pretty sure it was the only cause in this specific case. When running
postgres with valgrind this was the only block that was repeatedly being
leaked.

Originally we also thought that root_cert_list could be the cause. Changing
the size of the root cert did not speed up the memory leak though, so this
was rejected as one of the causes (before we ran postgres with valgrind).
The reason it doesn't leak is that root_cert_list gets added to the SSL_CTX
a little while after creation if everything goes well.

Looking at it again now, I see that if an error occurs when parsing
ssl_crl_file the root_cert_list is in fact leaked. This was easy to
reproduce by specifying a bogus path for ssl_crl_file. Running postgres
again with valgrind then yields this leak (when stopping it after 47
SIGHUPs):

==13061== 30,738 (1,504 direct, 29,234 indirect) bytes in 47 blocks are
definitely lost in loss record 147 of 148
==13061== at 0x4C31B0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13061== by 0x5825F08: CRYPTO_zalloc (build_shared/../crypto/mem.c:230)
==13061== by 0x588702E: OPENSSL_sk_new_reserve
(build_shared/../crypto/stack/stack.c:209)
==13061== by 0x544D644: sk_X509_NAME_new_null
(build_shared/../include/openssl/x509.h:77)
==13061== by 0x544D644: SSL_load_client_CA_file
(build_shared/../ssl/ssl_cert.c:634)
==13061== by 0x3F814D: be_tls_init
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/libpq/be-secure-openssl.c:221)
==13061== by 0x3E65F0: secure_initialize
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/libpq/be-secure.c:77)
==13061== by 0x4973C5: SIGHUP_handler
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/postmaster/postmaster.c:2603)
==13061== by 0x4E5097F: ??? (in /lib/x86_64-linux-gnu/libpthread-2.27.so)
==13061== by 0x6C8FDD6: select
(build/glibc-S9d2JN/glibc-2.27/misc/../sysdeps/unix/sysv/linux/select.c:41)
==13061== by 0x49B50E: ServerLoop
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/postmaster/postmaster.c:1692)
==13061== by 0x49C9F7: PostmasterMain
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/postmaster/postmaster.c:1401)
==13061== by 0x3F947A: main
(home/jelte/.pgenv/src/postgresql-11.10/src/backend/main/main.c:228)

Parsing our ssl_crl_file did not cause errors though, so in our case this
was not the cause.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#6)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Jelte Fennema <postgres@jeltef.nl> writes:

I'm pretty sure it was the only cause in this specific case. When running
postgres with valgrind this was the only block that was repeatedly being
leaked.

Yeah, after doing some testing locally it seems that the non-error path,
at least, is free of additional problems. I set up a shell loop to hit
the postmaster with SIGHUP ten times a second. Looking at v12, an
ssl-enabled postmaster leaks very visibly after a few moments; while
there's no detectable leak in HEAD. So +1 for back-patching the DH_free
fix. (Michael, do you want to do the honors?)

Looking at it again now, I see that if an error occurs when parsing
ssl_crl_file the root_cert_list is in fact leaked. This was easy to
reproduce by specifying a bogus path for ssl_crl_file.

Interesting. While this case doesn't seem likely to pose much of a
practical problem, maybe we should clean it up. I'd already wondered
what was the point of separating the creation and use of the
root_cert_list by so much --- seems like we could install it
immediately after creation.

regards, tom lane

#8Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Hi,

On 2021-03-16 12:31:17 -0400, Tom Lane wrote:

However, how sure are you really that this specific bug accounts for
all of the leakage you saw?

I'd not be surprised if there were more...

I'm wondering about that because I see some other stuff in be_tls_init()
that looks like it might get leaked, notably the root_cert_list read
from the ssl_ca_file.

I think that specific instance should be, at least in the non-error
paths, fine though:

https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_client_CA_list.html

SSL_CTX_set_client_CA_list() sets the list of CAs sent to the client
when requesting a client certificate for ctx. Ownership of list is
transferred to ctx and it should not be freed by the caller.

So it seems like we'd need cleanup root_cert_list in case of errors
(we'd not reach the SSL_CTX_set_client_CA_list), but not otherwise?

Given that we're careful to destroy the "temporary" ssl context in case
of error, perhaps the best way to deal with root_cert_list being freed
in case of error would be to assign it to the context as soon as its
loaded?

This code was originally meant to be run exactly
once at postmaster start, so it's not too surprising that it's not as
careful as it now needs to be.

Yea.

Greetings,

Andres Freund

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Andres Freund <andres@anarazel.de> writes:

On 2021-03-16 12:31:17 -0400, Tom Lane wrote:

I'm wondering about that because I see some other stuff in be_tls_init()
that looks like it might get leaked, notably the root_cert_list read
from the ssl_ca_file.

Given that we're careful to destroy the "temporary" ssl context in case
of error, perhaps the best way to deal with root_cert_list being freed
in case of error would be to assign it to the context as soon as its
loaded?

Yeah, I came to the same conclusion downthread. I'll go see about
making that happen.

regards, tom lane

#10Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Hi,

(replying here because Jelte's email doesn't yet seem to have gone
through moderation)

On 2021-03-16 13:36:24 -0400, Tom Lane wrote:

Jelte Fennema <postgres@jeltef.nl> writes:

I'm pretty sure it was the only cause in this specific case. When running
postgres with valgrind this was the only block that was repeatedly being
leaked.

I wonder if it'd be worth starting to explicitly annotate all the places
that do allocations and are fine with leaking them. E.g. by introducing
malloc_permanently() or such. Right now it's hard to use valgrind et al
to detect leaks because of all the false positives due to such "ok to
leak" allocations.

I sometimes think that we're not great at spotting leak errors because
we're so used to things getting cleaned up in case of error due to
memory contexts...

Greetings,

Andres Freund

#11Stephen Frost
sfrost@snowman.net
In reply to: Jelte Fennema-Nio (#4)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Greetings,

* Jelte Fennema (postgres@jeltef.nl) wrote:

We ran into this memory leak on PG11 in production. The lea was determined
to be the root cause of OOM errors we were seeing. There was a combination
of a things that caused this leak to become serious enough for these OOM
errors to happen:

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

To clarify the context a bit more if you're not familiar with the details
of vm.overcommit_memory: There's "used" memory and "commited_as" memory.
The copy-on-write memory in all backends is counted towards "commited_as"
memory. "used" memory does not increase for every backend, because it's
copy-on-write and none of the backends write to this memory (since it's
leaked so there's no live pointer to it).

Right- and is also why it's certainly important to be monitoring the
committed_as value vs the commit limit.

Linux puts a hard limit on commited_as, because we use
vm.overcommit_memory=2 (which means memory overcommitting is disabled). If
we had memory overcommiting enabled, then this memory leak wouldn't be a
real problem. The amount of "used" memory is pretty much negligable. It
only becomes a problem, because it's commited_as is multiplied for every
process and we care about commited_as because of disabled overcommiting.

Allowing overcommit, on the other hand, ends up with the Linux OOM
Killer running and sending essentially a kill -9 to PG, causing the
entire PG instance to crash and have to go through recovery.

It would be great if this could be backpatched to all currently supported
PG versions. The patch is very small, so it should be very little effort I
think. I'd be happy to help with that if that's useful or needed.

+1 on back-patching these fixes. -1 on what came across, to me at
least, as an argument for allowing overcommit. I realize you didn't
explicitly say that, but figured it'd be good for the archives to
discuss a bit more about why having overcommit_memory set to 2 is
strongly recommended. Without that, runaway queries could lead to the
OOM Killer running and the entire PG instance crashing.

Thanks,

Stephen

#12Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Stephen Frost (#11)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

Yes, they were gracefully handled. I also didn't mean to suggest that
disabling overcommit would be the right solution. We definitely don't want
to do that. I mainly added that to the email to make clear why a few MBs
expanded to effectively a few GBs.

On Tue, 16 Mar 2021 at 19:44, Stephen Frost <sfrost@snowman.net> wrote:

Show quoted text

Greetings,

* Jelte Fennema (postgres@jeltef.nl) wrote:

We ran into this memory leak on PG11 in production. The lea was

determined

to be the root cause of OOM errors we were seeing. There was a

combination

of a things that caused this leak to become serious enough for these OOM
errors to happen:

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

To clarify the context a bit more if you're not familiar with the details
of vm.overcommit_memory: There's "used" memory and "commited_as" memory.
The copy-on-write memory in all backends is counted towards "commited_as"
memory. "used" memory does not increase for every backend, because it's
copy-on-write and none of the backends write to this memory (since it's
leaked so there's no live pointer to it).

Right- and is also why it's certainly important to be monitoring the
committed_as value vs the commit limit.

Linux puts a hard limit on commited_as, because we use
vm.overcommit_memory=2 (which means memory overcommitting is disabled).

If

we had memory overcommiting enabled, then this memory leak wouldn't be a
real problem. The amount of "used" memory is pretty much negligable. It
only becomes a problem, because it's commited_as is multiplied for every
process and we care about commited_as because of disabled overcommiting.

Allowing overcommit, on the other hand, ends up with the Linux OOM
Killer running and sending essentially a kill -9 to PG, causing the
entire PG instance to crash and have to go through recovery.

It would be great if this could be backpatched to all currently supported
PG versions. The patch is very small, so it should be very little effort

I

think. I'd be happy to help with that if that's useful or needed.

+1 on back-patching these fixes. -1 on what came across, to me at
least, as an argument for allowing overcommit. I realize you didn't
explicitly say that, but figured it'd be good for the archives to
discuss a bit more about why having overcommit_memory set to 2 is
strongly recommended. Without that, runaway queries could lead to the
OOM Killer running and the entire PG instance crashing.

Thanks,

Stephen

#13Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#10)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

On Tue, Mar 16, 2021 at 11:12:26AM -0700, Andres Freund wrote:

I wonder if it'd be worth starting to explicitly annotate all the places
that do allocations and are fine with leaking them. E.g. by introducing
malloc_permanently() or such. Right now it's hard to use valgrind et al
to detect leaks because of all the false positives due to such "ok to
leak" allocations.

Yeah, I have been annoyed by those false positives in the past. It is
possible to know at quick glance what's wrong once you get used to it,
but anybody running valgrind would be right to be surprised, so having
a different code path for that sounds like a good idea.
--
Michael

#14Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#11)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Hi,

On 2021-03-16 14:44:55 -0400, Stephen Frost wrote:

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

Depends on what you mean with graceful. Unless you restart the database
you'll eventually not be able to do anything anymore, since even the
smallest memory allocation will fail?

Greetings,

Andres Freund

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Andres Freund <andres@anarazel.de> writes:

On 2021-03-16 14:44:55 -0400, Stephen Frost wrote:

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

Depends on what you mean with graceful. Unless you restart the database
you'll eventually not be able to do anything anymore, since even the
smallest memory allocation will fail?

Yeah, a leak in the postmaster is not gonna end well.

I went ahead and back-patched e0e569e1d. While looking at that,
I noticed that the load_dh_file subroutine was also a few DH_frees
shy of a load, so I fixed that too.

regards, tom lane

#16Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#14)
Re: BUG #16160: Minor memory leak in case of starting postgres server with SSL encryption

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2021-03-16 14:44:55 -0400, Stephen Frost wrote:

OOMs errors should be gracefully handled and PG should continue to
function. Was that not the case..?

Depends on what you mean with graceful. Unless you restart the database
you'll eventually not be able to do anything anymore, since even the
smallest memory allocation will fail?

Bit confused- I certainly agree with fixing the leak and back-patching
it, not sure how it came across otherwise. I was asking the questions
that I was to try to figure out if there was some other issue at play as
well.

Thanks,

Stephen