openssl valgrind failures on skink are due to openssl issue
Hi,
Skink a few days ago started failing [1]https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skink&br=HEAD[2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2019-06-10%2001%3A36%3A12 with errors like:
==2732== Conditional jump or move depends on uninitialised value(s)
==2732== at 0x4C612E3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C621FA: RAND_DRBG_generate (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C63620: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C61A09: RAND_DRBG_instantiate (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C62937: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C62CC9: RAND_DRBG_get0_public (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x4C62CEF: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732== by 0x69C6AF: pg_strong_random (pg_strong_random.c:135)
==2732== by 0x4A2841: InitProcessGlobals (postmaster.c:2581)
==2732== by 0x661137: InitStandaloneProcess (miscinit.c:322)
==2732== by 0x2943CF: AuxiliaryProcessMain (bootstrap.c:209)
==2732== by 0x4005D1: main (main.c:220)
==2732== Uninitialised value was created by a stack allocation
==2732== at 0x4C633B0: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
and then a lot of followup error.
Reproduced that locally to get a nicer trace:
==7146== Conditional jump or move depends on uninitialised value(s)
==7146== at 0x4B122E3: inc_128 (drbg_ctr.c:32)
==7146== by 0x4B122E3: drbg_ctr_generate (drbg_ctr.c:330)
==7146== by 0x4B131FA: RAND_DRBG_generate (drbg_lib.c:638)
==7146== by 0x4B14620: rand_drbg_get_entropy (rand_lib.c:172)
==7146== by 0x4B12A09: RAND_DRBG_instantiate (drbg_lib.c:338)
==7146== by 0x4B13937: drbg_setup (drbg_lib.c:892)
==7146== by 0x4B13CC9: RAND_DRBG_get0_public (drbg_lib.c:1120)
==7146== by 0x4B13CC9: RAND_DRBG_get0_public (drbg_lib.c:1109)
==7146== by 0x4B13CEF: drbg_bytes (drbg_lib.c:963)
==7146== by 0x87BD60: pg_strong_random (pg_strong_random.c:139)
==7146== by 0x5CAFFC: InitProcessGlobals (postmaster.c:2581)
==7146== by 0x81AAD7: InitStandaloneProcess (miscinit.c:322)
==7146== by 0x681A61: PostgresMain (postgres.c:3732)
==7146== by 0x4E3D86: main (main.c:224)
==7146== Uninitialised value was created by a stack allocation
==7146== at 0x4B143B0: rand_drbg_get_nonce (rand_lib.c:231)
reading through the code lead me to figure out that that's due to a
recent openssl change:
https://github.com/openssl/openssl/commit/b3d113ed2993801ee643126118ccf6592ad18ef7
as explained in
https://github.com/openssl/openssl/issues/8460
and fixed since in
https://github.com/openssl/openssl/commit/15d7e7997e219fc5fef3f6003cc6bd7b2e7379d4
For reasons I do not understand the "cosmetic change" was backpatched
into 1.1.1 And the fix for the cosmetic change, made on master at the
end of March, was only backpatched to 1.1.1 *after* the 1.1.1c release
was made in late May. I mean, huh.
That release was then installed on skink recently:
2019-06-01 06:39:20 upgrade libssl1.1:amd64 1.1.1b-2 1.1.1c-1
And finally the reason for the issue only being visible on master: I am
stupid, and matched branch names like REL9_6 instead of REL9_6_STABLE
etc to enable openssl (9.4 doesn't work against current openssl).
I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.
Greetings,
Andres Freund
[1]: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skink&br=HEAD
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2019-06-10%2001%3A36%3A12
Andres Freund <andres@anarazel.de> writes:
For reasons I do not understand the "cosmetic change" was backpatched
into 1.1.1 And the fix for the cosmetic change, made on master at the
end of March, was only backpatched to 1.1.1 *after* the 1.1.1c release
was made in late May. I mean, huh.
Bleah. Not that we've not made equally dumb mistakes :-(
I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.
Couldn't you install a local valgrind exclusion matching this stack trace?
regards, tom lane
Hi,
On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.Couldn't you install a local valgrind exclusion matching this stack trace?
Unfortunately no. The error spreads through significant parts of openssl
*and* postgres, because it taints the returned random value, which then
is used in a number of places. We could try to block all of those, but
that seems fairly painful. And one, to my knowledge, cannot do valgrind
suppressions based on the source of uninitialized memory.
Greetings,
Andres Freund
On Tue, Jun 11, 2019 at 02:07:29PM -0700, Andres Freund wrote:
On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.
Thanks for digging into the details of that! I was wondering if we
did something wrong on our side but the backtraces were weird.
--
Michael
Hi,
On 2019-06-11 14:07:29 -0700, Andres Freund wrote:
On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.Couldn't you install a local valgrind exclusion matching this stack trace?
Unfortunately no. The error spreads through significant parts of openssl
*and* postgres, because it taints the returned random value, which then
is used in a number of places. We could try to block all of those, but
that seems fairly painful. And one, to my knowledge, cannot do valgrind
suppressions based on the source of uninitialized memory.
What we could do is add a suppression like:
{
broken-openssl-accesses-random
Memcheck:Cond
...
fun:pg_strong_random
fun:InitProcessGlobals
fun:PostmasterMain
fun:main
}
(alternatively one suppression for each RAND_status, RAND_poll,
RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
and then prevent spread of the uninitialized memory by adding a
VALGRIND_MAKE_MEM_DEFINED(buf, len);
after a successful RAND_bytes() call.
I tested that that quiesces the problem locally. Probably not worth
pushing something like that though?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
What we could do is add a suppression like:
{
broken-openssl-accesses-random
Memcheck:Cond
...
fun:pg_strong_random
fun:InitProcessGlobals
fun:PostmasterMain
fun:main
}
(alternatively one suppression for each RAND_status, RAND_poll,
RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
and then prevent spread of the uninitialized memory by adding a
VALGRIND_MAKE_MEM_DEFINED(buf, len);
after a successful RAND_bytes() call.
I tested that that quiesces the problem locally. Probably not worth
pushing something like that though?
Yeah, that seems awfully aggressive to be pushing to machines that
don't have the problem. Did you get any sense of how fast the
openssl fix is goinng to show up?
regards, tom lane
Hi,
On 2019-06-18 19:25:12 -0400, Tom Lane wrote:
Did you get any sense of how fast the openssl fix is goinng to show up?
It's merged to both branches that contain the broken code. Now we need
to wait for the next set of openssl releases, and then for distros to
pick that up. Based on the past release cadence
https://www.openssl.org/news/openssl-1.1.1-notes.html
that seems to be likely to happen within 2-3 months.
Greetings,
Andres Freund
On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
It's merged to both branches that contain the broken code. Now we need
to wait for the next set of openssl releases, and then for distros to
pick that up. Based on the past release cadence
https://www.openssl.org/news/openssl-1.1.1-notes.html
that seems to be likely to happen within 2-3 months.
If that's for the buildfarm coverage. I would be of the opinion to
wait a bit. Another possibility is that you could compile your own
version of OpenSSL with the patch included, say only 1.1.1c with the
patch. Still that would cause the plpython tests to complain as the
system's python may still link to the system's OpenSSL which is
broken?
Another possibility would be to move back to 1.1.1b for the time
being...
--
Michael
On 2019-06-19 11:28:16 +0900, Michael Paquier wrote:
On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
It's merged to both branches that contain the broken code. Now we need
to wait for the next set of openssl releases, and then for distros to
pick that up. Based on the past release cadence
https://www.openssl.org/news/openssl-1.1.1-notes.html
that seems to be likely to happen within 2-3 months.If that's for the buildfarm coverage. I would be of the opinion to
wait a bit.
Yea. For now I've just disabled ssl support on skink, but that has its
own disadvantages.
Another possibility is that you could compile your own
version of OpenSSL with the patch included, say only 1.1.1c with the
patch.
Really, I can do that?
On Tue, Jun 18, 2019 at 07:44:26PM -0700, Andres Freund wrote:
Really, I can do that?
Here is some of the stuff I use, just for the reference:
./Configure linux-x86_64 --prefix=$HOME/stable/openssl/1.1.1/
./config --prefix=$HOME/stable/openssl/1.1.1 shared
--
Michael