PG 10: could not generate random cancel key

Started by Dean Rasheedover 7 years ago11 messages
#1Dean Rasheed
dean.a.rasheed@gmail.com
1 attachment(s)

Last week I upgraded 15 servers from various pre-10 versions to 10.4.
At first everything looked OK, but then (around 4 days later) one of
them failed with this in the logs:

2018-07-14 01:53:35.840 BST LOG: could not generate random cancel key
2018-07-14 01:53:37.233 BST LOG: could not generate random cancel key
2018-07-14 01:53:37.245 BST LOG: could not generate random cancel key
2018-07-14 01:53:38.553 BST LOG: could not generate random cancel key
2018-07-14 01:53:38.581 BST LOG: could not generate random cancel key
2018-07-14 01:54:43.851 BST WARNING: worker took too long to start; canceled
2018-07-14 01:54:43.862 BST LOG: could not generate random cancel key
2018-07-14 01:55:09.861 BST LOG: could not generate random cancel key
2018-07-14 01:55:09.874 BST LOG: could not generate random cancel key
...

After that it would not accept any new connections until I restarted
postmaster a few hours later. Since then, it has been OK.

It was built using --with-openssl and strong random support enabled,
so it was OpenSSL's RAND_bytes() that failed for some reason. I
attempted to reproduce it with a small C program directly calling
RAND_bytes(), but it refused to fail, even if I disabled haveged and
ran my tests in an @reboot cron job. So this failure is evidently
quite rare, but the documentation for RAND_bytes() says it *can* fail
(returning 0) if it isn't seeded with enough entropy, in which case
more must be added, which we're not doing.

In addition, once it does fail, repeated calls to RAND_bytes() will
continue to fail if it isn't seeded with more data -- hence the
inability to start any new backends until after a postmaster restart,
which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status()
[1]: https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html

RAND_status() indicates whether or not the CSPRNG has been sufficiently
seeded. If not, functions such as RAND_bytes(3) will fail.

and if not, RAND_poll() can be used to fix that:

RAND_poll() uses the system's capabilities to seed the CSPRNG using
random input obtained from polling various trusted entropy sources. The
default choice of the entropy source can be modified at build time using
the --with-rand-seed configure option, see also the NOTES section. A
summary of the configure options can be displayed with the OpenSSL
version(1) command.

Looking for precedents elsewhere, I found [2]https://github.com/nodejs/node/blob/master/src/node_crypto.cc which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3]https://github.com/openssl/openssl/issues/4148, which recommends simply doing

if (RAND_status() == 0)
RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Thoughts?

Regards,
Dean

[1]: https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2]: https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3]: https://github.com/openssl/openssl/issues/4148

Attachments:

pg_strong_random.patchtext/x-patch; charset=US-ASCII; name=pg_strong_random.patchDownload
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
new file mode 100644
index bc7a8aa..fd5ad92
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,9 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	/* Make sure that OpenSSL's CSPRNG has been sufficiently seeded */
+	if (RAND_status() == 0)
+		(void) RAND_poll();
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;
#2Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#1)
Re: PG 10: could not generate random cancel key

On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

if (RAND_status() == 0)
RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Checking for the return result of RAND_poll() would also be good thing
to do. From what I read in OpenSSL code it could fail as well, and
we could combine that with a loop attempted to feed the machinery a
decided amount of times, just failing after successive failures.
--
Michael

#3Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#2)
Re: PG 10: could not generate random cancel key

On 17 July 2018 at 14:04, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

if (RAND_status() == 0)
RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Checking for the return result of RAND_poll() would also be good thing
to do. From what I read in OpenSSL code it could fail as well, and
we could combine that with a loop attempted to feed the machinery a
decided amount of times, just failing after successive failures.

From what I understand from here [1]https://wiki.openssl.org/index.php/Random_Numbers, some parts of OpenSSL call
RAND_poll() once on initialisation, and that's enough to get the PRNG
going. It's not obvious that calling it multiple times would have any
benefit.

They also don't appear to bother checking the return code from
RAND_poll() [2]https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c. If it did fail, there'd not be much you could do
anyway, so you might as well just let it continue and let RAND_bytes()
fail. In fact it may even be possible for RAND_poll() to fail, but
just do enough to cause RAND_bytes() to succeed.

Regards,
Dean

[1]: https://wiki.openssl.org/index.php/Random_Numbers
[2]: https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c

#4Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#1)
Re: PG 10: could not generate random cancel key

On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

if (RAND_status() == 0)
RAND_poll();

Looks like a recipe for an infinite loop. At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#4)
Re: PG 10: could not generate random cancel key

On 2018-Jul-17, Robert Haas wrote:

On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

if (RAND_status() == 0)
RAND_poll();

Looks like a recipe for an infinite loop. At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

What loop?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#5)
Re: PG 10: could not generate random cancel key

On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2018-Jul-17, Robert Haas wrote:

On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

if (RAND_status() == 0)
RAND_poll();

Looks like a recipe for an infinite loop. At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

What loop?

Ugh, I'm not doing very well today, am I? I read that as while() but
it says if().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#6)
Re: PG 10: could not generate random cancel key

On Tue, Jul 17, 2018 at 01:31:01PM -0400, Robert Haas wrote:

On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2018-Jul-17, Robert Haas wrote:

On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

if (RAND_status() == 0)
RAND_poll();

Looks like a recipe for an infinite loop. At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

What loop?

The CHECK_FOR_INTERRUPTS() addition could be an addition if the number
of retries gets high enough, but that just does not apply here.

Ugh, I'm not doing very well today, am I? I read that as while() but
it says if().

Time for vacations :)
--
Michael

#8Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#3)
Re: PG 10: could not generate random cancel key

On Tue, Jul 17, 2018 at 02:28:14PM +0100, Dean Rasheed wrote:

From what I understand from here [1], some parts of OpenSSL call
RAND_poll() once on initialisation, and that's enough to get the PRNG
going. It's not obvious that calling it multiple times would have any
benefit.

They also don't appear to bother checking the return code from
RAND_poll() [2]. If it did fail, there'd not be much you could do
anyway, so you might as well just let it continue and let RAND_bytes()
fail. In fact it may even be possible for RAND_poll() to fail, but
just do enough to cause RAND_bytes() to succeed.

[1] https://wiki.openssl.org/index.php/Random_Numbers

This quote from the wiki is scary so that's not quite clean either for
Windows:
"Be careful when deferring to RAND_poll on some Unix systems because it
does not seed the generator. See the code guarded with
OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
negative interactions on newer Windows platforms, so your program could
hang or crash depending on the potential issue. See Windows Issues
below."

[2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c

This repository is outdated, on OpenSSL HEAD I am seeing this used only
in rand_win.c. And this commit is sort of interesting because there was
a retry loop done with RAND_poll(). Please see this one:
commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
author: Dr. Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
date: Thu, 31 Aug 2017 23:16:22 +0200
committer: Ben Kaduk <kaduk@mit.edu>
date: Wed, 18 Oct 2017 08:39:20 -0500
Fix reseeding issues of the public RAND_DRBG

apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().
--
Michael

#9Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#8)
1 attachment(s)
Re: PG 10: could not generate random cancel key

On 18 July 2018 at 03:17, Michael Paquier <michael@paquier.xyz> wrote:

[1] https://wiki.openssl.org/index.php/Random_Numbers

This quote from the wiki is scary so that's not quite clean either for
Windows:
"Be careful when deferring to RAND_poll on some Unix systems because it
does not seed the generator. See the code guarded with
OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
negative interactions on newer Windows platforms, so your program could
hang or crash depending on the potential issue. See Windows Issues
below."

I think that wiki page is somewhat out of date in places. Both the
Windows issues it links to seem to have been fixed a long time ago, so
I think using RAND_poll() is probably safe now, although perhaps there
are still some Unix platforms on which it won't help either.

[2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c

This repository is outdated, on OpenSSL HEAD I am seeing this used only
in rand_win.c. And this commit is sort of interesting because there was
a retry loop done with RAND_poll(). Please see this one:
commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
author: Dr. Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
date: Thu, 31 Aug 2017 23:16:22 +0200
committer: Ben Kaduk <kaduk@mit.edu>
date: Wed, 18 Oct 2017 08:39:20 -0500
Fix reseeding issues of the public RAND_DRBG

apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().

OK, I guess that it is possible that an older version of OpenSSL
requires RAND_poll() to be called multiple times. Here's an updated
patch doing that (with up to 8 retries, based on the old OpenSSL
code).

Regards,
Dean

Attachments:

pg_strong_random-v2.patchtext/x-patch; charset=US-ASCII; name=pg_strong_random-v2.patchDownload
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
new file mode 100644
index bc7a8aa..a2ab8c1
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,37 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	/*
+	 * Check that OpenSSL's CSPRNG has been sufficiently seeded, and if not
+	 * add more seed data using RAND_poll().  With some older versions of
+	 * OpenSSL, it may be necessary to call RAND_poll() a number of times.
+	 */
+#define NUM_RAND_POLL_RETRIES 8
+
+	if (RAND_status() == 0)
+	{
+		int			i;
+
+		for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
+		{
+			if (RAND_poll() == 0)
+			{
+				/*
+				 * RAND_poll() failed to generate any seed data, which means
+				 * that RAND_bytes() will probably fail.  For now, just fall
+				 * through and let that happen.  XXX: maybe we could seed it
+				 * some other way.
+				 */
+				break;
+			}
+
+			if (RAND_status() == 1)
+			{
+				/* The CSPRNG is now sufficiently seeded */
+				break;
+			}
+		}
+	}
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;
#10Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#9)
1 attachment(s)
Re: PG 10: could not generate random cancel key

On Wed, Jul 18, 2018 at 10:14:56AM +0100, Dean Rasheed wrote:

OK, I guess that it is possible that an older version of OpenSSL
requires RAND_poll() to be called multiple times. Here's an updated
patch doing that (with up to 8 retries, based on the old OpenSSL
code).

Thanks for the updated version. This looks safer to me. It is possible
to simplify the code by removing the external RAND_status() call and
check for RAND_status() first in the loop as per the attached.
--
Michael

Attachments:

pg_strong_random-v3-michael.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
index bc7a8aacb9..8ed5c04459 100644
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,35 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	int			i;
+
+	/*
+	 * Check that OpenSSL's CSPRNG has been sufficiently seeded, and if not
+	 * add more seed data using RAND_poll().  With some older versions of
+	 * OpenSSL, it may be necessary to call RAND_poll() a number of times.
+	 */
+#define NUM_RAND_POLL_RETRIES 8
+
+	for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
+	{
+		if (RAND_status() == 1)
+		{
+			/* CSPRNG is sufficiently seeded */
+			break;
+		}
+
+		if (RAND_poll() == 0)
+		{
+			/*
+			 * RAND_poll() failed to generate any seed data, which means that
+			 * RAND_bytes() will probably fail.  For now, just fall through
+			 * and let that happen.  XXX: maybe we could seed it some other
+			 * way.
+			 */
+			break;
+		}
+	}
+
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;
#11Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#10)
Re: PG 10: could not generate random cancel key

On 18 July 2018 at 14:01, Michael Paquier <michael@paquier.xyz> wrote:

Thanks for the updated version. This looks safer to me. It is possible
to simplify the code by removing the external RAND_status() call and
check for RAND_status() first in the loop as per the attached.

OK, thanks.

Barring any further comments, I'll push and back-patch this to v10
later this week.

Regards,
Dean