jacana hung after failing to acquire random number
jacana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 messages in
its log about "Could not acquire random number". Then nothing.
So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.
Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl,
doesn't have this issue.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
jacana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 messages in
its log about "Could not acquire random number". Then nothing.

So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.

Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl,
doesn't have this issue.
Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15
And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire
random number
I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not good performance-wise, though. A
second difference is the missing CRYPT_MACHINE_KEYSET and
CRYPT_NEWKEYSET flags. So, with something like the attached patch, do you
see improvements? What this patch does is to use a different context at
each call, with key container flags that allow proper access to it (as
there are WinXP-only limitations here). I have tried to compile with
MinGW in my environment, but my gcc compiler keeps crashing, so I cannot
directly reproduce the problem, I am afraid.
--
Michael
Attachments:
strong-random-win.patch (text/x-diff, charset=US-ASCII)
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
index 6d3aa38..34cab6f 100644
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -31,14 +31,6 @@
#include <Wincrypt.h>
#endif
-#ifdef WIN32
-/*
- * Cache a global crypto provider that only gets freed when the process
- * exits, in case we need random numbers more than once.
- */
-static HCRYPTPROV hProvider = 0;
-#endif
-
#if defined(USE_DEV_URANDOM)
/*
* Read (random) bytes from a file.
@@ -111,28 +103,36 @@ pg_strong_random(void *buf, size_t len)
* Windows has CryptoAPI for strong cryptographic numbers.
*/
#elif defined(USE_WIN32_RANDOM)
- if (hProvider == 0)
+ HCRYPTPROV hProvider;
+ DWORD flags;
+
+ flags = CRYPT_VERIFYCONTEXT | CRYPT_SILENT | CRYPT_MACHINE_KEYSET;
+
+ /* Create a crypto provider */
+ if (!CryptAcquireContext(&hProvider,
+ NULL,
+ MS_DEF_PROV,
+ PROV_RSA_FULL,
+ flags))
{
+ /* If previous creation failed, try with a new key container */
+ flags |= CRYPT_NEWKEYSET;
if (!CryptAcquireContext(&hProvider,
NULL,
MS_DEF_PROV,
PROV_RSA_FULL,
- CRYPT_VERIFYCONTEXT | CRYPT_SILENT))
- {
- /*
- * On failure, set back to 0 in case the value was for some reason
- * modified.
- */
- hProvider = 0;
- }
+ flags))
+ return false;
}
- /* Re-check in case we just retrieved the provider */
- if (hProvider != 0)
+
+ if (!CryptGenRandom(hProvider, len, buf))
{
- if (CryptGenRandom(hProvider, len, buf))
- return true;
+ CryptReleaseContext(hProvider, 0);
+ return false;
}
- return false;
+
+ CryptReleaseContext(hProvider, 0);
+ return true;
/*
* Read /dev/urandom ourselves.
On 12/12/2016 05:58 AM, Michael Paquier wrote:
I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not good performance-wise, though.
Actually, looking at the config.log on jacana, it's trying to use
/dev/urandom:
configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom
And looking closer at configure.in, I can see why:
elif test "$PORTNAME" = x"win32" ; then
USE_WIN32_RANDOM=1
That test is broken. It looks like the x"$VAR" = x"constant" idiom, but
the left side of the comparison doesn't have the 'x'. Oops.
Fixed that, let's see if it made jacana happy again.
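The broken idiom is easy to see in a few lines of plain shell. This is an illustrative sketch with PORTNAME set by hand, not the real configure.in logic:

```shell
# Broken form: only the right-hand side carries the 'x' prefix, so the
# comparison can never be true, even when PORTNAME really is win32.
PORTNAME=win32
if test "$PORTNAME" = x"win32"; then
  echo "broken test matched"
else
  echo "broken test never matches"
fi

# Fixed idiom: prefix both sides. The 'x' also guards against an empty
# or dash-leading value confusing test(1).
if test x"$PORTNAME" = x"win32"; then
  echo "fixed test matched"
fi
```

With the prefix missing on the left, the win32 branch could never fire, which is why configure silently fell through to the /dev/urandom branch on jacana.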
This makes me wonder if we should work a bit harder to get a good error
message if acquiring a random number fails for any reason. This needs
to work in the frontend as well as the backend, but we could still have
an elog(LOG, ...) there, inside an #ifndef FRONTEND block.
- Heikki
On Mon, Dec 12, 2016 at 4:32 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the left side of the comparison doesn't have the 'x'. Oops.
Good catch.
This makes me wonder if we should work a bit harder to get a good error
message if acquiring a random number fails for any reason. This needs to
work in the frontend as well as the backend, but we could still have an
elog(LOG, ...) there, inside an #ifndef FRONTEND block.
Yeah, this has been itching me as well. We could allocate an error
string with psprintf() and let the callers of pg_strong_random() use
it, as they are responsible for the error handling. What do you think?
--
Michael
On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:
This makes me wonder if we should work a bit harder to get a good error
message if acquiring a random number fails for any reason. This needs
to work in the frontend as well as the backend, but we could still have
an elog(LOG, ...) there, inside an #ifndef FRONTEND block.
I see you have now improved the messages in postmaster.c, which is good.
However, the bigger problem (ISTM) is that when this failed I had a
system which was running but where every connection immediately failed:
============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
pg_regress: postmaster did not respond within 120 seconds
Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason
make: *** [check] Error 2
Should one or more of these errors be fatal? Or should we at least get
pg_regress to try to shut down the postmaster if it can't connect after
120 seconds?
[In answer to Michael's question above, I forcibly shut down the
postmaster by hand. Otherwise it would still be running, and we would
not have got the report on the buildfarm server.]
cheers
andrew
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
I see you have now improved the messages in postmaster.c, which is good.
Well, I only wordsmithed them a bit; they still don't give much of a
clue about why it failed. We should add more details.
However, the bigger problem (ISTM) is that when this failed I had a
system which was running but where every connection immediately failed:

============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
pg_regress: postmaster did not respond within 120 seconds
Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason
make: *** [check] Error 2

Should one or more of these errors be fatal? Or should we at least get
pg_regress to try to shut down the postmaster if it can't connect after
120 seconds?
Making it fatal, i.e. bringing down the server, doesn't seem like an
improvement. If the failure is transient, you don't want to kill the
whole server, when one connection attempt fails.
It would be nice to fail earlier if it's permanently failing, though.
Like, if someone does "rm /dev/urandom". Perhaps we should perform one
pg_strong_random() call at postmaster startup, and if that fails, refuse
to start up.
- Heikki
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
Or should we at least get pg_regress to try to shut down the
postmaster if it can't connect after 120 seconds?
I think that makes a lot of sense, independently of this random stuff.
- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
Should one or more of these errors be fatal? Or should we at least get
pg_regress to try to shut down the postmaster if it can't connect after
120 seconds?
Making it fatal, i.e. bringing down the server, doesn't seem like an
improvement. If the failure is transient, you don't want to kill the
whole server, when one connection attempt fails.
It would be nice to fail earlier if it's permanently failing, though.
Like, if someone does "rm /dev/urandom". Perhaps we should perform one
pg_strong_random() call at postmaster startup, and if that fails, refuse
to start up.
That's sort of contradictory. If you're worried about transient failures,
allowing a single failed try to cause postmaster startup failure isn't the
way to make things more robust. Giving up after a bunch of failed tries
over a very short interval isn't much better.
I'm not sure how hard we need to work here. The case at hand seems
to be one of simply not having gotten the bugs out of the initial
implementation, so maybe we shouldn't read too much into it.
I do agree that the buildfarm needs to be more robust against broken
postmasters, because finding bugs is its raison d'être. But I'm not
convinced that it's a good idea to have the postmaster itself conclude
that there's something wrong with its configured random-number source.
regards, tom lane
On 12/12/2016 09:02 AM, Heikki Linnakangas wrote:
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
Or should we at least get pg_regress to try to shut down the
postmaster if it can't connect after 120 seconds?

I think that makes a lot of sense, independently of this random stuff.
I will add it to my long TODO list.
cheers
andrew