PQexec() hangs on OOM

Started by Heikki Linnakangasover 11 years ago48 messagesbugs
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Alex Shulgin's find of a missing NULL check after strdup()
(87tx1or3cc.fsf@commandprompt.com) prompted me to do some testing of
libpq, when malloc/strdup returns NULL. To simulate running out of
memory, I wrote a little LD_PRELOAD library that allows causing malloc()
to return NULL, after a particular number of calls. and ran
src/test/examples/testlibpq with that hack.

After fixing all the missing NULL-checks, I found that testlibpq
sometimes just hangs. It happens when this malloc() call in
PQmakeEmptyPGResult() fails:

#3 0x00007f6dc86495c0 in malloc () from /home/heikki/mallocfail.so
#4 0x00007f6dc8423b6e in PQmakeEmptyPGresult (conn=0x1bea040,
status=PGRES_COMMAND_OK) at fe-exec.c:144
#5 0x00007f6dc8430b27 in pqParseInput3 (conn=0x1bea040) at
fe-protocol3.c:204
#6 0x00007f6dc8426468 in parseInput (conn=0x1bea040) at fe-exec.c:1652
#7 0x00007f6dc8426583 in PQgetResult (conn=0x1bea040) at fe-exec.c:1727
#8 0x00007f6dc8426c76 in PQexecFinish (conn=0x1bea040) at fe-exec.c:2000
#9 0x00007f6dc84268ad in PQexec (conn=0x1bea040, query=0x400e32
"BEGIN") at fe-exec.c:1834
#10 0x0000000000400b18 in main (argc=1, argv=0x7fffc9b6a568) at
testlibpq.c:59

When that malloc() returns NULL, parseInput returns without reading any
input. PQgetResult() takes that as a sign that it needs to read more
input from the server, before calling parseInput() again, and that read
never returns because there is no more data coming from the server.

I don't have any immediate plans to fix this, or to continue testing
this. There might well be more cases like this. Patches are welcome.

Attached is the little wrapper library I used to test this. testlibpq
hangs when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry
about that. I'm sure there are more sophisticated tools to do similar
testing out there somewhere..

- Heikki

Attachments:

mallocfail.ctext/x-csrc; name=mallocfail.cDownload
#2Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#1)
Re: PQexec() hangs on OOM

On Tue, Nov 25, 2014 at 10:15 PM, Heikki Linnakangas wrote:

When that malloc() returns NULL, parseInput returns without reading any
input. PQgetResult() takes that as a sign that it needs to read more input
from the server, before calling parseInput() again, and that read never
returns because there is no more data coming from the server.

I don't have any immediate plans to fix this, or to continue testing this.
There might well be more cases like this. Patches are welcome.

Attached is the little wrapper library I used to test this. testlibpq

hangs

when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry about
that. I'm sure there are more sophisticated tools to do similar testing

out

there somewhere..

With MALLOC_FAIL_AT=84, 86, 92, the backtrace just before the malloc
creating the OOM looks like that:
#0 0x00007f76316964d0 in __poll_nocancel () from /usr/lib/libc.so.6
#1 0x00007f7631971577 in pqSocketPoll (sock=4, forRead=1, forWrite=0,
end_time=-1) at fe-misc.c:1133
#2 0x00007f7631971461 in pqSocketCheck (conn=0x1495040, forRead=1,
forWrite=0, end_time=-1) at fe-misc.c:1075
#3 0x00007f76319712f8 in pqWaitTimed (forRead=1, forWrite=0,
conn=0x1495040, finish_time=-1) at fe-misc.c:1007
#4 0x00007f76319712ca in pqWait (forRead=1, forWrite=0, conn=0x1495040) at
fe-misc.c:990
#5 0x00007f763196d21d in PQgetResult (conn=0x1495040) at fe-exec.c:1711
#6 0x00007f763196d913 in PQexecFinish (conn=0x1495040) at fe-exec.c:1997
#7 0x00007f763196d576 in PQexec (conn=0x1495040, query=0x400ef2 "BEGIN")
at fe-exec.c:1831
#8 0x0000000000400bd8 in main (argc=1, argv=0x7ffd8a2644c8) at
testlibpq.c:5

In this case, as what happens is an OOM related to the allocation of
PGResult, I think that we had better store a status flag in PGConn related
to this OOM, as PGConn->errorMessage may not be empty to take care of the
ambiguity that PGResult == NULL does not necessarily mean wait for more
results. Something like PGResultStatus to avoid any API incompatibility.
Thoughts?

Looking at the other malloc() calls of llibpq, we do not really have this
ambiguity. For example if makeEmptyPGconn() == NULL means OOM. I am
guessing from the code as well that PQmakeEmptyPGresult() == NULL means
OOM, so the error handling problem comes from parseInput and its underlings.

Also in pqSaveParameterStatus, shouldn't we have a better OOM handling
there as well for pstatus?
Regards,
--
Michael

#3Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#2)
Re: PQexec() hangs on OOM

On Mon, Apr 6, 2015 at 3:54 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

In this case, as what happens is an OOM related to the allocation of
PGResult, I think that we had better store a status flag in PGConn related
to this OOM, as PGConn->errorMessage may not be empty to take care of the
ambiguity that PGResult == NULL does not necessarily mean wait for more
results. Something like PGResultStatus to avoid any API incompatibility.
Thoughts?

Second idea: return a status with parseInput as it is not part of the APIs
of libpq.
--
Michael

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#3)
Re: PQexec() hangs on OOM

Michael Paquier <michael.paquier@gmail.com> writes:

Second idea: return a status with parseInput as it is not part of the APIs
of libpq.

Yeah; most subroutines in libpq have a zero-or-EOF return convention,
we can make parseInput do likewise. I'm not sure if that'd need to
propagate further down though ...

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#4)
Re: PQexec() hangs on OOM

On Mon, Apr 6, 2015 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

Second idea: return a status with parseInput as it is not part of the APIs
of libpq.

Yeah; most subroutines in libpq have a zero-or-EOF return convention,
we can make parseInput do likewise. I'm not sure if that'd need to
propagate further down though ...

So, I have been looking at that in more details, and finished with the
patch attached that makes the problem go away for me with a nice "out
of memory" error. I have extended parseInput() to have it return a
status code to decide if parsing should be stopped or should continue.
Note that I have tried to be careful to make a clear distinction
between cases where routines return EOF because of not enough data
parsed and actual OOMs.

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

This problem has a low probability to happen in the field, and no
people complained about that as well, so I think that patching only
HEAD is adapted.
Regards,
--
Michael

Attachments:

0001-Improve-OOM-detection-of-parseInput-in-libpq.patchtext/x-patch; charset=US-ASCII; name=0001-Improve-OOM-detection-of-parseInput-in-libpq.patchDownload+134-74
#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#5)
Re: PQexec() hangs on OOM

On 04/07/2015 09:18 AM, Michael Paquier wrote:

@@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)

result = (PGresult *) malloc(sizeof(PGresult));
if (!result)
+	{
+		printfPQExpBuffer(&conn->errorMessage,
+						  libpq_gettext("out of memory\n"));
return NULL;
+	}

result->ntups = 0;
result->numAttributes = 0;

That's not cool, because PQmakeEmptyPGresult may be called with conn ==
NULL. In general, I'm a bit wary of changing PQmakeEmptyResult so that
it sets the error message. Will have to check all the callers carefully
to see if that would upset any of them. And it might be called by
applications too.

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

Yeah. getParamDescription still has the issue, with your patch.

This problem has a low probability to happen in the field, and no
people complained about that as well, so I think that patching only
HEAD is adapted.

As long as the patch applies easily, I don't see much reason to not
backpatch.

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#6)
Re: PQexec() hangs on OOM

On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:

On 04/07/2015 09:18 AM, Michael Paquier wrote:

@@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType
status)

[...]

That's not cool, because PQmakeEmptyPGresult may be called with conn ==
NULL. In general, I'm a bit wary of changing PQmakeEmptyResult so that it
sets the error message. Will have to check all the callers carefully to see
if that would upset any of them. And it might be called by applications too.

Oops, my mistake. For a patch arguing to not change how libpq routines
behave this is bad, and contrary to what is mentioned in the docs as
well. I moved the error message into parseInput. I guess it makes more
sense there.

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

Yeah. getParamDescription still has the issue, with your patch.

Check.

This problem has a low probability to happen in the field, and no
people complained about that as well, so I think that patching only
HEAD is adapted.

As long as the patch applies easily, I don't see much reason to not
backpatch.

It applies cleanly down to 9.1. For 9.0 some minor tweaks are visibly needed.
In any case, take two is attached.
Regards,
--
Michael

Attachments:

0001-Improve-OOM-detection-of-parseInput-in-libpq.patchapplication/x-patch; name=0001-Improve-OOM-detection-of-parseInput-in-libpq.patchDownload+166-79
#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#7)
Re: PQexec() hangs on OOM

On 04/08/2015 07:27 AM, Michael Paquier wrote:

On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:

On 04/07/2015 09:18 AM, Michael Paquier wrote:

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

Yeah. getParamDescription still has the issue, with your patch.

Check.

There are still a few parseInput subroutines that have similar issues.
In fe-protocol2.c:

* pqGetErrorNotice2 returns EOF if there is not enough data, but also on
OOM. The caller assumes it's because not enough data.

* getRowDescriptions() is the same, although it sets conn->errorMessage
on OOM.

* getAnotherTuple() is the same, but it also skips over the received
data, so you won't get stuck. So maybe that's OK.

* getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

The corresponding functions in fe-protocol3.c are pretty much identical,
with the same issues. In addition:

* getParameterStatus will act as if the parameter value was "out of
memory". That'll be fun for something like client_encoding or
standard_conforming_strings. Would be good to use a small char[100]
variable, in stack, if it fits, and only use malloc for larger values.
That would cover all the common variables that need to be machine-parsed.

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#8)
Re: PQexec() hangs on OOM

On Fri, Apr 10, 2015 at 2:42 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/08/2015 07:27 AM, Michael Paquier wrote:

On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:

On 04/07/2015 09:18 AM, Michael Paquier wrote:

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

Yeah. getParamDescription still has the issue, with your patch.

Check.

There are still a few parseInput subroutines that have similar issues. In
fe-protocol2.c:

* pqGetErrorNotice2 returns EOF if there is not enough data, but also on
OOM. The caller assumes it's because not enough data.
* getRowDescriptions() is the same, although it sets conn->errorMessage on
OOM.

OK, I extended those two with the same logic as previously.

* getAnotherTuple() is the same, but it also skips over the received data,
so you won't get stuck. So maybe that's OK.

I think that it should be changed for consistency with the others, so
done this way for this function, and this way we only need to bail-out
if status == -2, a status of -1 meaning that there is not enough data.

* getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

This one is different though, it directly skips the messages.

The corresponding functions in fe-protocol3.c are pretty much identical,
with the same issues. In addition:

* getParameterStatus will act as if the parameter value was "out of memory".
That'll be fun for something like client_encoding or
standard_conforming_strings. Would be good to use a small char[100]
variable, in stack, if it fits, and only use malloc for larger values. That
would cover all the common variables that need to be machine-parsed.

Are you suggesting to replace PQExpBuffer?

So, attached is take three for all the other things above.
Regards,
--
Michael

Attachments:

0001-Improve-OOM-detection-of-parseInput-in-libpq.patchtext/x-patch; charset=US-ASCII; name=0001-Improve-OOM-detection-of-parseInput-in-libpq.patchDownload+374-182
#10Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Michael Paquier (#9)
Re: PQexec() hangs on OOM

Michael Paquier <michael.paquier@gmail.com> writes:

There are still a few parseInput subroutines that have similar issues. In
fe-protocol2.c:

* pqGetErrorNotice2 returns EOF if there is not enough data, but also on
OOM. The caller assumes it's because not enough data.
* getRowDescriptions() is the same, although it sets conn->errorMessage on
OOM.

OK, I extended those two with the same logic as previously.

* getAnotherTuple() is the same, but it also skips over the received data,
so you won't get stuck. So maybe that's OK.

I think that it should be changed for consistency with the others, so
done this way for this function, and this way we only need to bail-out
if status == -2, a status of -1 meaning that there is not enough data.

* getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

This one is different though, it directly skips the messages.

The corresponding functions in fe-protocol3.c are pretty much identical,
with the same issues. In addition:

* getParameterStatus will act as if the parameter value was "out of memory".
That'll be fun for something like client_encoding or
standard_conforming_strings. Would be good to use a small char[100]
variable, in stack, if it fits, and only use malloc for larger values. That
would cover all the common variables that need to be machine-parsed.

Are you suggesting to replace PQExpBuffer?

So, attached is take three for all the other things above.

There's still one call to pqGetErrorNotice3 that doesn't check returned
value for -2, in fe-connect.c. Shouldn't we also check it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e7c7a25..330b8da 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2242,6 +2242,7 @@ keep_going:						/* We will come back to here until there is
 				int			msgLength;
 				int			avail;
 				AuthRequest areq;
+				int			res;
 				/*
 				 * Scan the message from current point (note that if we find
@@ -2366,11 +2367,18 @@ keep_going:						/* We will come back to here until there is
 				{
 					if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
 					{
-						if (pqGetErrorNotice3(conn, true))
+						res = pqGetErrorNotice3(conn, true);
+						if (res == -1)
 						{
 							/* We'll come back when there is more data */
 							return PGRES_POLLING_READING;
 						}
+						else if (res == -2)
+						{
+							printfPQExpBuffer(&conn->errorMessage,
+											  libpq_gettext("out of memory\n"));
+							goto error_return;
+						}
 					}
 					else
 					{

--
Alex

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#11Michael Paquier
michael@paquier.xyz
In reply to: Shulgin, Oleksandr (#10)
Re: PQexec() hangs on OOM

On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

So, attached is take three for all the other things above.

There's still one call to pqGetErrorNotice3 that doesn't check returned
value for -2, in fe-connect.c. Shouldn't we also check it like this:

[...]

Yes, you are right. Take 4 attached is updated with something similar
to what you sent to cover this code path.
--
Michael

Attachments:

0001-Improve-OOM-detection-of-parseInput-in-libpq.patchtext/x-diff; charset=US-ASCII; name=0001-Improve-OOM-detection-of-parseInput-in-libpq.patchDownload+382-183
#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#11)
Re: PQexec() hangs on OOM

On 05/26/2015 10:01 AM, Michael Paquier wrote:

On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

So, attached is take three for all the other things above.

There's still one call to pqGetErrorNotice3 that doesn't check returned
value for -2, in fe-connect.c. Shouldn't we also check it like this:

[...]

Yes, you are right. Take 4 attached is updated with something similar
to what you sent to cover this code path.

This is still a few bricks shy of a load. Before this patch, if you run
out of memory when allocating a large result set - which is probably the
most common reason for OOM in libpq - you got this error:

postgres=# select generate_series(1, 10000000);
out of memory for query result

With this patch, I got:

postgres=# select generate_series(1, 10000000);
server sent data ("D" message) without prior row description ("T" message)

Looking at the patch again, I think we should actually leave
getAnotherTuple() alone. It's a lot nicer if getAnotherTuple() can
handle the OOM error itself than propagating it to the caller.

There's only one caller for getAnotherTuple(), but for
pqGetErrorNotice3() the same is even more true: it would be much nicer
if it could handle OOM itself, without propagating it to the caller. And
it well could do so. When it's processing an ERROR, it could just set
conn->errorMessage to "out of memory", and discard the error it received
from the server. When processing a NOTICE, it could likewise just send
"out of memory" to the NOTICE processsor, and discard the message from
the server. The real problem with pqGetErrorNotice3() today is that it
treats OOM the same as "no data available yet", and we can fix that by
reading but discarding the backend message. Like getAnotherTuple() does.

In short, the idea of returning a status code from parseInput(), instead
of just dealing with the error, was a bad one. Sorry I didn't have this
epiphany earlier...

I came up with the attached. It fixes the few cases where we were
currently returning "need more data" when OOM happened, to deal with the
OOM error instead, by setting conn->errorMessage. How does this look to you?

- Heikki

--
- Heikki

Attachments:

improve-OOM-handling-in-libpq-2.patchtext/x-diff; name=improve-OOM-handling-in-libpq-2.patchDownload+115-41
#13Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#12)
Re: PQexec() hangs on OOM

On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:

In short, the idea of returning a status code from parseInput(), instead of
just dealing with the error, was a bad one. Sorry I didn't have this
epiphany earlier...

I came up with the attached. It fixes the few cases where we were currently
returning "need more data" when OOM happened, to deal with the OOM error
instead, by setting conn->errorMessage. How does this look to you?

So this relies on the fact that when errorMessage is set subsequent
messages are ignored, right? This looks neat.

At the bottom of getAnotherTuple() in fe-protocol2.c if
PQmakeEmptyPGresult returns NULL, shouldn't the error message be
enforced to "out of memory"? It is an error code path, so an error
will be set, but perhaps the message is incorrect.

-       if (!res->errMsg)
-               goto failure;
+       if (res)
+       {
+               res->resultStatus = isError ? PGRES_FATAL_ERROR :
PGRES_NONFATAL_ERROR;
+               res->errMsg = pqResultStrdup(res, workBuf.data);
+       }
If res->errMsg is NULL, we may have a crash later on when calling
appendPQExpBufferStr on this field. I think that we should add an
additional check on it.
-- 
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#13)
Re: PQexec() hangs on OOM

On 07/04/2015 03:40 PM, Michael Paquier wrote:

On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:

In short, the idea of returning a status code from parseInput(), instead of
just dealing with the error, was a bad one. Sorry I didn't have this
epiphany earlier...

I came up with the attached. It fixes the few cases where we were currently
returning "need more data" when OOM happened, to deal with the OOM error
instead, by setting conn->errorMessage. How does this look to you?

So this relies on the fact that when errorMessage is set subsequent
messages are ignored, right? This looks neat.

At the bottom of getAnotherTuple() in fe-protocol2.c if
PQmakeEmptyPGresult returns NULL, shouldn't the error message be
enforced to "out of memory"? It is an error code path, so an error
will be set, but perhaps the message is incorrect.

-       if (!res->errMsg)
-               goto failure;
+       if (res)
+       {
+               res->resultStatus = isError ? PGRES_FATAL_ERROR :
PGRES_NONFATAL_ERROR;
+               res->errMsg = pqResultStrdup(res, workBuf.data);
+       }
If res->errMsg is NULL, we may have a crash later on when calling
appendPQExpBufferStr on this field. I think that we should add an
additional check on it.

Yeah, added.

I tested the various error paths this patch modifies with a debugger,
and found out that the getCopyStart changes were not doing much good.
The caller still waits for the COPY-IN result, so it still gets stuck.
So I left out that change.

The getParamDescriptions() changes were slightly broken. It didn't read
the whole input message with pqGetInt() etc., so pqParseInput3() threw
the "message contents do not agree with length in message type" error. I
started fixing that, by changing the error handling in that function to
be more like that in getRowDescriptions(), but then I realized that all
the EOF return cases in the pqParseInput3() subroutines are actually
dead code. pqParseInput3() always reads the whole message, before
passing it on to the right subroutine. That was documented for
getRowDescriptions() and getAnotherTuple(), but the rest of the
functions were more like the protocol version 2 code, prepared to deal
with incomplete messages. I think it would be good to refactor that,
removing the EOF return cases altogether. So I left out that change for
now as well.

That left me with the attached patch. It doesn't handle the
getParamDescription() case, nor the getCopyStart() case, but it's a
start. We don't necessarily need to fix everything in one go. Does this
look correct to you, as far as it goes?

- Heikki

Attachments:

improve-OOM-handling-in-libpq-3.patchtext/x-diff; name=improve-OOM-handling-in-libpq-3.patchDownload+83-31
#15Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#14)
Re: PQexec() hangs on OOM

On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:

The getParamDescriptions() changes were slightly broken. It didn't read the
whole input message with pqGetInt() etc., so pqParseInput3() threw the
"message contents do not agree with length in message type" error. I started
fixing that, by changing the error handling in that function to be more like
that in getRowDescriptions(), but then I realized that all the EOF return
cases in the pqParseInput3() subroutines are actually dead code.
pqParseInput3() always reads the whole message, before passing it on to the
right subroutine. That was documented for getRowDescriptions() and
getAnotherTuple(), but the rest of the functions were more like the protocol
version 2 code, prepared to deal with incomplete messages. I think it would
be good to refactor that, removing the EOF return cases altogether. So I
left out that change for now as well.

Yes, (the latter case is not actually used currently). Well, I don't
mind writing the additional patch to update . On top of that The
refactoring should be a master-only change, perhaps?

That left me with the attached patch. It doesn't handle the
getParamDescription() case, nor the getCopyStart() case, but it's a start.
We don't necessarily need to fix everything in one go. Does this look
correct to you, as far as it goes?

I have been carefully through the routines modified, doing some tests
at the same time and I haven't spotted an issue.
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#16Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#15)
Re: PQexec() hangs on OOM

On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:

The getParamDescriptions() changes were slightly broken. It didn't read the
whole input message with pqGetInt() etc., so pqParseInput3() threw the
"message contents do not agree with length in message type" error. I started
fixing that, by changing the error handling in that function to be more like
that in getRowDescriptions(), but then I realized that all the EOF return
cases in the pqParseInput3() subroutines are actually dead code.
pqParseInput3() always reads the whole message, before passing it on to the
right subroutine. That was documented for getRowDescriptions() and
getAnotherTuple(), but the rest of the functions were more like the protocol
version 2 code, prepared to deal with incomplete messages. I think it would
be good to refactor that, removing the EOF return cases altogether. So I
left out that change for now as well.

Yes, (the latter case is not actually used currently). Well, I don't
mind writing the additional patch to update . On top of that The
refactoring should be a master-only change, perhaps?

I pushed the send button too early.
I don't mind writing the additional patch for the other routines,
patch that should be backpatched, and the refactoring patch
(master-only?).
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#16)
Re: PQexec() hangs on OOM

On 07/07/2015 09:32 AM, Michael Paquier wrote:

On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:

The getParamDescriptions() changes were slightly broken. It didn't read the
whole input message with pqGetInt() etc., so pqParseInput3() threw the
"message contents do not agree with length in message type" error. I started
fixing that, by changing the error handling in that function to be more like
that in getRowDescriptions(), but then I realized that all the EOF return
cases in the pqParseInput3() subroutines are actually dead code.
pqParseInput3() always reads the whole message, before passing it on to the
right subroutine. That was documented for getRowDescriptions() and
getAnotherTuple(), but the rest of the functions were more like the protocol
version 2 code, prepared to deal with incomplete messages. I think it would
be good to refactor that, removing the EOF return cases altogether. So I
left out that change for now as well.

Yes, (the latter case is not actually used currently). Well, I don't
mind writing the additional patch to update . On top of that The
refactoring should be a master-only change, perhaps?

I pushed the send button too early.
I don't mind writing the additional patch for the other routines,
patch that should be backpatched, and the refactoring patch
(master-only?).

Ok, committed and backpatched the latest patch I posted. Yeah, we'll
need additional patches for the refactoring and the remaining issues.
I'm not sure if it's worth trying to backpatch them; let's see how big
the patch is.

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#18Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#17)
Re: PQexec() hangs on OOM

On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/07/2015 09:32 AM, Michael Paquier wrote:

On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:

The getParamDescriptions() changes were slightly broken. It didn't read
the
whole input message with pqGetInt() etc., so pqParseInput3() threw the
"message contents do not agree with length in message type" error. I
started
fixing that, by changing the error handling in that function to be more
like
that in getRowDescriptions(), but then I realized that all the EOF
return
cases in the pqParseInput3() subroutines are actually dead code.
pqParseInput3() always reads the whole message, before passing it on to
the
right subroutine. That was documented for getRowDescriptions() and
getAnotherTuple(), but the rest of the functions were more like the
protocol
version 2 code, prepared to deal with incomplete messages. I think it
would
be good to refactor that, removing the EOF return cases altogether. So I
left out that change for now as well.

Yes, (the latter case is not actually used currently). Well, I don't
mind writing the additional patch to update . On top of that The
refactoring should be a master-only change, perhaps?

I pushed the send button too early.
I don't mind writing the additional patch for the other routines,
patch that should be backpatched, and the refactoring patch
(master-only?).

Ok, committed and backpatched the latest patch I posted. Yeah, we'll need
additional patches for the refactoring and the remaining issues. I'm not
sure if it's worth trying to backpatch them; let's see how big the patch is.

So, here are two patches aimed at fixing the hangling issues with
getStartCopy and getParamDescriptions. After trying a couple of
approaches, I found out that the most elegant way to prevent the
infinite loop in parseInput is to introduce a new PGASYNC flag to
control the error and let the caller know what happened.

More refactoring could be done, and I think that the use of this new
ASYNC flag could be spread to other places as well if you think that's
a useful. For now I have focused only on fixing the existing problems
with fixes that are rather back-patchable.
--
Michael

Attachments:

0002-Prevent-hangling-of-libpq-for-BIND-message-results-o.patchtext/x-diff; charset=US-ASCII; name=0002-Prevent-hangling-of-libpq-for-BIND-message-results-o.patchDownload+50-12
0001-Prevent-COPY-start-from-hangling-in-libpq-on-OOM.patchtext/x-diff; charset=US-ASCII; name=0001-Prevent-COPY-start-from-hangling-in-libpq-on-OOM.patchDownload+72-17
#19Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#18)
Re: PQexec() hangs on OOM

On Thu, Jul 9, 2015 at 10:04 PM, Michael Paquier
<michael.paquier@gmail.com>wrote:

So, here are two patches aimed at fixing the hangling issues with
getStartCopy and getParamDescriptions. After trying a couple of
approaches, I found out that the most elegant way to prevent the
infinite loop in parseInput is to introduce a new PGASYNC flag to
control the error and let the caller know what happened.

More refactoring could be done, and I think that the use of this new
ASYNC flag could be spread to other places as well if you think that's
a useful. For now I have focused only on fixing the existing problems
with fixes that are rather back-patchable.

As those are actually new bug fixes, I am adding an entry in CF 2015-09.
--
Michael

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#18)
Re: PQexec() hangs on OOM

On Thu, Jul 9, 2015 at 6:34 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi>

wrote:

On 07/07/2015 09:32 AM, Michael Paquier wrote:

Ok, committed and backpatched the latest patch I posted. Yeah, we'll

need

additional patches for the refactoring and the remaining issues. I'm not
sure if it's worth trying to backpatch them; let's see how big the

patch is.

So, here are two patches aimed at fixing the hangling issues with
getStartCopy and getParamDescriptions. After trying a couple of
approaches, I found out that the most elegant way to prevent the
infinite loop in parseInput is to introduce a new PGASYNC flag to
control the error and let the caller know what happened.

One thing that looks slightly non-elegant about this approach is that
new async status (PGASYNC_FATAL) is used to deal with errors only
in few paths in function pqParseInput3() and not-other paths which should
be okay if there is no other better way. I have not spent enough time on
it to suggest any better alternative, but would like to hear what other
approaches you have considered?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#20)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#22)
#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#23)
#25Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#24)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#28)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#28)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#25)
#32Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#30)
#33Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#31)
#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#33)
#35Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#34)
#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#34)
#37Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#36)
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#38)
#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#40)
#42Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#42)
#44Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#43)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#41)
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#45)
#47Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#46)
#48Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#47)