OOM in libpq and infinite loop with getCopyStart()

Started by Michael Paquierabout 10 years ago28 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,

One year and a half ago Heikki has reported that libpq could run on an
infinite loop in a couple of code paths should an OOM happen in
PQExec():
/messages/by-id/547480DE.4040408@vmware.com
The code paths are:
- BIND messages for getRowDescriptions(), or 'T' messages, fixed per 7b96bf44
- Code paths handling some calls to PQmakeEmptyPGresult(), fixed per 414bef3
- Start message for COPY command, not fixed yet.
The two first problems have been addressed, not the last one. And as
the last thread became long and confusing here is a revival of the
topic on a new thread, with a fresh patch, and a fresh study of the
problem submitted for this CF because it would be nice to get things
fixed, or improved in some cases, but see below...

Here is a summary of what this patch does and improves in some cases:
1) COPY IN and COPY OUT, without the patch, those two remain stuck in
an infinite loop... With the patch an error "out of memory" is
reported to the caller.
I had a look as well at some community utilities, like pgloader and
pg_bulkload, though none of them are using getCopyStart(). Do other
people have ideas to things to look at? pgadmin for example?

Now in the case of COPY_BOTH. let's take a couple of examples:
2) pg_receivexlog:
In the case of pg_receivexlog, the patch allows libpq to throw
correctly an error back to the caller, then pg_receivexlog will try to
send START_REPLICATION again at its next loop:
pg_receivexlog: could not send replication command
"START_REPLICATION": out of memory
Without the patch pg_receivexlog would just ignore the error, and at
the next iteration of ParseInput() the message obtained from server
would be treated because the message cursor does not move on in this
case.
3) pg_recvlogical
Similarly pg_recvlogical fails with the following error with the patch attached:
pg_recvlogical: could not send replication command "START_REPLICATION
SLOT "toto" LOGICAL 0/0": out of memory
And after the failure it loops back to the next attempt, and is able
to perform its streaming stuff. Without the patch, similarly to
pg_receivexlog, COPY_BOTH would take the code path of ParseInput()
once again and it would treat the message, ignoring any OOM problems
on the way.
4) libpqwalreceiver: upon noticing the OOM error with the patch
attached, a standby would fail after sending START_REPLICATION, log
the OOM error properly and attempt to restart later a new WAL receiver
process to try again. Without this patch, the WAL receiver would still
fail to start, and log the following, unhelpful message:
2016-03-01 14:07:00 JST [84058]: [22-1] db=,user=,app=,client= LOG:
invalid record length at 0/3000970
That's not really informational for the user :( Note though that in
this case the WAL receiver does not remain in an infinite loop, and
that it is able to restart properly because it fails with this
"invalid record length error", and then start streaming at its next
attempt when the WAL receiver is started again.

So, that's how things are standing. I still see that it would be a
gain in terms of failure visibility to log properly this OOM failure
in all cases. Now it is true that if there are some applications that
may expect libpq to *not* return an error and treat the COPY start
message at the next loop of ParseInput(), though by looking at what is
in-core things can be handled properly.

Thoughts? I have registered that in the CF app, and a patch is attached.
--
Michael

Attachments:

0001-Fix-OOM-error-handling-in-COPY-protocol-of-libpq.patchtext/x-diff; charset=US-ASCII; name=0001-Fix-OOM-error-handling-in-COPY-protocol-of-libpq.patchDownload+69-16
#2Aleksander Alekseev
aleksander@timescale.com
In reply to: Michael Paquier (#1)
Re: OOM in libpq and infinite loop with getCopyStart()

Hello, Michael

I didn't checked your patch in detail yet but here is a thought I would
like to share.

In my experience usually it takes number of rewrites before patch will
be accepted. To make sure that after every rewrite your patch still
solves an issue you described you should probably provide a few test
programs too. If it's possible to make these programs part of regression
tests suite it would be just great.

Without such test programs I personally find it difficult to verify
that your patch fixes something. If such examples are provided your
patch would be much more likely to accepted. "See, this program crashes
everything. Now we apply a patch and everything works! Cool, heh?"

Best regards,
Aleksander

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael@paquier.xyz
In reply to: Aleksander Alekseev (#2)
Re: OOM in libpq and infinite loop with getCopyStart()

On Thu, Mar 3, 2016 at 10:18 PM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I didn't checked your patch in detail yet but here is a thought I would
like to share.

In my experience usually it takes number of rewrites before patch will
be accepted. To make sure that after every rewrite your patch still
solves an issue you described you should probably provide a few test
programs too. If it's possible to make these programs part of regression
tests suite it would be just great.

Yes that's usually the case, this patch is not at its first version.

Without such test programs I personally find it difficult to verify
that your patch fixes something. If such examples are provided your
patch would be much more likely to accepted. "See, this program crashes
everything. Now we apply a patch and everything works! Cool, heh?"

The easiest way to perform tests with this patch is to take a debugger
and enforce the malloc'd pointers to NULL in the code paths. That's
wild but efficient, and that's actually for we did for the other two
libpq patches treating those OOM failures. In the case of the standby
code path what I did was to put a sleep in the WAL receiver code and
then trigger the OOM manually, leading to the errors listed upthread.
Regarding automated tests, it is largely feasible to have tests on
Linux at least by using for example LD_PRELOAD or glibc-specific
malloc hooks, though those would be platform-restricted. Here is for
example one for COPY IN/OUT, that this patch also passed (there has
been already some discussion for this patch).
/messages/by-id/55FC501A.5000409@iki.fi
Applying it for recovery is doable as well by manipulating the
recovery protocol with libpq though it is aimed really at covering a
narrow case.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Aleksander Alekseev
aleksander@timescale.com
In reply to: Michael Paquier (#3)
Re: OOM in libpq and infinite loop with getCopyStart()

Hello, Michael

The easiest way to perform tests with this patch is to take a debugger
and enforce the malloc'd pointers to NULL in the code paths.

I see. Still I don't think it's an excuse to not provide clear steps to
reproduce an issue. As I see it anyone should be able to easily check
your patch locally without having deep understanding of how libpq is
implemented or reading thread which contains 48 e-mails.

For instance you cold provide a step-by-step instruction like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you cold automate these steps using gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2

... where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional. But it will simplify both
understanding and reviewing of this path. Thus chances that this patch
will be accepted will be increased and time required for this will be
reduced significantly.

Best regards,
Aleksander

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael@paquier.xyz
In reply to: Aleksander Alekseev (#4)
Re: OOM in libpq and infinite loop with getCopyStart()

On Fri, Mar 4, 2016 at 12:59 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

The easiest way to perform tests with this patch is to take a debugger
and enforce the malloc'd pointers to NULL in the code paths.

I see. Still I don't think it's an excuse to not provide clear steps to
reproduce an issue. As I see it anyone should be able to easily check
your patch locally without having deep understanding of how libpq is
implemented or reading thread which contains 48 e-mails.

OK, here they are if that helps. Following those steps and having some
knowledge of lldb or gdb is enough. The key point is that getCopyStart
is the routine to breakpoint and make fail.
A) psql and COPY.
1) gdb --args psql -c 'COPY (SELECT 1) TO stdout'
2) Then take a breakpoint at getCopyStart()
3) run
4) Enforce the return result of PQmakeEmptyPGresult to NULL:
p result = 0
5) Enjoy the infinite loop with HEAD, and the error with the patch

B) pg_receivexlog:
1) mkdir data
gdb --args pg_receivexlog --verbose -D data/
2) Take a breakpoint at getCopyStart
3) run
4) enforce result = 0 after the call to PQmakeEmptyPGresult
5) And enjoy what has been reported

C) pg_recvlogical, similar to B.
1) Create a replication slot on a server that accepts them
select pg_create_logical_replication_slot('foo', 'test_decoding');
2) Same breakpoint as B) with this start command or similar (depends
on where the logical slot has been created):
pg_recvlogical --start --slot foo -f - -d postgres

D) triggering standby problem:
1) Patch upstream, as follows by adding a sleep in
libpqrcv_startstreaming of libpqwalreceiver.c. This is a simple method
to have the time to take a breakpoint on the WAL receiver process that
has been started, with streaming that has not begun yet.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr
startpoint, char *slotname)
    char        cmd[256];
    PGresult   *res;

+ /* ZZZzzz... */
+ pg_usleep(10000L);
+
/* Start streaming from the point requested by startup process */
if (slotname != NULL)
snprintf(cmd, sizeof(cmd),
Trying to design a test taking into account the context of a WAL
receiver does not seem worth it to me with a simple method like this
one...
2) Start the standby and attach debugger on the WAL receiver process,
send SIGSTOP to it, whatever...
3) breakpoint at getCopyStart
4) Then move on with the same process as previously, and enjoy the errors.
--
Michael

Attachments:

sleep-wal-receiver.patchapplication/x-patch; name=sleep-wal-receiver.patchDownload+3-0
#6Aleksander Alekseev
aleksander@timescale.com
In reply to: Michael Paquier (#5)
Re: OOM in libpq and infinite loop with getCopyStart()

Hello, Michael

Thanks a lot for steps to reproduce you provided.

I tested your path on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD
10.2 RELEASE (Clang 3.4.1). In both cases patch applies cleanly, there
are no warnings during compilation and all regression tests pass. A few
files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c
b/src/interfaces/libpq/fe-exec.c index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)
            conn->status == CONNECTION_BAD)
            break;
        else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
-                 conn->asyncStatus == PGASYNC_COPY_OUT  ||
+                 conn->asyncStatus == PGASYNC_COPY_OUT ||
                  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
                 result->resultStatus == PGRES_FATAL_ERROR)
            break;
diff --git a/src/interfaces/libpq/fe-protocol3.c
b/src/interfaces/libpq/fe-protocol3.c index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static int    getParamDescriptions(PGconn *conn, int
msgLength); static int getAnotherTuple(PGconn *conn, int msgLength);
 static int getParameterStatus(PGconn *conn);
 static int getNotify(PGconn *conn);
-static int getCopyStart(PGconn *conn,
-                        ExecStatusType copytype,
-                        int msgLength);
+static int getCopyStart(PGconn *conn,
+            ExecStatusType copytype,
+            int msgLength);
 static int getReadyForQuery(PGconn *conn);
 static void reportErrorPosition(PQExpBuffer msg, const char *query,
                    int loc, int encoding);
```

Indeed your patch solves issues you described. Still here is something
that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/

(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose
-D ./temp_data/ [Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1". pg_receivexlog: starting log
streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398 const char
*errmsg = NULL; (gdb) n
1400 result = PQmakeEmptyPGresult(conn, copytype);
(gdb)
1401 if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398 const char
*errmsg = NULL; (gdb) n
1400 result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401 if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398 const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Best regards,
Aleksander

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Aleksander Alekseev (#6)
Re: OOM in libpq and infinite loop with getCopyStart()

Aleksander Alekseev wrote:

pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398 const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Hmm. It would be useful to retry in the case that there is a chance
that the program releases memory and can continue later. But if it will
only stay there doing nothing other than retrying, then that obviously
will not happen. One situation where this might help is if the overall
*system* is short on memory and we expect that situation to resolve
itself after a while -- after all, if the system is so loaded that it
can't allocate a few more bytes for the COPY message, then odds are that
other things are also crashing and eventually enough memory will be
released that pg_receivexlog can continue.

On the other hand, if the system is so loaded, perhaps it's better to
"let it crash" and have it restart later -- presumably once the admin
notices the problem and restarts it manually after cleaning up the mess.

If all programs are well behaved and nothing crashes when OOM but they
all retry instead, then everything will continue to retry infinitely and
make no progress. That cannot be good.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#7)
Re: OOM in libpq and infinite loop with getCopyStart()

On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Aleksander Alekseev wrote:

pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398 const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Hmm. It would be useful to retry in the case that there is a chance
that the program releases memory and can continue later. But if it will
only stay there doing nothing other than retrying, then that obviously
will not happen. One situation where this might help is if the overall
*system* is short on memory and we expect that situation to resolve
itself after a while -- after all, if the system is so loaded that it
can't allocate a few more bytes for the COPY message, then odds are that
other things are also crashing and eventually enough memory will be
released that pg_receivexlog can continue.

Yep, that's my assumption regarding that, at some point the system may
succeed, and I don't think that we should break the current behaviors
of pg_receivexlog and pg_recvlogical regarding that in the
back-branches. Now, note that without the patch we actually have the
same problem. Say if OOMs happen continuously in getCopyStart, with
COPY_BOTH libpq would attempt to read the next message continuously
and would keep failing. Except that in this case the caller has no
idea what is happening as things keep running in libpq itself.

On the other hand, if the system is so loaded, perhaps it's better to
"let it crash" and have it restart later -- presumably once the admin
notices the problem and restarts it manually after cleaning up the mess.

If all programs are well behaved and nothing crashes when OOM but they
all retry instead, then everything will continue to retry infinitely and
make no progress. That cannot be good.

That's something we could take care of in those client utilities I
think with a new option like --maximum-retries or similar, but anyway
I think that's a different discussion. The patch I am proposing here
allows a client application to be made aware of OOM errors that
happen. If we don't do something about that first, something like
--maximum-retries would be useless for COPY_BOTH as the client will
never be made aware of the OOM that happened in libpq and would keep
looping inside libpq itself until some memory is freed.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#1)
Re: OOM in libpq and infinite loop with getCopyStart()

On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Thoughts? I have registered that in the CF app, and a patch is attached.

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
         if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
             PQresultStatus(lastResult) == PGRES_COPY_OUT ||
             PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
             PQstatus(streamConn) == CONNECTION_BAD)
             break;
     }

I mean, why would it be a good idea to blindly skip over fatal errors?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#9)
Re: OOM in libpq and infinite loop with getCopyStart()

On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Thoughts? I have registered that in the CF app, and a patch is attached.

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
PQresultStatus(lastResult) == PGRES_COPY_OUT ||
PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
PQstatus(streamConn) == CONNECTION_BAD)
break;
}

I mean, why would it be a good idea to blindly skip over fatal errors?

I think it is not about skipping the FATAL error, rather to stop trying to
get further results on FATAL error. This is to ensure that OOM error is
reported rather that ignored. There has been discussion about this
previously as well [1]/messages/by-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w@mail.gmail.com.

[1]: /messages/by-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w@mail.gmail.com
/messages/by-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#10)
Re: OOM in libpq and infinite loop with getCopyStart()

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
PQresultStatus(lastResult) == PGRES_COPY_OUT ||
PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
PQstatus(streamConn) == CONNECTION_BAD)
break;

I mean, why would it be a good idea to blindly skip over fatal errors?

I think it is not about skipping the FATAL error, rather to stop trying to
get further results on FATAL error.

If the code already includes "lost the connection" as a case to break on,
I'm not quite sure why "got a query error" is not. Maybe that PQstatus
check is broken too, but it doesn't seem like this patch makes it more so.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#11)
Re: OOM in libpq and infinite loop with getCopyStart()

On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
PQresultStatus(lastResult) == PGRES_COPY_OUT ||
PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
PQstatus(streamConn) == CONNECTION_BAD)
break;

I mean, why would it be a good idea to blindly skip over fatal errors?

I think it is not about skipping the FATAL error, rather to stop trying

to

get further results on FATAL error.

If the code already includes "lost the connection" as a case to break on,
I'm not quite sure why "got a query error" is not.

This error check is exactly same as PQexecFinish() and there some
explanation is given in comments which hints towards the reason for
continuing on error, basically both the functions PQexecFinish()
and libpqrcv_PQexec() returns the last result if there are many
and PQexecFinish() concatenates the errors as well in some cases. Do we
see any use in continuing to get result after getting PGRES_FATAL_ERROR
error?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#13Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#12)
Re: OOM in libpq and infinite loop with getCopyStart()

On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
PQresultStatus(lastResult) == PGRES_COPY_OUT ||
PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
PQstatus(streamConn) == CONNECTION_BAD)
break;

I mean, why would it be a good idea to blindly skip over fatal errors?

I think it is not about skipping the FATAL error, rather to stop trying
to
get further results on FATAL error.

If the code already includes "lost the connection" as a case to break on,
I'm not quite sure why "got a query error" is not.

This error check is exactly same as PQexecFinish() and there some
explanation is given in comments which hints towards the reason for
continuing on error, basically both the functions PQexecFinish() and
libpqrcv_PQexec() returns the last result if there are many and
PQexecFinish() concatenates the errors as well in some cases. Do we see any
use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a couple of hours looking at that, and I still fail to
see how it would be an advantage to stack errors. We reduce the error
message visibility for frontend clients, and this does not prevent
clients to actually see error messages generated by a backend, take a
WAL sender. And actually depending on the message type received we may
end up by simply ignoring those error messages and have libpq discard
them. So it seems like an oversight of 03a571a4.

One could always say that this change breaks the case where multiple
error messages are sent in a row from a client with COPY_* (IN, OUT
and BOTH), because we'd get back the last error message, while with
this change we let client know the first one, though that would mean
that client is missing something when using COPY_BOTH.

Also...
@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)
            result->resultStatus == PGRES_COPY_BOTH ||
            conn->status == CONNECTION_BAD)
            break;
+       else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+                 conn->asyncStatus == PGASYNC_COPY_OUT  ||
+                 conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+                result->resultStatus == PGRES_FATAL_ERROR)
+           break;
The reason behind this check in PQexecFinish is that we need to make
the difference between an error where there is not enough data, which
means that we want to retry again the message fetching through
parseInput3(), and the case where an actual OOM happened, where we
want to exit immediately and let the caller know that there has been a
frontend error. Now the other reason why we went on with
PGRES_FATAL_ERROR here is that it was discussed that it was an
overkill to assign a special status to PGASYNC to detect a
frontend-side failure, because we already treat other frontend-side
errors with PGRES_FATAL_ERROR and assign an error message to them (see
for example when an allocation fails for 'C', introduced in another
patch of the OOM failure series), and we still need to know that we
are in PGASYNC_COPY_* mode in this code path as well because it means
that the start message has been processed, but we had a failure in
deparsing it so we bypassed it, and need to fail immediately. So using
PGREC_FATAL_ERROR simplifies the code paths creating an error stack.

Hope this brings some light in.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14David Steele
david@pgmasters.net
In reply to: Michael Paquier (#13)
Re: OOM in libpq and infinite loop with getCopyStart()

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email?
I've marked this patch "waiting on author" in the meantime.

Thanks,
--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#14)
Re: OOM in libpq and infinite loop with getCopyStart()

On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote:

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email? I've
marked this patch "waiting on author" in the meantime.

Shouldn't it be "needs review" instead? I am marked as the author of this patch.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16David Steele
david@pgmasters.net
In reply to: Michael Paquier (#15)
Re: OOM in libpq and infinite loop with getCopyStart()

On 3/29/16 8:21 PM, Michael Paquier wrote:

On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote:

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email? I've
marked this patch "waiting on author" in the meantime.

Shouldn't it be "needs review" instead? I am marked as the author of this patch.

Whoops! I got the author and reviewer backwards. Set back to "needs
review".

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: David Steele (#14)
Re: OOM in libpq and infinite loop with getCopyStart()

On Tue, Mar 29, 2016 at 8:31 PM, David Steele <david@pgmasters.net> wrote:

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email? I've
marked this patch "waiting on author" in the meantime.

I think this patch needs to be looked upon by committer now. I have done
review and added some code in this patch as well long back, just see the
e-mail [1]/messages/by-id/CAB7nPqTNrw8LR1HD7ZLnOJjC1BP1Evw_EMAdGOrV+s-sBCrKtw@mail.gmail.com, patch is just same as it was in October 2015. I think myself
and Michael are in agreement that this patch solves the reported problem.
There is one similar problem [2]/messages/by-id/566EF84F.1030206@iki.fi reported by Heikki which I think can be
fixed separately.

I think the main reason of moving this thread to hackers from bugs is to
gain some more traction which I see that it achieves its purpose to some
extent, but I find that more or less we are at same situation as we were
back in October.

Let me know if you think anything more from myside can help in moving patch.

[1]: /messages/by-id/CAB7nPqTNrw8LR1HD7ZLnOJjC1BP1Evw_EMAdGOrV+s-sBCrKtw@mail.gmail.com
/messages/by-id/CAB7nPqTNrw8LR1HD7ZLnOJjC1BP1Evw_EMAdGOrV+s-sBCrKtw@mail.gmail.com
[2]: /messages/by-id/566EF84F.1030206@iki.fi

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#18Aleksander Alekseev
aleksander@timescale.com
In reply to: Amit Kapila (#17)
Re: OOM in libpq and infinite loop with getCopyStart()

I think this patch needs to be looked upon by committer now. I have
done review and added some code in this patch as well long back, just
see the e-mail [1], patch is just same as it was in October 2015. I
think myself and Michael are in agreement that this patch solves the
reported problem. There is one similar problem [2] reported by Heikki
which I think can be fixed separately.
[...]
Let me know if you think anything more from myside can help in moving
patch.

+1, same here. Lets see whats committer's opinion.

--
Best regards,
Aleksander Alekseev
http://eax.me/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aleksander Alekseev (#18)
Re: OOM in libpq and infinite loop with getCopyStart()

Aleksander Alekseev <a.alekseev@postgrespro.ru> writes:

+1, same here. Lets see whats committer's opinion.

I fooled around with this patch quite a bit but could not bring myself
to pull the trigger, because I think it fundamentally breaks applications
that follow the "repeat PQgetResult until NULL" rule. The only reason
that psql manages not to fail is that it doesn't do that, it just calls
PQexec; and you've hacked PQexecFinish so that it falls out without
pumping PQgetResult till dry. But that's not a good solution, it's just
a hack that makes the behavior unprincipled and unlike direct use of
PQgetResult. The key problem is that, assuming that the changes in
getCopyStart() succeed in returning a PGRES_FATAL_ERROR PGresult, the
application may just follow the rule of doing nothing with it unless it's
the last non-null result from PQgetResult. And it won't be, because
you've switched libpq's asyncStatus into one or another COPY status, which
will cause PQgetResult to continually create and return PGRES_COPY_XXX
results, which is what it's supposed to do in that situation (cf the last
step in getCopyResult).

Now, this will accidentally fail to fail if PQgetResult's attempt to
manufacture a PGRES_COPY_XXX result fails and returns null, which is
certainly possible if we're up against an OOM situation. But what if
it doesn't fail --- which is also possible, because a PGRES_COPY_XXX
with no attached data will not need as much space as a PGRES_FATAL_ERROR
with attached error message. The app probably throws away the
PGRES_FATAL_ERROR and tries to use the PGRES_COPY_XXX result, which
is almost okay, except that it will lack the copy format information
which will be fatal if the application is relying on that.

So AFAICT, this patch kinda sorta works for psql but it is not going
to make things better for other apps.

The other problem is that I don't have a lot of faith in the theory
that getCopyStart is going to be able to make an error PGresult when
it couldn't make a COPY PGresult. The existing message-receipt routines
that this is modeled on are dealing with PGresults that are expected to
grow large, so throwing away the accumulated PGresult is highly likely
to free enough memory to let us allocate an error PGresult. Not so
here.

I have to run, but the bottom line is I don't feel comfortable with
this. It's trying to fix what's a very corner-case problem (since
COPY PGresults aren't large) but it's introducing new corner case
behaviors of its own.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#19)
Re: OOM in libpq and infinite loop with getCopyStart()

I thought about this patch a bit more...

When I first looked at the patch, I didn't believe that it worked at
all: it failed to return a PGRES_COPY_XXX result to the application,
and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
Wouldn't things be hopelessly confused? I tried it out and saw that
indeed it seemed to work in psql, and after tracing through that found
that psql has no idea what's going on, but when psql issues its next
command PQexecStart manages to get us out of the copy state (barring
more OOM failures, anyway). That seems a bit accidental, though,
and for sure it wasn't documented in the patch. In any case, having
libpq in that state would have side-effects on how it responds to
application actions, which is the core of my complaint about
PQgetResult behavior. Those side effects only make sense if the app
knows (or should have known) that the connection is in COPY state.

I think it might be possible to get saner behavior if we invent a
set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
semantics that we got the relevant CopyStart message from the backend
but were unable to report it to the application. PQexecStart would
treat these the same as the corresponding normal PGASYNC_COPY_XXX
states and do what's needful to get out of the copy mode. But in
PQgetResult, we would *not* treat those states as a reason to return a
PGRES_COPY_XXX result to the application. Probably we'd just return
NULL, with the implication being "we're done so far as you're concerned,
please give a new command". I might be underthinking it though.

Anyway, the short of my review is that we need more clarity of thought
about what state libpq is in after a failure like this, and what that
state looks like to the application, and how that should affect how
libpq reacts to application calls.

I'll set this back to Waiting on Author ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#20)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#22)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#23)
#26Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#27)