BUG #5804: Connection aborted after many queries.

Started by Paul Davisover 15 years ago7 messagesbugs

paul.joseph.davis@gmail.com

over 15 years ago

The following bug has been logged online:

Bug reference: 5804
Logged by: Paul J. Davis
Email address: paul.joseph.davis@gmail.com
PostgreSQL version: 9.0.2
Operating system: OS X 10.6.5, Ubuntu 10.04
Description: Connection aborted after many queries.
Details:

After running many queries (millions) a connection will report an error that
the server has unexpectedly closed the connection. I first noticed this
through psycopg2, but I've been able to reproduce it with a small C program
using only libpq which I've included below. I compiled this against a libpq
built by Homebrew (after upgrading the formula to use a 9.0.2 tarball) on OS
X 10.6.5. The server was installed from 9.0.2 package available from
https://launchpad.net/~pitti/+archive/postgresql

My next step is to try building libpq with --enable-cassert to see if that
triggers anything client side. Let me know if there's something else I
should be doing to debug this.

This test has been bailing between 2.6 and 2.7M queries:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

static void
fail(PGconn* conn, PGresult* res)
{
if(res != NULL) PQclear(res);
PQfinish(conn);
exit(1);
}

static void
check(PGconn* conn, PGresult* res, const char* fmt)
{
ExecStatusType status = PQresultStatus(res);

if(status != PGRES_COMMAND_OK && status != PGRES_TUPLES_OK)
{
fprintf(stderr, fmt, PQerrorMessage(conn));
fail(conn, res);
}
}

void
run_query(PGconn* conn, PGresult* res)
{
int nFields, i, j;

res = PQexec(conn, "DECLARE myportal CURSOR FOR select 1");
check(conn, res, "DECLARE CURSOR failed: %s");
PQclear(res);

res = PQexec(conn, "FETCH ALL in myportal");
check(conn, res, "FETCH ALL failed: %s");

nFields = PQnfields(res);
for(i = 0; i < PQntuples(res); i++)
{
for(j = 0; j < nFields; j++)
{
PQgetvalue(res, i, j);
}
}

PQclear(res);

res = PQexec(conn, "CLOSE myportal");
check(conn, res, "CLOSE failed: %s");
PQclear(res);
}

int
main(int argc, char **argv)
{
PGconn* conn;
PGresult* res;
int i;

if(argc != 2)
{
fprintf(stderr, "usage: %s DSN\n", argv[0]);
exit(1);
}

conn = PQconnectdb(argv[1]);

if(PQstatus(conn) != CONNECTION_OK)
{
fprintf(stderr, "Connection failed: %s", PQerrorMessage(conn));
fail(conn, NULL);
}

res = PQexec(conn, "BEGIN");
check(conn, res, "BEGIN failed: %s");
PQclear(res);

for(i = 0; i < 10000000; i++)
{
if((i+1) % 100000 == 0)
{
fprintf(stderr, "I: %d\n", i);
}
run_query(conn, res);
}

res = PQexec(conn, "END");
check(conn, res, "END failed: %s");
PQclear(res);

PQfinish(conn);

return 0;
}

Paul Davis

paul.joseph.davis@gmail.com

over 15 years ago

In reply to: Paul Davis (#1)

Re: BUG #5804: Connection aborted after many queries.

On Wed, Dec 29, 2010 at 10:30 AM, Paul J. Davis
<paul.joseph.davis@gmail.com> wrote:

The following bug has been logged online:

Bug reference: 5804
Logged by: Paul J. Davis
Email address: paul.joseph.davis@gmail.com
PostgreSQL version: 9.0.2
Operating system: OS X 10.6.5, Ubuntu 10.04
Description: Connection aborted after many queries.
Details:

After running many queries (millions) a connection will report an error that
the server has unexpectedly closed the connection. I first noticed this
through psycopg2, but I've been able to reproduce it with a small C program
using only libpq which I've included below. I compiled this against a libpq
built by Homebrew (after upgrading the formula to use a 9.0.2 tarball) on OS
X 10.6.5. The server was installed from 9.0.2 package available from
https://launchpad.net/~pitti/+archive/postgresql

My next step is to try building libpq with --enable-cassert to see if that
triggers anything client side. Let me know if there's something else I
should be doing to debug this.

This test has been bailing between 2.6 and 2.7M queries:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

static void
fail(PGconn* conn, PGresult* res)
{
if(res != NULL) PQclear(res);
PQfinish(conn);
exit(1);
}

static void
check(PGconn* conn, PGresult* res, const char* fmt)
{
ExecStatusType status = PQresultStatus(res);

if(status != PGRES_COMMAND_OK && status != PGRES_TUPLES_OK)
{
fprintf(stderr, fmt, PQerrorMessage(conn));
fail(conn, res);
}
}

void
run_query(PGconn* conn, PGresult* res)
{
int nFields, i, j;

res = PQexec(conn, "DECLARE myportal CURSOR FOR select 1");
check(conn, res, "DECLARE CURSOR failed: %s");
PQclear(res);

res = PQexec(conn, "FETCH ALL in myportal");
check(conn, res, "FETCH ALL failed: %s");

nFields = PQnfields(res);
for(i = 0; i < PQntuples(res); i++)
{
for(j = 0; j < nFields; j++)
{
PQgetvalue(res, i, j);
}
}

PQclear(res);

res = PQexec(conn, "CLOSE myportal");
check(conn, res, "CLOSE failed: %s");
PQclear(res);
}

int
main(int argc, char **argv)
{
PGconn* conn;
PGresult* res;
int i;

if(argc != 2)
{
fprintf(stderr, "usage: %s DSN\n", argv[0]);
exit(1);
}

conn = PQconnectdb(argv[1]);

if(PQstatus(conn) != CONNECTION_OK)
{
fprintf(stderr, "Connection failed: %s", PQerrorMessage(conn));
fail(conn, NULL);
}

res = PQexec(conn, "BEGIN");
check(conn, res, "BEGIN failed: %s");
PQclear(res);

for(i = 0; i < 10000000; i++)
{
if((i+1) % 100000 == 0)
{
fprintf(stderr, "I: %d\n", i);
}
run_query(conn, res);
}

res = PQexec(conn, "END");
check(conn, res, "END failed: %s");
PQclear(res);

PQfinish(conn);

return 0;
}

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

I should've mentioned the various version combinations I tried this with.

Originally the client was 8.2 ish on OS X 10.5.something (it was an
old MacBook I don't have anymore) against the 8.2 server package in
Ubuntu 9.04. The python scripts where I noticed this issue would run
fine against that combination. After upgrading my MacBook to a Mac
Pro, I ended up installing Postgres 9.0.1 on the client (and building
psycopg2 against that) which is when I started getting errors. The
original error in the 9.0.1 client against the older server was that
libpq would get stuck on a poll() call down when trying to fetch
tuples or execute a command.

After a bit of narrowing down what was the cause I ended up trying to
upgrade the server to see if it was just a weird interplay between
9.0.1 and the older server. After upgrading to Ubuntu 10.04 and
installing Postgres 8.4 (from apt) the error turned into the current
manifestation in that libpq would give an error saying that the server
had unexpectedly closed the connection (instead of blocking on the
poll() call).

At some point I upgraded my client install to 9.0.2 and started a
server locally. Running the test program against a local database
failed to trigger the bug. I then tried to downgrade my local client
to 8.4 and tested that against the 8.4 install on Ubuntu which showed
the bug. And finally I upgraded both the server and the client to
9.0.2 and I can trigger the bug.

Thanks,
Paul Davis

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 15 years ago

In reply to: Paul Davis (#1)

Re: BUG #5804: Connection aborted after many queries.

"Paul J. Davis" <paul.joseph.davis@gmail.com> wrote:

After running many queries (millions) a connection will report an
error that the server has unexpectedly closed the connection.

What message are you getting? (Copy/paste is a good thing.)

What do you see in the server log at the time of failure?

Are you using a connection pool?

-Kevin

Paul Davis

paul.joseph.davis@gmail.com

over 15 years ago

In reply to: Kevin Grittner (#3)

Re: BUG #5804: Connection aborted after many queries.

On Wed, Dec 29, 2010 at 10:58 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

"Paul J. Davis" <paul.joseph.davis@gmail.com> wrote:

After running many queries (millions) a connection will report an
error that the server has unexpectedly closed the connection.

What message are you getting? (Copy/paste is a good thing.)

What do you see in the server log at the time of failure?

Are you using a connection pool?

-Kevin

Doh, that was on my list of things to add but managed to forget.

The error message reported by the client:

DECLARE CURSOR failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

And this intriguing error in the server logs from around that time:

2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL failed to send renegotiation request
2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not send data to client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not receive data from client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: unexpected EOF on client connection

Googling, I see something that suggests turning off SSL renegotiation
which I'll try next.

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Paul Davis (#4)

Re: BUG #5804: Connection aborted after many queries.

Paul Davis <paul.joseph.davis@gmail.com> writes:

And this intriguing error in the server logs from around that time:

2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL failed to send renegotiation request
2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not send data to client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not receive data from client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: unexpected EOF on client connection

Googling, I see something that suggests turning off SSL renegotiation
which I'll try next.

In all cases, you were testing a client against a server on a different
machine, right? This looks to me like you've got two different openssl
libraries, one of which has a bogus partial fix for the recent SSL
renegotiation security issue. I'm not sure what the state of play is
in Apple's shipping version of openssl --- you might have to get an
up-to-date source distribution and compile it yourself to have non-bogus
renegotiation behavior. Or you could just disable renegotiation on the
PG server.

regards, tom lane

Paul Davis

paul.joseph.davis@gmail.com

over 15 years ago

In reply to: Tom Lane (#5)

Re: BUG #5804: Connection aborted after many queries.

On Wed, Dec 29, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Paul Davis <paul.joseph.davis@gmail.com> writes:

And this intriguing error in the server logs from around that time:

2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL failed to send renegotiation request
2010-12-28 18:40:02 EST LOG: SSL renegotiation failure
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not send data to client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: SSL error: unsafe legacy renegotiation disabled
2010-12-28 18:40:02 EST LOG: could not receive data from client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG: unexpected EOF on client connection

Googling, I see something that suggests turning off SSL renegotiation
which I'll try next.

In all cases, you were testing a client against a server on a different
machine, right? This looks to me like you've got two different openssl
libraries, one of which has a bogus partial fix for the recent SSL
renegotiation security issue. I'm not sure what the state of play is
in Apple's shipping version of openssl --- you might have to get an
up-to-date source distribution and compile it yourself to have non-bogus
renegotiation behavior. Or you could just disable renegotiation on the
PG server.

regards, tom lane

Yeah, all failures were between separate machines with various
versions of OpenSSL that I never thought to keep track of. After more
Googling I've found that OS X "fixed" the renegotiation issue by
disabling it in a security fix [1]http://support.apple.com/kb/HT4004.
For the time being I'll just disable it server side as traffic isn't
ever routed across a public network.

Thanks for the help.

Paul Davis

[1]: http://support.apple.com/kb/HT4004

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Paul Davis (#6)

Re: BUG #5804: Connection aborted after many queries.

Paul Davis <paul.joseph.davis@gmail.com> writes:

On Wed, Dec 29, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In all cases, you were testing a client against a server on a different
machine, right?

Yeah, all failures were between separate machines with various
versions of OpenSSL that I never thought to keep track of. After more
Googling I've found that OS X "fixed" the renegotiation issue by
disabling it in a security fix [1].

Yeah, I can reproduce the failure pretty quickly on my Mac laptop,
if I use SSL and reduce the ssl_renegotiation_limit setting to
100MB or so. The server's log looks a bit different from what you
showed:

LOG: SSL renegotiation failure
LOG: SSL renegotiation failure
LOG: SSL error: internal error
LOG: could not send data to client: Connection reset by peer
LOG: SSL error: internal error
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection

which confirms my thought that you were working with two different
openssl libraries, one of which was willing to do renegotiation
and the other not.

This machine is fully up-to-date, so Apple still hasn't fixed the
renegotiation problem beyond the brain-dead "fix" of aborting the
connection :-(. Get on the stick, Cupertino ... everybody else
had this fixed six months ago.

regards, tom lane