BUG #5585: SSL problems with long COPYs
The following bug has been logged online:
Bug reference: 5585
Logged by: Karl Denninger
Email address: karl@denninger.net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:
This is a copy of a message I posted this evening on the SLONY list.
Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation. The table itself is some ~25gb
in size, and a dump and restore of the table onto a different machine (or
the same machine) can be completed without incident (e.g. the data is known
to not be corrupt.)
However, during a COPY performed by SLONY 2.0.4, the system says it lost the
connection. It in fact did not - the copy is running on the server,
although the output is closed (thus it is dumping into a closed file
descriptor)
Perusal of the server logs discloses that the SSL module is apparently
internally faulting, although the actual error message is less than useful
in trying to track down the cause.
Here is the copy of the email I just sent to the SLONY list.
Note that the copy, while not yet complete, is proceeding normally (and
about 20% in - a good 10x further than it ever got with SSL enabled) - with
SSL turned off. It is far enough along at this point that I am reasonably
confident that with SSL off it will complete as expected.
Karl Denninger wrote:
I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a
serious problem.The reason for the "gotta do it now" was that somehow one of the tables
got out of sync, and a delete was failing to propagate - hanging the
process.OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched. So I
upgraded to 2.0.4 on all the nodes, and told the subscriber to reload -
ditched the client config and re-subscribed the sets.All went well until a very large table came up - it failed.
There's no error in the logs indicating why, other than the following:
Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG
remoteWorkerThread_3:
copy table "public"."images"
Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG
remoteWorkerThread_3:
Begin COPY of table "public"."images"
Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR
remoteWorkerThread_3:
PGgetCopyData() server closed the connection unexpectedly
Jul 31 22:54:24 dbms TICKER[70295]: [155-2] This probably means the
server terminated abnormally
Jul 31 22:54:24 dbms TICKER[70295]: [155-3] before or while
processing the request.
Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN
remoteWorkerThread_3:
data copy for set 1 failed 1 times - sleep 15 seconds
And in 15 seconds, the entire process of trying to re-init the node
starts over - from the beginning!Near as I can tell, it's failing pretty early on.
The source host is fine. This particular table contains a BYTEA field,
and it's BIG. ~20ish gigs big. But I've re-initialized in the past
without problems. I tried going back to 2.0.2, and that still fails.
Both servers are running with encoding set to SQL_ASCII, if it matters.When it fails the SERVER's COPY is still running - so the client is
definitely wrong on the reported error. I have NOTHING in the server's
SLON log and there are no comms problems between the two hosts.I'm going to run a dump of the table and see if I can manually bring it
over to the other host and load it. There's nothing going on with the
master that implicates the data being damaged......Ideas?
-- Karl
Oh boy, I think I know what's going on....
This looks like a problem in the SSL code (!!!!!)
I shut off SSL and I'm now ~4x further than where the copy has
previously failed. For obvious reasons this is decidedly un-good - a
perusal of the server's log (not slony's, but the server's) disclosed
several SSL errors in the log, which would account for the problem and
what the client thought were disconnects - but weren't.
Specifically, I got a bunch of these...
Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG: SSL error:
internal error
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG: SSL error:
internal error
This isn't SLONY's issue, but it's definitely a problem. I'll report it
over on the Postgres list in the morning...
In the meantime, I think I can get the copy to go with SSL off, and then
turn it back on once the copy is complete.
On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote:
The following bug has been logged online:
Bug reference: 5585
Logged by: Karl Denninger
Email address: karl@denninger.net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:This is a copy of a message I posted this evening on the SLONY list.
Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation.
My guess is its due to the server or client disabling ssl
renegotiation, per the docs:
ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.
Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.
Id try setting that to 0 in your postgresql.conf and see if it still fails.
Alex Hunsaker wrote:
On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote:
The following bug has been logged online:
Bug reference: 5585
Logged by: Karl Denninger
Email address: karl@denninger.net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:This is a copy of a message I posted this evening on the SLONY list.
Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation.My guess is its due to the server or client disabling ssl
renegotiation, per the docs:ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.Id try setting that to 0 in your postgresql.conf and see if it still fails.
I will attempt this but it is at least somewhat unlikely to be the
cause, as prior to the failure two tables of over 1GB each did correctly
transfer. They did not, however, have any binary (bytea) fields in them.
-- Karl
On 08/01/2010 08:33 AM, Karl Denninger wrote:
Alex Hunsaker wrote:
On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net> wrote:
The following bug has been logged online:
Bug reference: 5585
Logged by: Karl Denninger
Email address:karl@denninger.net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:This is a copy of a message I posted this evening on the SLONY list.
Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation.My guess is its due to the server or client disabling ssl
renegotiation, per the docs:ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.Id try setting that to 0 in your postgresql.conf and see if it still fails.
I will attempt this but it is at least somewhat unlikely to be the
cause, as prior to the failure two tables of over 1GB each did correctly
transfer. They did not, however, have any binary (bytea) fields in them.
how exactly did you measure the 1GB? If that's the on-disk size of the
table (maybe even including indexes) it is entirely believable that the
amount of data transfered through COPY on the SSL-session was much less
than 512MB.
Given the symthoms reported I would agree with Alex on suspecting a
broken openssl library.
Stefan
Stefan Kaltenbrunner wrote:
On 08/01/2010 08:33 AM, Karl Denninger wrote:
Alex Hunsaker wrote:
On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net>
wrote:The following bug has been logged online:
Bug reference: 5585
Logged by: Karl Denninger
Email address:karl@denninger.net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:This is a copy of a message I posted this evening on the SLONY list.
Synopsis: With SSL ON a large table copy containing a BYTEA field
fails
repeatedly a few minutes into the operation.My guess is its due to the server or client disabling ssl
renegotiation, per the docs:ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.Id try setting that to 0 in your postgresql.conf and see if it still
fails.I will attempt this but it is at least somewhat unlikely to be the
cause, as prior to the failure two tables of over 1GB each did correctly
transfer. They did not, however, have any binary (bytea) fields in
them.how exactly did you measure the 1GB? If that's the on-disk size of the
table (maybe even including indexes) it is entirely believable that
the amount of data transfered through COPY on the SSL-session was much
less than 512MB.
Given the symthoms reported I would agree with Alex on suspecting a
broken openssl library.
The reported copy table size in the SLON log. It exceeded 1GB for two
of the tables the successfully came over before the error.
-- Karl
Karl Denninger <karl@denninger.net> writes:
Stefan Kaltenbrunner wrote:
how exactly did you measure the 1GB?
The reported copy table size in the SLON log. It exceeded 1GB for two
of the tables the successfully came over before the error.
Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful. I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation. You might
check whether libssl was recently updated on either the server or client
machine.
regards, tom lane
Tom Lane wrote:
Karl Denninger <karl@denninger.net> writes:
Stefan Kaltenbrunner wrote:
how exactly did you measure the 1GB?
The reported copy table size in the SLON log. It exceeded 1GB for two
of the tables the successfully came over before the error.Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful. I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation. You might
check whether libssl was recently updated on either the server or client
machine.regards, tom lane
I've turned off SSL for now and the copy appears to be proceeding
normally. I'll see if I can isolate this further once I have the
replication stabilized again - I have a test box I can set up as a third
node off one of the others on a higher-bandwidth connection, which will
make testing this in some more detail easier and faster.
I've also shut off SSL renegotiation for now and will leave it off until
I can figure out what's up - if it breaks during my testing with
renegotiation off (or not) I'll update here.
-- Karl
Tom Lane wrote:
Karl Denninger <karl@denninger.net> writes:
Stefan Kaltenbrunner wrote:
how exactly did you measure the 1GB.
The reported copy table size in the SLON log. It exceeded 1GB for two
of the tables the successfully came over before the error.Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful. I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation. You might
check whether libssl was recently updated on either the server or client
machine.
I set the ssl_renegotiation off and the copy is now being attempted (and
is well past where it failed before) with SSL on.
There's a second problem in that SLONY appears to have a memory
management issue that I've tickled with a COPY of this particular table,
and it's a bad one - it may preclude me from being able to resync at all
- but that's not Postgres' fault.
Looks like this bug report can be closed as the issue does not appear to
be yours beyond the SSL issue that is documented.
(Whether Postgress 9's internal replication will solve this for me when
it is released is something I'm not sure about - I think the answer is
"no", since if I'm reading the docs correctly Postgres 9 requires that
both master and slave be in sync via some other method before the
replication is enabled - that is, it's not capable of taking a "raw"
(populated with empty tables or not) new system and bringing it into
sync and then replicating from there. That's a major problem in a
"live" environment if there's a failure of some sort and you want to
bring the system that failed back into the cluster......)
-- Karl
On Sun, Aug 1, 2010 at 14:02, Karl Denninger <karl@denninger.net> wrote:
(Whether Postgress 9's internal replication will solve this for me when it
is released is something I'm not sure about - I think the answer is "no",
since if I'm reading the docs correctly Postgres 9 requires that both master
and slave be in sync via some other method before the replication is enabled
Not only that, but you cant sync certain tables or databases.
However, I have had great success with a simple:
select pg_start_backup();
... rsync database dir to slave, fixup postgresql.conf,recovery.conf
select pg_stop_backup();
That lets you "sync" up the slave with 0 downtime. (the io from rsync
can be a problem tho)