Dropped connections with pg_basebackup

Started by Francisco Reyes, over 10 years ago, 7 messages, general
#1Francisco Reyes
lists@natserv.net

We have an existing setup of 9.3 servers. Replication has been rock solid,
but recently the circuits between data centers were upgraded and
pg_basebackup now often fails when setting up streaming replication.
What used to take 10+ hours now took only 68 minutes, but it needed many
retries. Many attempts fail within minutes, while others get to 90% or
higher and then drop. The reason we are doing a sync is that we have to
swap data centers every so often for compliance, so I had to swap master
and slave.

Calling pg_basebackup like this:
pg_basebackup -P -R -X s -h <HostName> -D <Folder> -U replicator

The error we keep having is:
Sep 23 13:36:32 <HostName> postgres[16804]: [11-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator LOG: SSL error: bad write retry
Sep 23 13:36:32 <HostName> postgres[16804]: [12-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator LOG: SSL error: bad write retry
Sep 23 13:36:32 <HostName> postgres[16804]: [13-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator FATAL: connection to client lost
Sep 23 13:36:32 <HostName> postgres[16972]: [9-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator LOG: could not receive data from client: Connection reset by peer

I have been working with the network team, and we have been actively
monitoring the line and running ping while the replication is being set
up. At the point the "connection reset by peer" error happens, we see no
issue with the network, and ping shows no problem at that point in time.

The issue also happened on another set of machines; likewise, I had to
retry many times before pg_basebackup completed the initial sync. Once
the initial sync is done, replication is fine.

I tried both "-X s" (stream) and "-X f" (fetch) and both fail often.

Any ideas what may be going on?
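
While the cause is still open, the manual retries described above could be scripted. What follows is only a sketch: the `retry` helper, its argument order, and the fixed delay between attempts are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success: stop retrying.
        if "$@"; then
            return 0
        fi
        # Out of attempts: report and give up.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

It could be invoked as `retry 5 30 pg_basebackup -P -R -X s -h <HostName> -D <Folder> -U replicator`. pg_basebackup expects the target directory to be empty or absent, so a partial copy left by a failed attempt has to be cleared before the next one; that step is deliberately left out of the sketch.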

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Sherrylyn Branchaw
sbranchaw@gmail.com
In reply to: Francisco Reyes (#1)
Re: Dropped connections with pg_basebackup

I'm assuming based on the "SSL error" that you have ssl set to 'on'. What's
your ssl_renegotiation_limit? The default is 512MB, but setting it to 0 has
solved problems for a number of people on this list, including myself.

Sherrylyn
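
For reference, one way to apply the suggested setting is to append an override to postgresql.conf, since the last entry for a parameter in that file takes effect. This is a hedged sketch: the helper name `disable_ssl_renegotiation` and the idea of passing the config path as an argument are assumptions for illustration, and the location of postgresql.conf varies by installation.

```shell
#!/bin/sh
# Hypothetical helper: disable_ssl_renegotiation /path/to/postgresql.conf
# Appends "ssl_renegotiation_limit = 0" unless the last uncommented
# "name = value" entry for that parameter is already 0.
disable_ssl_renegotiation() {
    conf_file=$1
    # Extract the value of the last uncommented entry, if any.
    current=$(sed -n 's/^[[:space:]]*ssl_renegotiation_limit[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$conf_file" | tail -n 1)
    if [ "$current" = "0" ]; then
        echo "ssl_renegotiation_limit already 0 in $conf_file"
        return 0
    fi
    # A later entry overrides earlier ones, so appending is sufficient.
    printf '%s\n' 'ssl_renegotiation_limit = 0' >> "$conf_file"
    echo "appended ssl_renegotiation_limit = 0 to $conf_file"
}
```

After the file is changed, the server still has to reload its configuration (for example with `pg_ctl reload` or `SELECT pg_reload_conf();`), and the effective value can be checked from a session with `SHOW ssl_renegotiation_limit;`.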

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Francisco Reyes (#1)
Re: Dropped connections with pg_basebackup

On 09/24/2015 12:57 PM, Francisco Reyes wrote:

The error we keep having is:
Sep 23 13:36:32 <HostName> postgres[16804]: [11-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator LOG: SSL error: bad write retry
Sep 23 13:36:32 <HostName> postgres[16804]: [12-1] 2015-09-23 13:36:32 EDT <IP> [unknown] replicator LOG: SSL error: bad write retry

Seems to be an SSL problem, so how is your SSL set up on the servers?

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Sherrylyn Branchaw (#2)
Re: Dropped connections with pg_basebackup

Sherrylyn Branchaw wrote:

I'm assuming based on the "SSL error" that you have ssl set to 'on'. What's
your ssl_renegotiation_limit? The default is 512MB, but setting it to 0 has
solved problems for a number of people on this list, including myself.

Moreover, the default has been set to 0, because the bugs both in our
usage and in OpenSSL code itself seem never to end. Just disable it.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Sherrylyn Branchaw
sbranchaw@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Dropped connections with pg_basebackup

Ah, yes, it's been removed from 9.5:
http://www.postgresql.org/docs/9.5/static/release-9-5.html

Good to know.

#6Francisco Reyes
lists@natserv.net
In reply to: Sherrylyn Branchaw (#2)
Re: Dropped connections with pg_basebackup

On 09/24/2015 04:29 PM, Sherrylyn Branchaw wrote:

I'm assuming based on the "SSL error" that you have ssl set to 'on'.
What's your ssl_renegotiation_limit? The default is 512MB, but setting
it to 0 has solved problems for a number of people on this list,
including myself.

I have also seen instances where ssl_renegotiation_limit=0 helped, and I
already tried that. It did not help in this case.

I will perhaps try some tests with a non-SSL connection. These are
machines on an internal network, so it may not be too much of a security
issue to turn off SSL, at least during the initial sync.
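
A non-SSL test of this kind could look roughly like the sketch below, which asks libpq not to use SSL through the `PGSSLMODE` environment variable. The wrapper name `basebackup_nossl` and the overridable `PG_BASEBACKUP` variable are assumptions for illustration; the pg_basebackup options are the ones from the original command. The server's pg_hba.conf must also accept non-SSL replication connections (a `host` entry rather than `hostssl`) for this to connect.

```shell
#!/bin/sh
# Path or name of the pg_basebackup binary; overridable (assumption, for testing).
PG_BASEBACKUP=${PG_BASEBACKUP:-pg_basebackup}

# Hypothetical wrapper: basebackup_nossl HOST TARGET_DIR
# Runs the same base backup as in the original post, with SSL disabled
# on the client side for this one invocation only.
basebackup_nossl() {
    host=$1
    target_dir=$2
    PGSSLMODE=disable "$PG_BASEBACKUP" -P -R -X s \
        -h "$host" -D "$target_dir" -U replicator
}
```

If the drops disappear without SSL, that points at the SSL layer rather than the circuit; if they persist, the network path remains the main suspect.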

#7Francisco Reyes
lists@natserv.net
In reply to: Alvaro Herrera (#4)
Re: Dropped connections with pg_basebackup

On 09/24/2015 04:34 PM, Alvaro Herrera wrote:

Moreover, the default has been set to 0, because the bugs both in our
usage and in OpenSSL code itself seem never to end. Just disable it.

Set it to 0 and it did not help.
I will likely set it to 0 on all machines anyway, since I have seen some
SSL errors in the logs.
