Streaming Replication Randomly Locking Up

Started by Andrew Berman over 12 years ago | 14 messages | general
#1 Andrew Berman
rexxe98@gmail.com

Hello,

I'm having an issue where streaming replication just randomly stops
working. I haven't been able to find anything in the logs which points to
an issue, but the Postgres process shows a "waiting" status on the slave:

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730

The replication works great for days, but randomly seems to lock up and
replication halts. I verified that the two databases were out of sync with
a query on both of them. Has anyone experienced this issue before?

Here are some relevant config settings:

Master:

wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
max_wal_senders = 2
wal_keep_segments = 32

Slave:

wal_level = hot_standby
checkpoint_segments = 32
#checkpoint_completion_target = 0.5
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
#wal_receiver_status_interval = 10s
#hot_standby_feedback = off

Thank you for any help you can provide!

Andrew

#2 Lonni J Friedman
netllama@gmail.com
In reply to: Andrew Berman (#1)
Re: Streaming Replication Randomly Locking Up

I've never seen this happen. Looks like you might be using 9.1? Are
you up to date on all the 9.1.x releases?

Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Andrew Berman
rexxe98@gmail.com
In reply to: Lonni J Friedman (#2)
Re: Streaming Replication Randomly Locking Up

Hi Lonni,

Yes, I am using PG 9.1.9.
Yes, 1 slave syncing from the master
CentOS 6.4
I don't see any network or hardware issues (e.g. NIC) but will look more
into this. They are communicating on a private network and switch.

I forgot to mention that after I restart the slave, everything syncs right
back up and all is working again, so if it is a network issue, the
replication is just stopping after some hiccup instead of retrying and
resuming when things are back up.

Thanks!

#4 Lonni J Friedman
netllama@gmail.com
In reply to: Andrew Berman (#3)
Re: Streaming Replication Randomly Locking Up

Are you certain that there are no relevant errors in the database logs
(on both master & slave)? Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?

#5 Andrew Berman
rexxe98@gmail.com
In reply to: Lonni J Friedman (#4)
Re: Streaming Replication Randomly Locking Up

The only thing I see that is a possibility for the issue is in the slave
log:

LOG: unexpected EOF on client connection
LOG: could not receive data from client: Connection reset by peer

I don't know if that's related or not as it could just be somebody running
a query. The log file does seem to be riddled with these but the
replication failures don't happen constantly.

As far as I know I'm not swallowing any errors. The logging is all set as
the default:

log_destination = 'stderr'
logging_collector = on
#client_min_messages = notice
#log_min_messages = warning
#log_min_error_statement = error
#log_min_duration_statement = -1
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_error_verbosity = default

I'm going to have a look at the NICs to make sure there's no issue there.

Thanks again for your help!

#6 Lonni J Friedman
netllama@gmail.com
In reply to: Andrew Berman (#5)
Re: Streaming Replication Randomly Locking Up

I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname. That will help to rule
in/out those 'unexpected EOF' errors.
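
As a rough illustration of why the extra prefix helps: assuming a setting such as log_line_prefix = '%t %h ' (timestamp, then remote host; this value is an assumption, not something quoted in the thread), each message can be attributed to a client. The sketch below counts the 'unexpected EOF' messages per host. The sample lines are invented to match that assumed format and are not real log output, and the helper name is hypothetical.

```python
import re
from collections import Counter

# Assumes log_line_prefix = '%t %h ' so that lines look like:
#   2013-08-15 12:00:01 PDT 10.0.0.5 LOG:  unexpected EOF on client connection
# The prefix setting is an illustration, not one taken from the thread.
PREFIXED_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+) "
    r"(?P<host>\S+) (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def count_eof_by_host(log_lines):
    """Count 'unexpected EOF' messages per remote host."""
    counts = Counter()
    for line in log_lines:
        match = PREFIXED_LINE.match(line)
        # Lines without the assumed prefix simply do not match and are skipped.
        if match and "unexpected EOF on client connection" in match.group("msg"):
            counts[match.group("host")] += 1
    return counts


# Invented sample lines in the assumed format (not real log output).
sample = [
    "2013-08-15 12:00:01 PDT 10.0.0.5 LOG:  unexpected EOF on client connection",
    "2013-08-15 12:00:09 PDT 10.0.0.7 LOG:  unexpected EOF on client connection",
    "2013-08-15 12:03:44 PDT 10.0.0.5 LOG:  unexpected EOF on client connection",
]
print(count_eof_by_host(sample))
```

If the counts point at ordinary application clients rather than the master's address, the EOF messages are probably unrelated to the replication stall.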

#7 Andrew Berman
rexxe98@gmail.com
In reply to: Lonni J Friedman (#6)
Re: Streaming Replication Randomly Locking Up

Yep, that's the first thing I'm going to do.

#8 Jeff Janes
jeff.janes@gmail.com
In reply to: Andrew Berman (#1)
Re: Streaming Replication Randomly Locking Up

On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:

Hello,

I'm having an issue where streaming replication just randomly stops working.
I haven't been able to find anything in the logs which point to an issue,
but the Postgres process shows a "waiting" status on the slave:

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting

There is a recovery conflict which it is waiting to go away. In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.

max_standby_archive_delay = -1
max_standby_streaming_delay = -1

...and you are willing to wait forever.
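
The effect of -1 can be pictured with a toy model. This is a simplified sketch of the documented behaviour (a finite delay eventually cancels conflicting standby queries, while -1 never does), not PostgreSQL's actual code. The function name and the millisecond arguments are made up for illustration.

```python
def standby_cancels_queries(apply_delay_ms, max_standby_delay_ms):
    """Toy model: would the standby stop waiting and cancel conflicting queries?

    apply_delay_ms       -- how long WAL replay has been held up (hypothetical input)
    max_standby_delay_ms -- value of max_standby_streaming_delay, in milliseconds
    """
    if max_standby_delay_ms == -1:
        # -1 means "wait forever": replay stays blocked until the query ends.
        return False
    return apply_delay_ms >= max_standby_delay_ms


# With the 30 s default, a conflict that has lasted 31 s gets resolved by cancellation.
print(standby_cancels_queries(31_000, 30_000))
# With -1, as in the posted config, even a day-long conflict is never cancelled.
print(standby_cancels_queries(24 * 60 * 60 * 1000, -1))
```

The trade-off is that a finite delay lets replay proceed at the cost of cancelling long-running queries on the standby.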

Cheers,

Jeff

#9 Andrew Berman
rexxe98@gmail.com
In reply to: Jeff Janes (#8)
Re: Streaming Replication Randomly Locking Up

Hi Jeff,

Here is the full process list at the time it stopped working (I have
changed the actual username, db and IP for security). Would the idle in
transaction process be the culprit?

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
postgres 10403 0.0 0.2 3430372 25920 ? Ss Aug14 0:31 postgres: user db x.x.x.x(61656) idle in transaction
postgres 19933 0.0 0.4 3426604 49564 ? S Aug05 0:06 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/9.1/data
postgres 19935 0.0 0.0 175288 396 ? Ss Aug05 0:13 postgres: logger process
postgres 21133 0.0 0.2 3443600 30680 ? Ss 09:28 0:00 postgres: user db x.x.x.x(64430) idle
postgres 21134 0.4 0.2 3430160 27656 ? Ss 09:28 0:16 postgres: user db x.x.x.x(64431) idle
root 21529 0.0 0.0 103240 844 pts/0 S+ 10:33 0:00 grep --color postgres
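
A check like this could be scripted for the next time it happens. The following is a minimal sketch, assuming `ps aux`-style lines in which the command title starts at the eleventh column, as in the listing above. The function name is hypothetical, and a match is only a hint about a possible blocker, not proof.

```python
def find_recovery_blockers(ps_lines):
    """Return (startup_waiting, idle_in_txn_pids) from `ps aux`-style lines."""
    startup_waiting = False
    idle_in_txn_pids = []
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # not a full ps row
        pid = fields[1]
        title = " ".join(fields[10:])  # COMMAND column and everything after it
        if "startup process" in title and title.endswith("waiting"):
            startup_waiting = True
        elif title.endswith("idle in transaction"):
            idle_in_txn_pids.append(int(pid))
    return startup_waiting, idle_in_txn_pids


# Three rows taken from the process list in this message.
rows = [
    "postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: "
    "startup process recovering 000000010000053D0000003F waiting",
    "postgres 10403 0.0 0.2 3430372 25920 ? Ss Aug14 0:31 postgres: "
    "user db x.x.x.x(61656) idle in transaction",
    "postgres 21133 0.0 0.2 3443600 30680 ? Ss 09:28 0:00 postgres: "
    "user db x.x.x.x(64430) idle",
]
print(find_recovery_blockers(rows))
```

On these rows it reports that the startup process is waiting and flags PID 10403, the session shown as idle in transaction.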

Thanks,

Andrew

#10 John DeSoi
desoi@pgedit.com
In reply to: Andrew Berman (#1)
Re: Streaming Replication Randomly Locking Up

On Aug 15, 2013, at 1:07 PM, Andrew Berman <rexxe98@gmail.com> wrote:

I'm having an issue where streaming replication just randomly stops working. I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a "waiting" status on the slave:

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730

The replication works great for days, but randomly seems to lock up and replication halts. I verified that the two databases were out of sync with a query on both of them. Has anyone experienced this issue before?

Here are some relevant config settings:

Master:

wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
max_wal_senders = 2
wal_keep_segments = 32

I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.

It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.
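
For anyone who prefers a wrapper script over an inline command, the same idea might look like the sketch below. It is only an illustration: the destination comes from the config posted earlier in the thread, while the script itself, its function names, and the 120-second hard limit are assumptions. rsync's --timeout option sets an I/O timeout in seconds, and the outer subprocess timeout is an extra guard in case rsync itself never returns.

```python
import subprocess
import sys

RSYNC_IO_TIMEOUT = 30  # seconds; the value reported to work in this message
DEST_DIR = "foo@foo:/var/lib/pgsql/9.1/wals"  # destination from the posted config


def build_rsync_command(wal_path, wal_name):
    """Return the rsync argv for shipping one WAL segment."""
    return [
        "rsync", "-a",
        "--timeout={}".format(RSYNC_IO_TIMEOUT),
        wal_path,
        "{}/{}".format(DEST_DIR, wal_name),
    ]


def archive_wal(wal_path, wal_name, hard_limit=120):
    """Run rsync; return its exit status, or 1 if it exceeds hard_limit seconds.

    hard_limit is a hypothetical safety net, not something from the thread.
    """
    try:
        result = subprocess.run(
            build_rsync_command(wal_path, wal_name),
            stdin=subprocess.DEVNULL,  # same effect as </dev/null in the original
            timeout=hard_limit,
        )
    except subprocess.TimeoutExpired:
        return 1  # non-zero tells PostgreSQL the archive attempt failed
    return result.returncode


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```

It would be called with %p and %f as its two arguments from archive_command. Because a non-zero exit status marks the attempt as failed, PostgreSQL retries the segment later instead of waiting on a hung transfer.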

John DeSoi, Ph.D.

#11 Andrew Berman
rexxe98@gmail.com
In reply to: John DeSoi (#10)
Re: Streaming Replication Randomly Locking Up

Awesome, I'll give that a shot John.

#12 Jeff Janes
jeff.janes@gmail.com
In reply to: Andrew Berman (#9)
Re: Streaming Replication Randomly Locking Up

On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:

Hi Jeff,

Here is the full process list at the time it stopped working (I have changed
the actual username, db and IP for security). Would the idle in transaction
process be the culprit?

Most likely, yes. You should be able to dig into pg_locks to verify.

Cheers,

Jeff

#13 Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#12)
Re: Streaming Replication Randomly Locking Up

On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:

Hi Jeff,

Here is the full process list at the time it stopped working (I have changed
the actual username, db and IP for security). Would the idle in transaction
process be the culprit?

Most likely, yes. You should be able to dig into pg_locks to verify.

Actually, you can't. The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.

Still, that idle in transaction process is almost surely the culprit.

Cheers,

Jeff

#14 Andrew Berman
rexxe98@gmail.com
In reply to: Jeff Janes (#13)
Re: Streaming Replication Randomly Locking Up

Ok, next time it happens I'll try to do more sleuthing to figure out if
that's the issue. For now, I'm going to try adding --timeout=30 to the
rsync command and see if that fixes things.

Thanks again for your help!

Andrew
