Streaming Replication Randomly Locking Up
Hello,
I'm having an issue where streaming replication randomly stops
working. I haven't been able to find anything in the logs that points to
an issue, but the Postgres startup process shows a "waiting" status on the slave:
postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
The replication works great for days, but randomly seems to lock up and
replication halts. I verified that the two databases were out of sync with
a query on both of them. Has anyone experienced this issue before?
Here are some relevant config settings:
Master:
wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
max_wal_senders = 2
wal_keep_segments = 32
Slave:
wal_level = hot_standby
checkpoint_segments = 32
#checkpoint_completion_target = 0.5
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
#wal_receiver_status_interval = 10s
#hot_standby_feedback = off
Thank you for any help you can provide!
Andrew
I've never seen this happen. Looks like you might be using 9.1? Are
you up to date on all the 9.1.x releases?
Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi Lonni,
Yes, I am using PG 9.1.9.
Yes, 1 slave syncing from the master
CentOS 6.4
I don't see any network or hardware issues (e.g. NIC) but will look more
into this. They are communicating on a private network and switch.
I forgot to mention that after I restart the slave, everything syncs right
back up and all is working again. So if it is a network issue, the
replication is just stopping after some hiccup instead of retrying and
resuming when things are back up.
Thanks!
Are you certain that there are no relevant errors in the database logs
(on both master & slave)? Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?
The only thing I see that is a possibility for the issue is in the slave
log:
LOG: unexpected EOF on client connection
LOG: could not receive data from client: Connection reset by peer
I don't know if that's related or not as it could just be somebody running
a query. The log file does seem to be riddled with these but the
replication failures don't happen constantly.
As far as I know I'm not swallowing any errors. The logging is all set as
the default:
log_destination = 'stderr'
logging_collector = on
#client_min_messages = notice
#log_min_messages = warning
#log_min_error_statement = error
#log_min_duration_statement = -1
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_error_verbosity = default
I'm going to have a look at the NICs to make sure there's no issue there.
Thanks again for your help!
I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname. That will help to rule
in/out those 'unexpected EOF' errors.
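One possible way to apply the suggestion above is through log_line_prefix. The fragment below is only a sketch for the slave's postgresql.conf, not settings taken from this thread; the layout of the prefix is an illustrative assumption. The escapes are the standard ones: %t is the timestamp, %p the process ID, %u the user, %d the database, and %h the remote host.

```conf
# postgresql.conf (sketch; the prefix layout is an illustrative choice)
log_line_prefix = '%t [%p] %u@%d %h '   # timestamp, PID, user@database, client host
log_connections = on                    # log each successful connection
log_disconnections = on                 # log session end, including session duration
```

With log_hostname left at its default of off, %h shows the client's IP address rather than a resolved name. These settings should be picked up by a configuration reload, and the connection logging applies to sessions started after that.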
Yep, that's the first thing I'm going to do.
On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
Hello,
I'm having an issue where streaming replication just randomly stops working.
I haven't been able to find anything in the logs which point to an issue,
but the Postgres process shows a "waiting" status on the slave:
postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
There is a recovery conflict which it is waiting to go away. In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
...and you are willing to wait forever.
Cheers,
Jeff
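To illustrate the point about waiting forever: with both delays set to -1, the standby holds back WAL replay for as long as a conflicting session exists, so a finite value puts a bound on how long replay can stall. The fragment below is a sketch, and the 30s values are simply the PostgreSQL defaults, used here as an assumption rather than a recommendation for this workload.

```conf
# postgresql.conf on the slave (sketch; values are the defaults, not tuned advice)
max_standby_archive_delay = 30s     # -1 means wait forever for conflicting queries
max_standby_streaming_delay = 30s   # same limit for WAL received via streaming
```

The trade-off is that queries on the slave that still conflict with replay once the delay has expired are cancelled, so long-running reporting queries may need a larger value.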
Hi Jeff,
Here is the full process list at the time it stopped working (I have
changed the actual username, db and IP for security). Would the idle in
transaction process be the culprit?
postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
postgres 10403 0.0 0.2 3430372 25920 ? Ss Aug14 0:31 postgres: user db x.x.x.x(61656) idle in transaction
postgres 19933 0.0 0.4 3426604 49564 ? S Aug05 0:06 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/9.1/data
postgres 19935 0.0 0.0 175288 396 ? Ss Aug05 0:13 postgres: logger process
postgres 21133 0.0 0.2 3443600 30680 ? Ss 09:28 0:00 postgres: user db x.x.x.x(64430) idle
postgres 21134 0.4 0.2 3430160 27656 ? Ss 09:28 0:16 postgres: user db x.x.x.x(64431) idle
root 21529 0.0 0.0 103240 844 pts/0 S+ 10:33 0:00 grep --color postgres
Thanks,
Andrew
I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.
It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.
John DeSoi, Ph.D.
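Applied to the archive_command from the original post, the change described above might look like the sketch below. The host and path are the placeholders already used in this thread, and 30 is the value John mentions. rsync's --timeout is an I/O timeout in seconds: if no data is transferred for that long, rsync exits with an error instead of hanging.

```conf
# postgresql.conf on the master (sketch; foo@foo and the path are the thread's placeholders)
archive_command = 'rsync -a --timeout=30 %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
```

Because PostgreSQL retries archive_command when it returns a non-zero exit status, a timed-out transfer should be attempted again rather than blocking the archiver indefinitely.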
Awesome, I'll give that a shot John.
On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
Hi Jeff,
Here is the full process list at the time it stopped working (I have changed
the actual username, db and IP for security). Would the idle in transaction
process be the culprit?
Most likely, yes. You should be able to dig into pg_locks to verify.
Cheers,
Jeff
On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
Hi Jeff,
Here is the full process list at the time it stopped working (I have changed
the actual username, db and IP for security). Would the idle in transaction
process be the culprit?
Most likely, yes. You should be able to dig into pg_locks to verify.
Actually, you can't. The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.
Still, that idle in transaction process is almost surely the culprit.
Cheers,
Jeff
Ok, next time it happens I'll try to do more sleuthing to figure out if
that's the issue. For now, I'm going to try adding --timeout=30 to the
rsync command and see if that fixes things.
Thanks again for your help!
Andrew