Replication failed after stalling

Started by Joe Van Dykover 12 years ago15 messagesgeneral
Jump to latest
#1Joe Van Dyk
joe@tanga.com

I'm running Postgresql 9.3. I have a streaming replication server. Someone
was running a long COPY query (8 hours) on the standby which halted
replication. The replication stopped at 3:30 am. I canceled the
long-running query at 9:30 am and replication data started catching up.

The data up until 10 am got restored fine (took until 10:30 am to restore
that much). Then I started getting errors like "FATAL: could not receive
data from WAL stream: ERROR: requested WAL segment
00000001000003C300000086 has already been removed".

I'm confused about how pg could restore data from 3:30 am to 10 am, then
start complaining about missing WAL files.

What's the best way to avoid this problem? Increase wal_keep_segments?

Joe

#2Joe Van Dyk
joe@tanga.com
In reply to: Joe Van Dyk (#1)
Re: Replication failed after stalling

A possibly related question:

I've set wal_keep_segments to 10,000 and also have archive_command running
wal-e. I'm seeing my wal files disappear from pg_xlog after 30 minutes. Is
that expected? Is there a way around that?

(I want to use streaming replication and wal-e for PITR restores)

On Wed, Dec 18, 2013 at 11:26 AM, Joe Van Dyk <joe@tanga.com> wrote:

Show quoted text

I'm running Postgresql 9.3. I have a streaming replication server. Someone
was running a long COPY query (8 hours) on the standby which halted
replication. The replication stopped at 3:30 am. I canceled the
long-running query at 9:30 am and replication data started catching up.

The data up until 10 am got restored fine (took until 10:30 am to restore
that much). Then I started getting errors like "FATAL: could not receive
data from WAL stream: ERROR: requested WAL segment
00000001000003C300000086 has already been removed".

I'm confused about how pg could restore data from 3:30 am to 10 am, then
start complaining about missing WAL files.

What's the best way to avoid this problem? Increase wal_keep_segments?

Joe

#3Jerry Sievers
gsievers19@comcast.net
In reply to: Joe Van Dyk (#1)
Re: Replication failed after stalling

Joe Van Dyk <joe@tanga.com> writes:

I'm running Postgresql 9.3. I have a streaming replication server. Someone was running a long COPY query (8 hours) on the standby which halted replication. The
replication stopped at 3:30 am. I canceled the long-running query at 9:30 am and replication data started catching up.

The data up until 10 am got restored fine (took until 10:30 am to restore that much). Then I started getting errors like "FATAL:  could not receive data from WAL
stream: ERROR:  requested WAL segment 00000001000003C300000086 has already been removed".

I'm confused about how pg could restore data from 3:30 am to 10 am, then start complaining about missing WAL files.

What's the best way to avoid this problem? Increase wal_keep_segments?

Yes and/or implement as a hybrid of streaming and WAL shipping.

Quite simply, your wal_keep segments was almost enough to get you
through that backlog period but as your standby was catching up, it hit
a point where there was a gap.

Depending on how much traffic your master sees at various times of the
day, it's unsurprising that during peak loads, your grace-period is a
lot lower than during off-peak times due to variations in how quickly
WAL segments are filled and cycled over.

HTH

Joe

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Joe Van Dyk (#2)
Re: Replication failed after stalling

On 12/18/2013 12:15 PM, Joe Van Dyk wrote:

A possibly related question:

I've set wal_keep_segments to 10,000 and also have archive_command
running wal-e. I'm seeing my wal files disappear from pg_xlog after 30
minutes. Is that expected? Is there a way around that?

Well a WAL segment is 16MB in size so that should give you a basis for
determining whether the above is appropriate, my guess it is not. I do
not know enough about Wal-e, but my guess is it is siphoning off WAL
segments before you want it to.

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Sergey Konoplev
gray.ru@gmail.com
In reply to: Joe Van Dyk (#1)
Re: Replication failed after stalling

On Wed, Dec 18, 2013 at 11:26 AM, Joe Van Dyk <joe@tanga.com> wrote:

I'm running Postgresql 9.3. I have a streaming replication server. Someone
was running a long COPY query (8 hours) on the standby which halted
replication. The replication stopped at 3:30 am. I canceled the long-running
query at 9:30 am and replication data started catching up.

What do you mean by "COPY on the standby halted replication"?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Joe Van Dyk
joe@tanga.com
In reply to: Sergey Konoplev (#5)
Re: Replication failed after stalling

On Wed, Dec 18, 2013 at 3:39 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:

On Wed, Dec 18, 2013 at 11:26 AM, Joe Van Dyk <joe@tanga.com> wrote:

I'm running Postgresql 9.3. I have a streaming replication server.

Someone

was running a long COPY query (8 hours) on the standby which halted
replication. The replication stopped at 3:30 am. I canceled the

long-running

query at 9:30 am and replication data started catching up.

What do you mean by "COPY on the standby halted replication"?

If I run "COPY (select * from complicate_view) to stdout" on the standby,
I've noticed that sometimes halts replication updates to the slave.

For example, that's happening right now and "now() -
pg_last_xact_replay_timestamp()" is 22 minutes. There's many transactions
per second being committed on the master. Once that query is canceled, the
slave catches up immediately.

Joe

#7Sergey Konoplev
gray.ru@gmail.com
In reply to: Joe Van Dyk (#6)
Re: Replication failed after stalling

On Sun, Dec 29, 2013 at 9:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Wed, Dec 18, 2013 at 3:39 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:

On Wed, Dec 18, 2013 at 11:26 AM, Joe Van Dyk <joe@tanga.com> wrote:

I'm running Postgresql 9.3. I have a streaming replication server.
Someone
was running a long COPY query (8 hours) on the standby which halted
replication. The replication stopped at 3:30 am. I canceled the
long-running
query at 9:30 am and replication data started catching up.

What do you mean by "COPY on the standby halted replication"?

If I run "COPY (select * from complicate_view) to stdout" on the standby,
I've noticed that sometimes halts replication updates to the slave.

For example, that's happening right now and "now() -
pg_last_xact_replay_timestamp()" is 22 minutes. There's many transactions
per second being committed on the master. Once that query is canceled, the
slave catches up immediately.

And what

\x
select * from pg_stat_repication;

shows?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Joe Van Dyk (#6)
Re: Replication failed after stalling

Joe Van Dyk wrote:

If I run "COPY (select * from complicate_view) to stdout" on the standby, I've noticed that sometimes
halts replication updates to the slave.

For example, that's happening right now and "now() - pg_last_xact_replay_timestamp()" is 22 minutes.
There's many transactions per second being committed on the master. Once that query is canceled, the
slave catches up immediately.

You have hot_standby_feedback = on, right?

In that case that is expected behaviour.
Some change on the master conflicted with the query on the standby,
perhaps with a tuple cleaned up after a HOT update. Replication will
stall until the query is done.

If that is unacceptable to you and you would rather have queries
canceled on the standby if they take too long, set hot_standby_feedback
back to off and increase vacuum_defer_cleanup_age.

But it is not possible to have a predictable maximum replication lag
and arbitrarily long running queries on the standby at the same time.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Sergey Konoplev
gray.ru@gmail.com
In reply to: Laurenz Albe (#8)
Re: Replication failed after stalling

On Mon, Dec 30, 2013 at 12:27 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

Joe Van Dyk wrote:

If I run "COPY (select * from complicate_view) to stdout" on the standby, I've noticed that sometimes
halts replication updates to the slave.

For example, that's happening right now and "now() - pg_last_xact_replay_timestamp()" is 22 minutes.
There's many transactions per second being committed on the master. Once that query is canceled, the
slave catches up immediately.

You have hot_standby_feedback = on, right?

In that case that is expected behaviour.
Some change on the master conflicted with the query on the standby,
perhaps with a tuple cleaned up after a HOT update. Replication will
stall until the query is done.

IIRC, the applying process is paused but the receiving one is going on
in this case, isn't it?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Joe Van Dyk
joe@tanga.com
In reply to: Joe Van Dyk (#1)
Re: Replication failed after stalling

On Mon, Dec 30, 2013 at 10:27 AM, Sergey Konoplev <gray.ru@gmail.com> wrote:

On Mon, Dec 30, 2013 at 12:02 AM, Joe Van Dyk <joe@tanga.com> wrote:

On Sun, Dec 29, 2013 at 10:52 PM, Sergey Konoplev <gray.ru@gmail.com>

wrote:

On Sun, Dec 29, 2013 at 9:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Wed, Dec 18, 2013 at 3:39 PM, Sergey Konoplev <gray.ru@gmail.com>
wrote:
If I run "COPY (select * from complicate_view) to stdout" on the
standby,
I've noticed that sometimes halts replication updates to the slave.

For example, that's happening right now and "now() -
pg_last_xact_replay_timestamp()" is 22 minutes. There's many
transactions
per second being committed on the master. Once that query is canceled,
the
slave catches up immediately.

And what

\x
select * from pg_stat_repication;

shows?

on the master, right?

Yes.

And it would be very useful to take a look at your checkpoints and
replication configuration parameters on both master and replica.

master and replica have same settings.

checkpoint_completion_target: 0.9
checkpoint_segments: 16
checkpoint_timeout: 5m
checkpoint_warning: 30s
hot_standby: on
hot_standby_feedback: on

pid | 10736
usesysid | 10
usename | postgres
application_name | walreceiver
client_addr | <the ip>
client_hostname |
client_port | 47124
backend_start | 2013-12-30 12:08:42.967868-08
state | streaming
sent_location | 410/BC152000
write_location | 410/BC152000
flush_location | 410/BC152000
replay_location | 410/A758B7D0
sync_priority | 0
sync_state | async

#11Sergey Konoplev
gray.ru@gmail.com
In reply to: Joe Van Dyk (#10)
Re: Replication failed after stalling

On Mon, Dec 30, 2013 at 8:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Mon, Dec 30, 2013 at 10:27 AM, Sergey Konoplev <gray.ru@gmail.com> wrote:

On Mon, Dec 30, 2013 at 12:02 AM, Joe Van Dyk <joe@tanga.com> wrote:

On Sun, Dec 29, 2013 at 10:52 PM, Sergey Konoplev <gray.ru@gmail.com>
wrote:

On Sun, Dec 29, 2013 at 9:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Wed, Dec 18, 2013 at 3:39 PM, Sergey Konoplev <gray.ru@gmail.com>
wrote:
If I run "COPY (select * from complicate_view) to stdout" on the
standby,
I've noticed that sometimes halts replication updates to the slave.

\x
select * from pg_stat_repication;

And it would be very useful to take a look at your checkpoints and
replication configuration parameters on both master and replica.

master and replica have same settings.

checkpoint_completion_target: 0.9
checkpoint_segments: 16
checkpoint_timeout: 5m
checkpoint_warning: 30s
hot_standby: on
hot_standby_feedback: on

I meant all the replication settings, see [1]http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html. And pg_stat_statements
when there is a problem, preferable the error, because when everything
is okay it is not very useful actually.

[1]: http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Scott Marlowe
scott.marlowe@gmail.com
In reply to: Adrian Klaver (#4)
Re: Replication failed after stalling

On Wed, Dec 18, 2013 at 1:51 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:

On 12/18/2013 12:15 PM, Joe Van Dyk wrote:

A possibly related question:

I've set wal_keep_segments to 10,000 and also have archive_command
running wal-e. I'm seeing my wal files disappear from pg_xlog after 30
minutes. Is that expected? Is there a way around that?

Well a WAL segment is 16MB in size so that should give you a basis for
determining whether the above is appropriate, my guess it is not. I do not
know enough about Wal-e, but my guess is it is siphoning off WAL segments
before you want it to.

Don't some operations like forced checkpoints etc skip to the next WAL
resulting in them not necessarily being "full"?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Joe Van Dyk
joe@tanga.com
In reply to: Sergey Konoplev (#11)
Re: Replication failed after stalling

On Mon, Dec 30, 2013 at 9:11 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:

On Mon, Dec 30, 2013 at 8:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Mon, Dec 30, 2013 at 10:27 AM, Sergey Konoplev <gray.ru@gmail.com>

wrote:

On Mon, Dec 30, 2013 at 12:02 AM, Joe Van Dyk <joe@tanga.com> wrote:

On Sun, Dec 29, 2013 at 10:52 PM, Sergey Konoplev <gray.ru@gmail.com>
wrote:

On Sun, Dec 29, 2013 at 9:56 PM, Joe Van Dyk <joe@tanga.com> wrote:

On Wed, Dec 18, 2013 at 3:39 PM, Sergey Konoplev <

gray.ru@gmail.com>

wrote:
If I run "COPY (select * from complicate_view) to stdout" on the
standby,
I've noticed that sometimes halts replication updates to the slave.

\x
select * from pg_stat_repication;

And it would be very useful to take a look at your checkpoints and
replication configuration parameters on both master and replica.

master and replica have same settings.

checkpoint_completion_target: 0.9
checkpoint_segments: 16
checkpoint_timeout: 5m
checkpoint_warning: 30s
hot_standby: on
hot_standby_feedback: on

I meant all the replication settings, see [1]. And pg_stat_statements
when there is a problem, preferable the error, because when everything
is okay it is not very useful actually.

I don't understand, how is pg_stat_statements helpful here, and what error?

[1]
http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html

max_wal_senders: 5
wal_keep_segments: 10000
wal_sender_timeout: 1m
synchronous_standby_names: n/a
vacuum_defer_cleanup_age: 0
max_standby_archive_delay: 30s
max_standby_streaming_delay: -1
wal_receiver_status_interval: 10s
hot_standby_feedback: on
wal_receiver_timeout: 1m

#14Sergey Konoplev
gray.ru@gmail.com
In reply to: Joe Van Dyk (#13)
Re: Replication failed after stalling

On Mon, Dec 30, 2013 at 10:05 PM, Joe Van Dyk <joe@tanga.com> wrote:

I meant all the replication settings, see [1]. And pg_stat_statements
when there is a problem, preferable the error, because when everything
is okay it is not very useful actually.

I don't understand, how is pg_stat_statements helpful here, and what error?

The error you showed in the initial email.

My guess is that the master might stop sending WAL records to the
replica, that is why I wanted to check the stat_replication query. Oh,
yes, and I forget to put current_xlog_location in the query. So, the
correct one is below.

\x
select pg_current_xlog_location(), * from pg_stat_replication;

checkpoint_completion_target: 0.9
checkpoint_segments: 16
checkpoint_timeout: 5m
checkpoint_warning: 30s

[...]

max_wal_senders: 5
wal_keep_segments: 10000
vacuum_defer_cleanup_age: 0
max_standby_archive_delay: 30s
max_standby_streaming_delay: -1
wal_receiver_status_interval: 10s
hot_standby_feedback: on

[...]

That 10000 looks weird and I would increase checkpoint_segments and
checkpoint_timeout, but first let us check how often checkpoints and
checkpoint warnings happen on master. You can see it in logs. Turn
log_checkpoints on if it is off.

And also how many WAL your system generates and for what period.

ls -lt /path/to/pg_xlog/ | wc -l
ls -lt /path/to/pg_xlog/ | head
ls -lt /path/to/pg_xlog/ | tail

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Sergey Konoplev (#9)
Re: Replication failed after stalling

Sergey Konoplev wrote:

On Mon, Dec 30, 2013 at 12:27 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

Joe Van Dyk wrote:

If I run "COPY (select * from complicate_view) to stdout" on the standby, I've noticed that sometimes
halts replication updates to the slave.

For example, that's happening right now and "now() - pg_last_xact_replay_timestamp()" is 22 minutes.
There's many transactions per second being committed on the master. Once that query is canceled, the
slave catches up immediately.

You have hot_standby_feedback = on, right?

In that case that is expected behaviour.
Some change on the master conflicted with the query on the standby,
perhaps with a tuple cleaned up after a HOT update. Replication will
stall until the query is done.

IIRC, the applying process is paused but the receiving one is going on
in this case, isn't it?

Frankly, I don't know.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general