Wal files - Question | Postgres 9.2

Started by Patrick Bover 9 years ago13 messagesgeneral

patrickbakerbr@gmail.com

over 9 years ago

Hi guys,

I currently have a slave02 server that is replicating from another slave01
via Cascading replication. The master01 server is shipping wal_files (via
ssh) to both slaves.

I'm doing some tests on slave02 to test the recovery via wal_files... The
goal here is to stop postgres, wait few minutes, start postgres again,
watch it recovering from wal_files, once it's done see the streaming
replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the
wal_files are being successfully recovered

However, when doing step3 I get these messages:

cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or
directory
LOG: consistent recovery state reached at 1AF8/AB629F90
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

still on slave01: *Sometimes the log_delay time is bigger.. sometimes is
lower*

SELECT CASE WHEN pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() -
pg_last_xact_replay_timestamp()) END AS log_delay;

log_delay

-----------

0.386863

On master01:

select * from pg_current_xlog_location();

pg_current_xlog_location

--------------------------

1AF8/D3F47A80

*QUESTION:*

So.. I just wanna understand what's the risk of those errors... what's
happening?
*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y - Means it didn't find the file. However, the file exists on the
Master, but it didn't start shipping yet. What are the consequences of that?

Cheers
Patrick

Venkata B Nagothi

nag1010@gmail.com

over 9 years ago

In reply to: Patrick B (#1)

Re: Wal files - Question | Postgres 9.2

On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

Hi guys,

I currently have a slave02 server that is replicating from another slave01
via Cascading replication. The master01 server is shipping wal_files (via
ssh) to both slaves.

I'm doing some tests on slave02 to test the recovery via wal_files... The
goal here is to stop postgres, wait few minutes, start postgres again,
watch it recovering from wal_files, once it's done see the streaming
replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the
wal_files are being successfully recovered

However, when doing step3 I get these messages:

cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or
directory
LOG: consistent recovery state reached at 1AF8/AB629F90
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

still on slave01: *Sometimes the log_delay time is bigger.. sometimes is
lower*

SELECT CASE WHEN pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() -
pg_last_xact_replay_timestamp()) END AS log_delay;

log_delay

-----------

0.386863

On master01:

select * from pg_current_xlog_location();

pg_current_xlog_location

--------------------------

1AF8/D3F47A80

*QUESTION:*

So.. I just wanna understand what's the risk of those errors... what's
happening?
*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y - Means it didn't find the file. However, the file exists on
the Master, but it didn't start shipping yet. What are the consequences of
that?

That is just saying that the slave cannot find the WAL file. That should
not be of big importance. Eventually, that will vanish when the log file
gets shipped from the master. Also "cp: cannot stat." errors have been been
fixed in 9.3 i believe.

Regards,

Venkata B N
Database Consultant

Fujitsu Australia

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: Venkata B Nagothi (#2)

Re: Wal files - Question | Postgres 9.2

2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:

On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

Hi guys,

I currently have a slave02 server that is replicating from another
slave01 via Cascading replication. The master01 server is shipping
wal_files (via ssh) to both slaves.

I'm doing some tests on slave02 to test the recovery via wal_files... The
goal here is to stop postgres, wait few minutes, start postgres again,
watch it recovering from wal_files, once it's done see the streaming
replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the
wal_files are being successfully recovered

However, when doing step3 I get these messages:

cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or
directory
LOG: consistent recovery state reached at 1AF8/AB629F90
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

still on slave01: *Sometimes the log_delay time is bigger.. sometimes is
lower*

SELECT CASE WHEN pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() -
pg_last_xact_replay_timestamp()) END AS log_delay;

log_delay

-----------

0.386863

On master01:

select * from pg_current_xlog_location();

pg_current_xlog_location

--------------------------

1AF8/D3F47A80

*QUESTION:*

So.. I just wanna understand what's the risk of those errors... what's
happening?
*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y - Means it didn't find the file. However, the file exists on
the Master, but it didn't start shipping yet. What are the consequences of
that?

That is just saying that the slave cannot find the WAL file. That should
not be of big importance. Eventually, that will vanish when the log file
gets shipped from the master. Also "cp: cannot stat." errors have been been
fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean
that the DB isn't up-to-date?
Otherwise what's the purpose of the wal_files if not be responsible to
contain the essential data to the DB be up-to-date?

Thanks!

Venkata B Nagothi

nag1010@gmail.com

over 9 years ago

In reply to: Patrick B (#3)

Re: Wal files - Question | Postgres 9.2

On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:

On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

Hi guys,

I currently have a slave02 server that is replicating from another
slave01 via Cascading replication. The master01 server is shipping
wal_files (via ssh) to both slaves.

I'm doing some tests on slave02 to test the recovery via wal_files...
The goal here is to stop postgres, wait few minutes, start postgres again,
watch it recovering from wal_files, once it's done see the streaming
replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the
wal_files are being successfully recovered

However, when doing step3 I get these messages:

cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or
directory
LOG: consistent recovery state reached at 1AF8/AB629F90
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

still on slave01: *Sometimes the log_delay time is bigger.. sometimes
is lower*

SELECT CASE WHEN pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() -
pg_last_xact_replay_timestamp()) END AS log_delay;

log_delay

-----------

0.386863

On master01:

select * from pg_current_xlog_location();

pg_current_xlog_location

--------------------------

1AF8/D3F47A80

*QUESTION:*

So.. I just wanna understand what's the risk of those errors... what's
happening?
*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y - Means it didn't find the file. However, the file exists on
the Master, but it didn't start shipping yet. What are the consequences of
that?

That is just saying that the slave cannot find the WAL file. That should
not be of big importance. Eventually, that will vanish when the log file
gets shipped from the master. Also "cp: cannot stat." errors have been been
fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean
that the DB isn't up-to-date?

Not necessarily. Standby periodically checks if the WAL file it is looking
for is available at restore_command location and generates that message if
the file is not available. These messages are not of any harm.

Below link might help you :

/messages/by-id/4DDC9515.2000003@enterprisedb.com

Regards,
Venkata B N
Database Consultant

Fujitsu Australia

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: Venkata B Nagothi (#4)

Re: Wal files - Question | Postgres 9.2

2016-11-23 16:18 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:

On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:

On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

Hi guys,

I currently have a slave02 server that is replicating from another
slave01 via Cascading replication. The master01 server is shipping
wal_files (via ssh) to both slaves.

I'm doing some tests on slave02 to test the recovery via wal_files...
The goal here is to stop postgres, wait few minutes, start postgres again,
watch it recovering from wal_files, once it's done see the streaming
replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the
wal_files are being successfully recovered

However, when doing step3 I get these messages:

cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or
directory
LOG: consistent recovery state reached at 1AF8/AB629F90
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

still on slave01: *Sometimes the log_delay time is bigger.. sometimes
is lower*

SELECT CASE WHEN pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() -
pg_last_xact_replay_timestamp()) END AS log_delay;

log_delay

-----------

0.386863

On master01:

select * from pg_current_xlog_location();

pg_current_xlog_location

--------------------------

1AF8/D3F47A80

*QUESTION:*

So.. I just wanna understand what's the risk of those errors... what's
happening?
*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y - Means it didn't find the file. However, the file exists
on the Master, but it didn't start shipping yet. What are the consequences
of that?

That is just saying that the slave cannot find the WAL file. That should
not be of big importance. Eventually, that will vanish when the log file
gets shipped from the master. Also "cp: cannot stat." errors have been been
fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean
that the DB isn't up-to-date?

Not necessarily. Standby periodically checks if the WAL file it is looking
for is available at restore_command location and generates that message if
the file is not available. These messages are not of any harm.

Below link might help you :

/messages/by-id/4DDC9515.2000003@enterprisedb.com

hmm.. I really don't get it.

If I get messages like:

*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y

In my head, it's saying that it was unable to recover that file and,
because of that, there is missing data.
Even if the server is able to connect to tbe master via streaming
replication, there might be missing data. There might be data that is into
master but not into slave.

Am I wrong? If so, I don't understand why are the wal_files for, then.

Patrick

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Patrick B (#5)

Re: Wal files - Question | Postgres 9.2

On Fri, Nov 25, 2016 at 10:05 AM, Patrick B <patrickbakerbr@gmail.com> wrote:

If I get messages like:

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
directory

In my head, it's saying that it was unable to recover that file and, because
of that, there is missing data.
Even if the server is able to connect to tbe master via streaming
replication, there might be missing data. There might be data that is into
master but not into slave.

Am I wrong? If so, I don't understand why are the wal_files for, then.

In order, a startup process will look for a WAL segments in its local
pg_xlog, then using restore_command, and finally via streaming
(XLogPageRead in xlog.c). If streaming fails, because for example the
segment requested has already been recycled, then startup process
switches back to a lookup in archive/pg_xlog. If you happen to have
repetitive failures with restore_command, you are missing a segment
and the standby would not be able to replay further. If the segment is
indeed missing, a new base backup is the way to go. A standby without
the WAL segments to replay is useless except if you plan to use PITR
and restore to a position before the last segment available in its
timeline history.
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David Steele

david@pgmasters.net

over 9 years ago

In reply to: Patrick B (#5)

Re: Wal files - Question | Postgres 9.2

On 11/24/16 8:05 PM, Patrick B wrote:

hmm.. I really don't get it.

If I get messages like:

*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y

In my head, it's saying that it was unable to recover that file and,
because of that, there is missing data.
Even if the server is able to connect to tbe master via streaming
replication, there might be missing data. There might be data that is
into master but not into slave.

Am I wrong? If so, I don't understand why are the wal_files for, then.

This is normal as far as it goes. Postgres keeps checking for the WAL
file that it needs to become consistent and cp keeps throwing errors
when it can't find the file (yet).

This simply demonstrates how unsuitable cp is as a recovery command. I
recommend you look at doing backup/archive with a professional tool such
as pgBackRest (http://www.pgbackrest.org) or Barman
(http://www.pgbarman.org/).

--
-David
david@pgmasters.net

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: David Steele (#7)

Re: Wal files - Question | Postgres 9.2

2016-11-29 15:21 GMT+13:00 David Steele <david@pgmasters.net>:

On 11/24/16 8:05 PM, Patrick B wrote:

hmm.. I really don't get it.

If I get messages like:

*cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
director*y

In my head, it's saying that it was unable to recover that file and,
because of that, there is missing data.
Even if the server is able to connect to tbe master via streaming
replication, there might be missing data. There might be data that is
into master but not into slave.

Am I wrong? If so, I don't understand why are the wal_files for, then.

This is normal as far as it goes. Postgres keeps checking for the WAL
file that it needs to become consistent and cp keeps throwing errors
when it can't find the file (yet).

This simply demonstrates how unsuitable cp is as a recovery command. I
recommend you look at doing backup/archive with a professional tool such
as pgBackRest (http://www.pgbackrest.org) or Barman
(http://www.pgbarman.org/).

Thanks for all the replies guys.

Related to the "cp" - That's fine. I'm just doing this in a test
environment. On my prod servers I use wal-e(aws s3) and bash
script(locally) to do this work.

I've done some tests, that only gave me more questions. I'll put everything
I done here and hopefully you guys will be able to help me to understand it.

1. Set up a pg-test-01 machine on Ubuntu 16 Server
2. installed Postgres 9.2 + contribs + libs + etc
3. Configured postgres to be a streaming replication slave (postgresql.conf
and recovery.conf)
4. Configured my masterdb01 server to ship wal_files to the pg-test-01
server
5. Made a pg_basebackup on my master DB server (DB is 1TB big)
6. Recovered the basebackup on the pg-test-01 machine
7. Postgres started to recover the wal_files and then streaming replication
worked:

LOG: streaming replication successfully connected to primary

8. Stopped postgres on the pg-test-01 server for 10 minutes
9. Started postgres on the pg-test-01 to see if the wal_files would be
recovered

LOG: restored log file "0000000200001B2F0000004F" from archive

LOG: restored log file "0000000200001B2F00000050" from archive

LOG: restored log file "0000000200001B2F00000051" from archive

LOG: restored log file "0000000200001B2F00000052" from archive

10. Removed all the wal_files from pg-test-01 (No wal_files are being
shipped to this server anymore)
11. Stopped postgres for 2 hours
12. Started postgres as streaming replication

cp: cannot stat 'archive/0000000200002B36000000BB': No such file or

directory
cp: cannot stat 'archive/0000000200002B36000000BC': No such file or
directory
cp: cannot stat 'archive/0000000200002B36000000BD': No such file or
directory
LOG: streaming replication successfully connected to primary

Postgres was stopped for 2h. How can postgres connect using streaming
replication, if no wal_files were in there???

Hos is that even possible?? I don't understand!

Thanks
Patrick

David G. Johnston

david.g.johnston@gmail.com

over 9 years ago

In reply to: Patrick B (#8)

Re: Wal files - Question | Postgres 9.2

On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

Ho
[w]
is that even possible?? I don't understand!

https://www.postgresql.org/docs/9.2/static/warm-standby.html
"""

If you use streaming replication without file-based continuous archiving,
you have to set wal_keep_segments in the master to a value high enough to
ensure that old WAL segments are not recycled too early, while the standby
might still need them to catch up. If the standby falls behind too much, it
needs to be reinitialized from a new base backup. If you set up a WAL
archive that's accessible from the standby, wal_keep_segments is not
required as the standby can always use the archive to catch up.
"""

Basically you did just that when you destroyed the archive. Apparently the
master doesn't churn through WAL quickly enough to have had to discard the
segments from the prior two hours.

David J.

#10

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: David G. Johnston (#9)

Re: Wal files - Question | Postgres 9.2

2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:

On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

Ho
[w]
is that even possible?? I don't understand!

https://www.postgresql.org/docs/9.2/static/warm-standby.html
"""

If you use streaming replication without file-based continuous archiving,
you have to set wal_keep_segments in the master to a value high enough to
ensure that old WAL segments are not recycled too early, while the standby
might still need them to catch up. If the standby falls behind too much, it
needs to be reinitialized from a new base backup. If you set up a WAL
archive that's accessible from the standby, wal_keep_segments is not
required as the standby can always use the archive to catch up.
"""

Basically you did just that when you destroyed the archive. Apparently
the master doesn't churn through WAL quickly enough to have had to discard
the segments from the prior two hours.

David J.

That was really helpful! Thanks David!

Patrick

#11

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: Patrick B (#10)

Re: Wal files - Question | Postgres 9.2

2016-11-29 23:59 GMT+13:00 Patrick B <patrickbakerbr@gmail.com>:

2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:

On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

Ho
[w]
is that even possible?? I don't understand!

https://www.postgresql.org/docs/9.2/static/warm-standby.html
"""

If you use streaming replication without file-based continuous archiving,
you have to set wal_keep_segments in the master to a value high enough to
ensure that old WAL segments are not recycled too early, while the standby
might still need them to catch up. If the standby falls behind too much, it
needs to be reinitialized from a new base backup. If you set up a WAL
archive that's accessible from the standby, wal_keep_segments is not
required as the standby can always use the archive to catch up.
"""

Basically you did just that when you destroyed the archive. Apparently
the master doesn't churn through WAL quickly enough to have had to discard
the segments from the prior two hours.

David J.

That was really helpful! Thanks David!

Patrick

Hey guys,

https://www.postgresql.org/docs/9.2/static/runtime-config-replication.html

wal_keep_segments is the parameter responsible for streaming replication be
able to recover itself without using wal_files, is that right?

#12

David G. Johnston

david.g.johnston@gmail.com

over 9 years ago

In reply to: Patrick B (#11)

Re: Wal files - Question | Postgres 9.2

On Wed, Nov 30, 2016 at 6:05 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

https://www.postgresql.org/docs/9.2/static/runtime-config-replication.html

wal_keep_segments is the parameter responsible for streaming replication
be able to recover itself without using wal_files, is that right?

[...] without using wal_files[sic] [from an external archive location]

David J.

#13

Patrick B

patrickbakerbr@gmail.com

over 9 years ago

In reply to: David G. Johnston (#12)

Re: Wal files - Question | Postgres 9.2

2016-12-01 14:15 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:

On Wed, Nov 30, 2016 at 6:05 PM, Patrick B <patrickbakerbr@gmail.com>
wrote:

https://www.postgresql.org/docs/9.2/static/runtime-config-
replication.html

wal_keep_segments is the parameter responsible for streaming replication
be able to recover itself without using wal_files, is that right?

[...] without using wal_files[sic] [from an external archive location]

David J.

Yep David! Thanks for correcting me.

Got it now. Thanks everyone.

one more question... I'm working with wal_keep_segments = 1000 - Would
that in your opinion be a safe number?

Yes. I got plenty of free space and I'm using pg_xlogs in a different
partition on the master.

Patrick