requested timeline doesn't contain minimum recovery point

Started by Tom DalPozzo, over 9 years ago, 6 messages, general
#1 Tom DalPozzo
t.dalpozzo@gmail.com

Hi,
there is something happening in my replication setup that is not clear to me. I
think I'm missing something.
I have two servers, red and blue.
red is the primary and blue is the standby, with asynchronous replication.
Now:
1 cleanly stop red
2 promote blue
3 insert tuples in blue
4 on the red host, run pg_rewind from blue into red's data directory
5 start red as standby -> OK
6 wait a long time and then cleanly stop blue
7 promote red
8 insert tuples in red
9 on the blue host, run pg_rewind from red into blue's data directory
10 start blue as standby -> I get "requested timeline 3 doesn't contain
minimum recovery point 1/... on timeline 1"

Sometimes this "switching game" works up to timeline 4 or 5; it does not always fail at timeline 3.

Regards
Pupillo

#2 Michael Paquier
michael@paquier.xyz
In reply to: Tom DalPozzo (#1)
Re: requested timeline doesn't contain minimum recovery point

On Fri, Jan 6, 2017 at 1:01 AM, Tom DalPozzo <t.dalpozzo@gmail.com> wrote:

> Hi,
> there is something happening in my replication setup that is not clear to me. I
> think I'm missing something.
> I have two servers, red and blue.
> red is the primary and blue is the standby, with asynchronous replication.
> Now:
> 1 cleanly stop red
> 2 promote blue
> 3 insert tuples in blue
> 4 on the red host, run pg_rewind from blue into red's data directory
> 5 start red as standby -> OK
> 6 wait a long time and then cleanly stop blue
> 7 promote red
> 8 insert tuples in red
> 9 on the blue host, run pg_rewind from red into blue's data directory
> 10 start blue as standby -> I get "requested timeline 3 doesn't contain
> minimum recovery point 1/... on timeline 1"
>
> Sometimes this "switching game" works up to timeline 4 or 5; it does not always fail at timeline 3.

Could you give more details? What does pg_rewind tell you at each
phase? Is that on Postgres 9.5 or 9.6? I use pg_rewind quite
extensively on 9.5 and I have had no problems of this kind with multiple
timeline jumps when juggling between two nodes. Another thing that
comes to mind: you are using pg_rewind with a source node that is
running. You should issue a checkpoint manually after promoting the
node to be sure that its control file gets the new timeline number.
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
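
Editor's sketch: the advice above can be laid out as an ordered command list for one role switch, with the manual checkpoint placed between the promotion and pg_rewind. This is only an illustration under stated assumptions, not a tested procedure: the function name, host labels, paths and connection string are hypothetical, and the function only builds the commands without running them.

```python
from typing import List, Tuple

Command = Tuple[str, List[str]]  # (host the command runs on, argv)


def switchover_commands(
    old_primary_host: str,
    old_primary_datadir: str,
    new_primary_host: str,
    new_primary_datadir: str,
    new_primary_conninfo: str,
) -> List[Command]:
    """Build (but do not run) the commands for one role switch."""
    return [
        # 1. Cleanly stop the current primary.
        (old_primary_host, ["pg_ctl", "stop", "-D", old_primary_datadir, "-m", "fast"]),
        # 2. Promote the standby; it switches to a new timeline.
        (new_primary_host, ["pg_ctl", "promote", "-D", new_primary_datadir]),
        # 3. Manual checkpoint, so the control file records the new timeline
        #    before pg_rewind reads it from the running source.
        (new_primary_host, ["psql", "-d", new_primary_conninfo, "-c", "CHECKPOINT"]),
        # 4. Rewind the stopped old primary against the new primary.
        (
            old_primary_host,
            [
                "pg_rewind",
                "--target-pgdata=" + old_primary_datadir,
                "--source-server=" + new_primary_conninfo,
            ],
        ),
        # 5. Start the old primary again; its standby configuration is
        #    assumed to be in place already.
        (old_primary_host, ["pg_ctl", "start", "-D", old_primary_datadir]),
    ]


if __name__ == "__main__":
    plan = switchover_commands(
        "red", "/path/to/red/data", "blue", "/path/to/blue/data",
        "host=blue dbname=postgres",
    )
    for host, argv in plan:
        print(host, " ".join(argv))
```

Promotion is not instantaneous, so before step 3 it may be worth waiting until `SELECT pg_is_in_recovery()` returns false on the promoted node.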

#3 Tom DalPozzo
t.dalpozzo@gmail.com
In reply to: Michael Paquier (#2)
Re: requested timeline doesn't contain minimum recovery point

2017-01-06 13:09 GMT+01:00 Michael Paquier <michael.paquier@gmail.com>:

> Could you give more details? What does pg_rewind tell you at each
> phase? Is that on Postgres 9.5 or 9.6? I use pg_rewind quite
> extensively on 9.5 and I have had no problems of this kind with multiple
> timeline jumps when juggling between two nodes. Another thing that
> comes to mind: you are using pg_rewind with a source node that is
> running. You should issue a checkpoint manually after promoting the
> node to be sure that its control file gets the new timeline number.
> --
> Michael

Hi,

Sometimes pg_rewind says that nothing needs to be done; sometimes it says
it is rewinding and reports "done" at the end.
I'm using 9.6. I moved there from 9.5 because I'm also using replication slots
and in 9.6 there is a second parameter added. I seem to remember that
it behaved the same way in 9.5 too, but I'm not really sure.
I checked that the server, at promotion, logged the message about the new
timeline.
I will run some more tests.
Regards
Pupillo
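
Editor's sketch: the promotion log message shows that the server switched timelines, but not whether the control file has recorded it yet, which is what the checkpoint suggestion in #2 is about. One way to look at the control file is the output of `pg_controldata`. The helper below is a minimal parser for that output; the function names are hypothetical and the sample values are made up for illustration.

```python
from typing import Dict


def parse_controldata(output: str) -> Dict[str, str]:
    """Split pg_controldata's 'label: value' lines into a dict."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def checkpoint_timeline(output: str) -> int:
    """Timeline of the latest checkpoint recorded in the control file."""
    return int(parse_controldata(output)["Latest checkpoint's TimeLineID"])


def control_file_has_timeline(output: str, expected: int) -> bool:
    """True if the control file already shows the expected timeline."""
    return checkpoint_timeline(output) == expected


# Illustrative excerpt only; the values are invented.
SAMPLE = """\
Database cluster state:               in production
Latest checkpoint location:           1/2A000028
Latest checkpoint's TimeLineID:       2
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
"""

if __name__ == "__main__":
    # A node just promoted to timeline 3 whose control file still says 2
    # has not completed a checkpoint on the new timeline yet.
    print(control_file_has_timeline(SAMPLE, 3))
```

If the check returns false right after a promotion, that would be consistent with pg_rewind sometimes reporting that nothing needs to be done, although the thread does not confirm this explanation.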

#4 Tom DalPozzo
t.dalpozzo@gmail.com
In reply to: Tom DalPozzo (#3)
Re: requested timeline doesn't contain minimum recovery point

Hi!
I redid the tests following your suggestion to issue a checkpoint manually.
IT WORKS!
Just a question: when the standby server starts, I see error messages in
the log (e.g. "invalid record length...") when the end of the WAL is reached.
I know that this is normal.
But I'm wondering whether the system, in order to detect the end of the WAL,
checks only the validity of the records in the WAL.
I mean, could random bytes appear as a valid record (very unlikely, but
possible)?
Thanks
Pupillo

#5 Michael Paquier
michael@paquier.xyz
In reply to: Tom DalPozzo (#4)
Re: requested timeline doesn't contain minimum recovery point

On Tue, Jan 10, 2017 at 10:35 PM, Tom DalPozzo <t.dalpozzo@gmail.com> wrote:

> I redid the tests following your suggestion to issue a checkpoint manually.
> IT WORKS!
> Just a question: when the standby server starts, I see error messages in
> the log (e.g. "invalid record length...") when the end of the WAL is reached.
> I know that this is normal.
> But I'm wondering whether the system, in order to detect the end of the WAL,
> checks only the validity of the records in the WAL.

You may want to look at xlogreader.c and track report_invalid_record()
to see which error checks are performed. There are no full checks
specific to each record type, but there are some checks on the
backup blocks, the record size, etc.

> I mean, could random bytes appear as a valid record (very unlikely, but
> possible)?

Yes, that could be possible if some memory or disk is broken. That's
why, while it is important to take backups, it is more important to
make sure that they are able to restore correctly before deploying
them.
--
Michael
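
Editor's sketch: a toy model of the kind of checks discussed above, to show why stale or random bytes are usually rejected at the end of the WAL. It is not PostgreSQL's record format. The 8-byte header layout and function names are hypothetical, and `zlib.crc32` stands in for the CRC-32C that PostgreSQL uses for WAL records; the real checks live in xlogreader.c.

```python
import struct
import zlib
from typing import Tuple

# Hypothetical header: total record length, then a CRC of the payload.
HEADER = struct.Struct("<II")


def make_record(payload: bytes) -> bytes:
    """Serialize a payload as header + payload."""
    total_len = HEADER.size + len(payload)
    return HEADER.pack(total_len, zlib.crc32(payload)) + payload


def validate_record(buf: bytes, offset: int = 0) -> Tuple[bool, str]:
    """Return (ok, reason) for the record starting at offset."""
    if len(buf) - offset < HEADER.size:
        return False, "truncated header"
    total_len, crc = HEADER.unpack_from(buf, offset)
    # A zero-filled region fails here: its length field is 0.
    if total_len < HEADER.size:
        return False, "invalid record length"
    if total_len > len(buf) - offset:
        return False, "record extends past end of data"
    payload = buf[offset + HEADER.size : offset + total_len]
    # Random or stale bytes must also match the checksum to be accepted.
    if zlib.crc32(payload) != crc:
        return False, "incorrect checksum"
    return True, "ok"


if __name__ == "__main__":
    record = make_record(b"insert tuple")
    damaged = record[:-1] + bytes([record[-1] ^ 0xFF])
    print(validate_record(record))
    print(validate_record(damaged))
    print(validate_record(bytes(32)))
```

In this toy model, arbitrary bytes with a plausible length field still pass the checksum test with a chance of roughly 1 in 2^32, which matches the "very unlikely, but possible" wording in the question.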

#6 Tom DalPozzo
t.dalpozzo@gmail.com
In reply to: Michael Paquier (#5)
Re: requested timeline doesn't contain minimum recovery point

>> I mean, could random bytes appear as a valid record (very unlikely, but
>> possible)?
>
> Yes, that could be possible if some memory or disk is broken. That's
> why, while it is important to take backups, it is more important to
> make sure that they are able to restore correctly before deploying
> them.
> --
> Michael

Hi,
of course, nothing can be 100% safe against memory or disk corruption.
But, excluding these cases, can there be situations in which the WAL reader
gets confused?
I'm thinking of WAL segment recycling: when a WAL segment is recycled, it is
not overwritten with anything (zeroes...), right?
If I'm right, then there are still old records in the segment. If they're
aligned with the new offsets, I guess that the system can tell that
they're older (by looking at some ID) and not valid, but if they're not
aligned, there could be an unlucky and unlikely issue.

In other words, excluding HW problems and possible bugs, I'd like
to know if the logic underneath WAL reading at startup is 100% safe.

Regards
Pupillo