WAL replay issue from 9.6.8 to 9.6.10
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:
WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages
This is after restoring from a snapshot that completed successfully as far
as I can tell.
I downgraded the test server to 9.6.8 and I was able to replay WAL past the
point where 9.6.10 would PANIC and crash. Curiously, 9.6.8 PANICed the
first time, but then when I restarted it was able to replay through.
Should I expect this to just not work until the primary database is
upgraded to 9.6.10? I'm a little wary of upgrading the primary at this point
until I understand what might be going on.
thanks,
dave
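For readers reproducing this setup: a snapshot restore that then replays archived WAL is typically driven by a 9.6-era recovery.conf along these lines. This is a sketch only; the archive path and settings below are illustrative assumptions, not taken from Dave's report.

```
# Sketch of a recovery.conf for replaying archived WAL after restoring
# a base snapshot (PostgreSQL 9.6-era). The archive path is a placeholder.
restore_command = 'cp /path/to/wal_archive/%f "%p"'
# Keep applying WAL as it arrives, and accept read-only connections once
# a consistent point is reached (requires hot_standby = on).
standby_mode = 'on'
recovery_target_timeline = 'latest'
```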
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
Can you tell us the wal-level?
Regards, Andreas
--
2ndQuadrant - The PostgreSQL Support Company
On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer <andreas@a-kretschmer.de>
wrote:
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
Can you tell us the wal-level?
Yes, it is "hot_standby".
On 29 August 2018 06:26:06 CEST, Dave Peticolas <dave@krondo.com> wrote:
On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer <andreas@a-kretschmer.de> wrote:
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
Can you tell us the wal-level?
Yes, it is "hot_standby".
OK, this was just to rule out a problem we discovered with logical decoding, but that isn't the case here.
Regards, Andreas
--
2ndQuadrant - The PostgreSQL Support Company
Hi,
2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:
WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages
it looks like you are hitting pretty much the same problem as I am:
/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
The only major difference is that you are restoring from a backup, while in
my case the host running the replica crashed.
Also in my case, the primary was already running 9.6.10.
In my case, it also panics during "Btree/DELETE: XYZ items", and the page
number of the relation is insanely large.
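As a quick sanity check on why that page number is "insanely large": with PostgreSQL's default 8 kB block size, a relation genuinely containing block 1209270272 would be roughly 9 TiB, which strongly suggests the reference is bogus rather than a real page. A small illustrative calculation (not from the thread):

```python
# Default PostgreSQL block size is 8 kB (BLCKSZ = 8192 bytes).
BLOCK_SIZE = 8192

# Block number from the WARNING in this thread:
# "page 1209270272 of relation base/16422/47496599 does not exist"
page_no = 1209270272

implied_bytes = page_no * BLOCK_SIZE
print(implied_bytes)            # 9906342068224 bytes
print(implied_bytes // 2**40)   # 9 -- about 9 TiB
```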
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:
2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:
WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages
it looks like you are hitting pretty much the same problem as I am:
/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
The only major difference is that you are restoring from a backup, while in
my case the host running the replica crashed.
Also in my case, the primary was already running 9.6.10.
In my case, it also panics during "Btree/DELETE: XYZ items", and the page
number of the relation is insanely large.
That would be the same problem. Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?
--
Michael
On Wed, Aug 29, 2018 at 4:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:
2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:
WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages
it looks like you are hitting pretty much the same problem as I am:
/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
The only major difference is that you are restoring from a backup, while in
my case the host running the replica crashed.
Also in my case, the primary was already running 9.6.10.
In my case, it also panics during "Btree/DELETE: XYZ items", and the page
number of the relation is insanely large.
That would be the same problem. Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?
Oh, perhaps I do, depending on what you mean by worker. There are a couple
of periodic processes that connect to the server to obtain metrics. Is that
what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.
I have noticed this behavior in the past but prior to 9.6.10 restarting the
server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.
dave
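One way to implement "suspend them until the replay has reached the desired point" is to poll the standby's replay position and compare it with the LSN from the PANIC context. An LSN such as 4810/C84F8A0 is two hexadecimal halves of a 64-bit WAL position, so the comparison is simple arithmetic. A minimal sketch, assuming the current position is obtained elsewhere (on 9.6, e.g. from pg_last_xlog_replay_location()); the helper names are illustrative, not from the thread:

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN like '4810/C84F8A0' to a 64-bit integer.

    The text form is two hexadecimal fields, the high 32 bits and the
    low 32 bits of the WAL position, separated by '/'.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_reached(current_lsn, target_lsn):
    """True once replay has reached or passed the target point, e.g. the
    record from the PANIC message in this thread."""
    return parse_lsn(current_lsn) >= parse_lsn(target_lsn)


if __name__ == "__main__":
    # The LSN from the PANIC context line in this thread.
    target = "4810/C84F8A0"
    print(replay_reached("4810/C84F8A0", target))  # True
    print(replay_reached("4810/C84F000", target))  # False
    print(replay_reached("4811/0", target))        # True
```

The metrics collectors could then stay suspended while replay_reached() is False instead of connecting and bailing.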
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
Oh, perhaps I do, depending on what you mean by worker. There are a couple
of periodic processes that connect to the server to obtain metrics. Is that
what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.
That would be it. How do you decide when those begin to run and connect
to Postgres? Do you use pg_isready or something similar in a loop as a
sanity check?
I have noticed this behavior in the past but prior to 9.6.10 restarting the
server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.
You are piquing my interest here. Did you actually see the same
problem? In 9.6.10, I have tightened the consistency-point checks and logic
so that inconsistent-page issues actually show up when they should, and
become reproducible, so that we can track down any rogue WAL record or
inconsistent behavior.
--
Michael
On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
Oh, perhaps I do, depending on what you mean by worker. There are a couple
of periodic processes that connect to the server to obtain metrics. Is that
what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.
That would be it. How do you decide when those begin to run and connect
to Postgres? Do you use pg_isready or something similar in a loop as a
sanity check?
I do not, they just try to connect and bail if they cannot.
I have noticed this behavior in the past but prior to 9.6.10 restarting the
server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.
You are piquing my interest here. Did you actually see the same
problem? In 9.6.10, I have tightened the consistency-point checks and logic
so that inconsistent-page issues actually show up when they should, and
become reproducible, so that we can track down any rogue WAL record or
inconsistent behavior.
Yes, I've seen this problem occasionally in the past, I think only in the
9.6 series. But before 9.6.10, if I restarted the server it would start
replaying WAL again, and typically, when it reached the point where it
had PANICed before, it would instead report a consistent state and allow
read-only connections. Sometimes it would then PANIC again after more WAL
was replayed, but eventually it would reach a point past which it seemed
able to replay WAL indefinitely without the issue recurring.
dave