WAL replay issue from 9.6.8 to 9.6.10

Started by Dave Peticolasover 7 years ago9 messagesgeneral
Jump to latest
#1Dave Peticolas
dave@krondo.com

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:

WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages

This is after restoring from a snapshot that completed successfully as far
as I can tell.

I downgraded the test server to 9.6.8 and I was able to replay WAL past the
point where 9.6.10 would PANIC and crash. Curiously, 9.6.8 PANICed the
first time, but then when I restarted it was able to replay through.

Should I expect this to just not work until the primary database is
upgraded to 9.6.10? I'm a little wary to upgrade the primary at this point
until I understand what might be going on.

thanks,
dave

#2Andreas Kretschmer
andreas@a-kretschmer.de
In reply to: Dave Peticolas (#1)
Re: WAL replay issue from 9.6.8 to 9.6.10

On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably

Can you tell us the wal-level?

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company

#3Dave Peticolas
dave@krondo.com
In reply to: Andreas Kretschmer (#2)
Re: WAL replay issue from 9.6.8 to 9.6.10

On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer <andreas@a-kretschmer.de>
wrote:

On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably

Can you tell us the wal-level?

Yes, it is "hot_standby".

#4Andreas Kretschmer
andreas@a-kretschmer.de
In reply to: Dave Peticolas (#3)
Re: WAL replay issue from 9.6.8 to 9.6.10

On 29 August 2018 06:26:06 CEST, Dave Peticolas <dave@krondo.com> wrote:

On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer
<andreas@a-kretschmer.de>
wrote:

On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com>

wrote:

Hello, I'm seeing some issues with WAL replay on a test server

running

9.6.10 using WAL archived from a 9.6.8 primary server. It reliably

Can you tell us the wal-level?

Yes, it is "hot_standby".

ok, this was to just rule out a problem that we discovered with logical decoding, but this isn't the case here.

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company

#5Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Dave Peticolas (#1)
Re: WAL replay issue from 9.6.8 to 9.6.10

Hi,

2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:

WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages

it looks like you are hitting pretty much the same problem as I:
/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
The only major difference, you are restoring from the backup, while in
my case the host running replica has crashed.
Also in my case, the primary was already running 9.6.10.

In my case, it also panics during "Btree/DELETE: XYZ items" and page
number of relation is insanely huge.

#6Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#5)
Re: WAL replay issue from 9.6.8 to 9.6.10

On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:

2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
during replay with messages like so:

WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages

it looks like you are hitting pretty much the same problem as I:
/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
The only major difference, you are restoring from the backup, while in
my case the host running replica has crashed.
Also in my case, the primary was already running 9.6.10.

In my case, it also panics during "Btree/DELETE: XYZ items" and page
number of relation is insanely huge.

That would be the same problem. Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?
--
Michael

#7Dave Peticolas
dave@krondo.com
In reply to: Michael Paquier (#6)
Re: WAL replay issue from 9.6.8 to 9.6.10

On Wed, Aug 29, 2018 at 4:54 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:

2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:

Hello, I'm seeing some issues with WAL replay on a test server running
9.6.10 using WAL archived from a 9.6.8 primary server. It reliably

PANICs

during replay with messages like so:

WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages

it looks like you are hitting pretty much the same problem as I:

/messages/by-id/153492341830.1368.3936905691758473953@wrigleys.postgresql.org

The only major difference, you are restoring from the backup, while in
my case the host running replica has crashed.
Also in my case, the primary was already running 9.6.10.

In my case, it also panics during "Btree/DELETE: XYZ items" and page
number of relation is insanely huge.

That would be the same problem. Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?

Oh, perhaps I do, depending on what you mean by worker. There are a couple
of periodic processes that connect to the server to obtain metrics. Is that
what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.

I have noticed this behavior in the past but prior to 9.6.10 restarting the
server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.

dave

#8Michael Paquier
michael@paquier.xyz
In reply to: Dave Peticolas (#7)
Re: WAL replay issue from 9.6.8 to 9.6.10

On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:

Oh, perhaps I do, depending on what you mean by worker. There are a couple
of periodic processes that connect to the server to obtain metrics. Is that
what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.

That would be it. How do you decide when those begin to run and connect
to Postgres. Do you use pg_isready or similar in a loop for sanity
checks?

I have noticed this behavior in the past but prior to 9.6.10 restarting the
server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.

You are picking my interest here. Did you actually see the same
problem? In 9.6.10 what happens is that I have tightened the consistent
point checks and logic so as inconsistent page issues would actually
show up when they should, and that those become reproducible so as we
can track down any rogue WAL record or inconsistent behavior.
--
Michael

#9Dave Peticolas
dave@krondo.com
In reply to: Michael Paquier (#8)
Re: WAL replay issue from 9.6.8 to 9.6.10

On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:

Oh, perhaps I do, depending on what you mean by worker. There are a

couple

of periodic processes that connect to the server to obtain metrics. Is

that

what is triggering this issue? In my case I could probably suspend them
until the replay has reached the desired point.

That would be it. How do you decide when those begin to run and connect
to Postgres. Do you use pg_isready or similar in a loop for sanity
checks?

I do not, they just try to connect and bail if they cannot.

I have noticed this behavior in the past but prior to 9.6.10 restarting

the

server would fix the issue. And the replay always seemed to reach a point
past which the problem would not re-occur.

You are picking my interest here. Did you actually see the same
problem? In 9.6.10 what happens is that I have tightened the consistent
point checks and logic so as inconsistent page issues would actually
show up when they should, and that those become reproducible so as we
can track down any rogue WAL record or inconsistent behavior.

Yes, I've seen this problem occasionally in the past. I think only in the
9.6 series. But before 9.6.10, if I restarted the server it would start
replaying WAL again and typically when it reached the point where it
PANICed before, instead it would report a consistent state and allow
read-only connections. Sometimes it would then PANIC again after more WAL
was replayed. But eventually it would reach a point where it seemed to be
able to replay WAL indefinitely without the issue happening.

dave