BUG #13876: pg_xlogdump give an error on timeline switch
The following bug has been logged on the website:
Bug reference: 13876
Logged by: Greg Clough
Email address: postgresql.org@gclough.com
PostgreSQL version: 9.3.10
Operating system: CentOS v7.2
Description:
When pg_xlogdump encounters a timeline switch in WAL, it marks it as an
"error in WAL record". I'm just guessing that there is an end-of-timeline
marker in the WAL that pg_xlogdump hasn't been programmed to recognize, so
it marks it as an error.
Here is a dump of WAL file that's had a timeline switch as the first thing
after forcing an pg_swtich_xlog():
-bash-4.2$ pwd
/opt/postgres/9.3/db2/pg_xlog
-bash-4.2$ ls -al
total 114704
drwx------. 3 postgres postgres 4096 Jan 19 09:15 .
drwx------. 16 postgres postgres 4096 Jan 19 09:33 ..
-rw-------. 1 postgres postgres 16777216 Jan 19 08:50
000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Jan 19 08:53
000000010000000000000003
-rw-------. 1 postgres postgres 16777216 Jan 19 09:07
000000020000000000000003
-rw-------. 1 postgres postgres 16777216 Jan 19 09:07
000000020000000000000004
-rw-------. 1 postgres postgres 16777216 Jan 19 09:10
000000020000000000000005
-rw-------. 1 postgres postgres 16777216 Jan 19 09:15
000000020000000000000006
-rw-------. 1 postgres postgres 41 Jan 19 08:55 00000002.history
-rw-------. 1 postgres postgres 16777216 Jan 19 09:33
000000030000000000000006
-rw-------. 1 postgres postgres 83 Jan 19 09:15 00000003.history
drwx------. 2 postgres postgres 78 Jan 19 09:15 archive_status
-bash-4.2$ pg_xlogdump 000000020000000000000006
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn:
0/06000028, prev 0/056370D0, bkp: 0000, desc: checkpoint: redo 0/6000028;
tli 2; prev tli 2; fpw true; xid 0/1894; oid 24576; multi 1; offset 0;
oldest xid 1879 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
shutdown
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn:
0/06000090, prev 0/06000028, bkp: 0000, desc: checkpoint: redo 0/6000090;
tli 2; prev tli 2; fpw true; xid 0/1894; oid 24576; multi 1; offset 0;
oldest xid 1879 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
shutdown
pg_xlogdump: FATAL: error in WAL record at 0/6000090: record with zero
length at 0/60000F8
That final message was mis-interpreted as corruption in the WAL, but without
knowing the internals of either the WAL or pg_xlogdump I can't say for sure
that it's not corruption. It's repeatable, so I doubt it's a problem in
PostgreSQL core.
Could someone confirm my understanding, and if so, then could we please get
pg_xlogdump updated to recognize an end-of-timeline?
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 2016-01-19 15:34:48 +0000, postgresql.org@gclough.com wrote:
That final message was mis-interpreted as corruption in the WAL, but without
knowing the internals of either the WAL or pg_xlogdump I can't say for sure
that it's not corruption. It's repeatable, so I doubt it's a problem in
PostgreSQL core.Could someone confirm my understanding, and if so, then could we please get
pg_xlogdump updated to recognize an end-of-timeline?
Yes, that's pretty much correct. The "old" timeline doesn't even really
know about it, so it's not generally clear how we go forward. There
could be WAL from several parallel timelines present.
You can manually choose the timeline you want to be displayed. Either by
specifying -t or by specifying files from the other timeline.
I don't think we really need to update pg_xlogdump to do any of this
automatically - for which use case is that required?
Regards,
Andres
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Wed, Jan 20, 2016 at 12:44 AM, Andres Freund <andres@anarazel.de> wrote:
You can manually choose the timeline you want to be displayed. Either by
specifying -t or by specifying files from the other timeline.I don't think we really need to update pg_xlogdump to do any of this
automatically - for which use case is that required?
Yes, I am not sure it is worth adding a dependency with the history
file just to debug the WAL contents.
--
Michael
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
(please don't top post on postgres lists)
Top posting... sorry, I blame GMail defaults... and mobile phones. :-/
If pg_xlogdump were a tool aimed at normal endusers, maybe. But it's
clearly not, so ...
What are you using pg_xlogdump for?
The issue came about as one of our other DBA's did a role swap
Primary/Standby, but didn't have the recovery_target_timeline = 'latest'
set. He couldn't figure out why the old Primary (now a Standby) wouldn't
follow the new Primary onto the new timeline, and deduced it was a corrupt
WAL due to the output of pg_xlogdump.
I can understand his confusion, as the error is quite alarming with that
big "FATAL:"... but ultimately, yes, it was our own mistake that led us
down this avenue of investigation
The timeline switch will often happen on a different machine, so you
can't just do that.
It's not a huge issue that I doubt we'll ever have again, but if it was an
oversight and easy to fix then I thought it would be worthwhile doing, so
as to stop others having the same problem.
Cheers.
Greg
Import Notes
Reply to msg id not found: CADjwvTPAWsD_G+VwHU_+WeZMSYSZiOS2UBkVvo88QmwxiOKYJA@mail.gmail.com