ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound
With PostgreSQL 9.3.5 on Ubuntu 12.04, I'm getting the error:
ERROR: MultiXactId 1934308693 has not been created yet -- apparent
wraparound
when running various queries on our database. I don't think it is really a
wraparound - I think the tuple has mistakenly been marked as having a
MultiXactId associated with it.
Looking back through the logs, it looks as if this suddenly started
happening at 2:00 (UTC+0100) on 23 May. (Nothing was happening at that
time as far as we can tell - the server didn't crash/restart or anything).
The log files were all quite small until then; the ones from that point
onwards are full of entries like:
2015-05-23 10:44:54 BST ERROR: MultiXactId 1129406840 has not been
created yet -- apparent wraparound
2015-05-23 10:44:54 BST CONTEXT: automatic analyze of table
"xxxxx.messages.msgdata"
2015-05-23 10:45:16 BST ERROR: MultiXactId 1575170 has not been created
yet -- apparent wraparound
2015-05-23 10:45:16 BST STATEMENT: select sum(size) from messages.msgdata
2015-05-23 10:45:54 BST ERROR: MultiXactId 1129406840 has not been
created yet -- apparent wraparound
2015-05-23 10:45:54 BST CONTEXT: automatic analyze of table
"xxxxx.messages.msgdata"
2015-05-23 10:46:54 BST ERROR: MultiXactId 1129406840 has not been
created yet -- apparent wraparound
(There are several incorrect MultiXactIds in there).
This was NOT an upgrade. It has been running with 9.3 for ages.
The pg_controldata output is:
pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5990773948116871611
Database cluster state: in production
pg_control last modified: Tue 26 May 2015 09:50:25 BST
Latest checkpoint location: 7C6/8F863440
Prior checkpoint location: 7C6/8E8576D8
Latest checkpoint's REDO location: 7C6/8E8745F8
Latest checkpoint's REDO WAL file: 00000001000007C60000008E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/24005839
Latest checkpoint's NextOID: 1802564
Latest checkpoint's NextMultiXactId: 216
Latest checkpoint's NextMultiOffset: 439
Latest checkpoint's oldestXID: 710
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Tue 26 May 2015 09:45:57 BST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: minimal
Current max_connections setting: 200
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
The pg_multixact directory contains two files:
members/0000
offsets/0000
There are no locked-up transactions in pg_stat_activity.
Any ideas? I can't even delete the records with the bad MultiXactId in
them (which would be acceptable, as we can recover those individual
records from backup). We don't want to do a full restore from backup if we
can avoid it, as that would lose everything since the last good backup.
It was fixed in 9.3.7.
Unfortunately, 9.3.7 has a new bug which is unrelated to this one:
/messages/by-id/20150525142657.4686.35151@wrigleys.postgresql.org
I'm not sure whether Ubuntu 12.04 is affected by that bug or not, though.
As far as I know, the developers plan to release 9.3.8 etc. soon.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 26/05/2015 10:23, Tatsuo Ishii wrote:
It was fixed in 9.3.7.
OK.
Do you know what the bug number was, or what it was called? (I tried
searching for it before posting, but found nothing that seemed to be the
same problem as ours.)
Do you know whether, if we upgrade to 9.3.7/8, it will fix the problem
in the current database, or is the data damage 'permanent', so we'll
need to restore from backup and somehow merge in the latest data?
Paul Smith wrote:
With PostgreSQL 9.3.5 on Ubuntu 12.04, I'm getting the error:
ERROR: MultiXactId 1934308693 has not been created yet -- apparent wraparound
on doing various queries on our database. I don't think it is a wraparound -
I think the tuple has mistakenly decided it has a MultiXactId related to it.
Yeah, that looks like the case. According to your pg_controldata
output, you haven't used many multixacts at all:
Latest checkpoint's NextMultiXactId: 216
Latest checkpoint's NextMultiOffset: 439
and the fact that you only have these files:
The pg_multixact directory contains two files
members/0000
offsets/0000
confirms this. Your XID counter is also rather low, only 24 million:
Latest checkpoint's NextXID: 0/24005839
so it doesn't seem plausible that the single bit HEAP_XMAX_IS_MULTI was
turned on accidentally (something which I've never seen happen).
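As a cross-check (nothing beyond ordinary catalog access on 9.3 is assumed
here), the per-database horizons should tell the same story:

    -- Per-database horizons: datminmxid (multixacts) and datfrozenxid/age()
    -- (xids).  With NextMultiXactId at 216, a value like 1934308693 cannot
    -- be a real multixact on this cluster.
    SELECT datname,
           datfrozenxid,
           age(datfrozenxid) AS xid_age,
           datminmxid
    FROM   pg_database
    ORDER BY datname;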
It doesn't look like a randomly corrupt page either; normally you would
see errors about mismatching page headers before you get to the point
where Xmax is read. I wonder if the data page came from elsewhere.
Maybe you copied a data file from another database?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26/05/2015 16:01, Alvaro Herrera wrote:
Paul Smith wrote:
With PostgreSQL 9.3.5 on Ubuntu 12.04, I'm getting the error:
ERROR: MultiXactId 1934308693 has not been created yet -- apparent wraparound
on doing various queries on our database. I don't think it is a wraparound -
I think the tuple has mistakenly decided it has a MultiXactId related to it.
Yeah, that looks like the case. According to your pg_controldata
output, you haven't used many multixacts at all:
Yep, that's what I thought as well.
so it doesn't seem plausible that the single bit HEAP_XMAX_IS_MULTI
was turned on accidentally (something which I've never seen happen).
It doesn't look like a randomly corrupt page either; normally you
would see errors about mismatching page headers before you get to the
point where Xmax is read. I wonder if the data page came from
elsewhere. Maybe you copied a data file from another database?
No, nothing like that. It was just running fine, and then suddenly (at
2am on 23 May) it started throwing up loads of these errors. The DB
server wasn't even restarted at that point. It was just working fine,
then suddenly wasn't. (The first error was at 02:00:32 BST, then every
few minutes after that there's another one).
It's running in a Hyper-V guest. We had taken a backup of the VM at
00:34 on 23 May and that looks to be absolutely fine. What I have done
now is restore that backup and import the new data which arrived since
that backup was made, and it seems OK now. I still have the 'broken'
installation in case more information is needed from it. I'd try to get
a raw dump of the damaged tuple data if I knew how to find where it is
in the relation file...
I suppose it's possible that it was disk or memory corruption, but I've
seen that before, and it hasn't looked like this. (There are several
PostgreSQL guests running on the same Hyper-V server, and none of the
others seem to have this problem, which may suggest that it's less likely
to be a hardware issue.)
Paul Smith wrote:
No, nothing like that. It was just running fine, and then suddenly (at 2am
on 23 May) it started throwing up loads of these errors. The DB server
wasn't even restarted at that point. It was just working fine, then suddenly
wasn't. (The first error was at 02:00:32 BST, then every few minutes after
that there's another one).
Another crazy thought is that the host system messed up and overwrote
part of the table with pages from some other guest system. I have never
seen a report of such a thing happening.
It's running in a Hyper-V guest. We had taken a backup of the VM at 00:34 on
23 May and that looks to be absolutely fine. What I have done now is restore
that backup and import the new data which arrived since that backup was
made, and it seems OK now. I still have the 'broken' installation in case
more information is needed from it. I'd try to get a raw dump of the damaged
tuple data if I knew how to find where it is in the relation file...
Well, you could try a binary search to locate the (or a) borked page
within the table. Or you could write a plpgsql function with an
EXCEPTION block that reads each tuple, and in case of error reports the
CTID of the tuple being read. (SELECT * FROM tab WHERE
ctid = generated_ctid; EXCEPTION WHEN OTHERS THEN RAISE NOTICE ' .. ',
generated_ctid). Once you have located the problem tuple, a pageinspect
dump would be the starting point.
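A minimal sketch of such a function (the name is made up; 291 is the maximum
number of heap tuples that fit on an 8 kB page):

    -- Probe every possible ctid in the table and report the tuples that
    -- cannot be read.  Slow (one query per candidate ctid), but it only
    -- needs to run once to find the bad CTIDs.
    CREATE OR REPLACE FUNCTION find_broken_tuples(tab regclass)
    RETURNS SETOF tid LANGUAGE plpgsql AS $$
    DECLARE
        nblocks bigint := pg_relation_size(tab) / 8192;
        t       tid;
    BEGIN
        FOR blk IN 0 .. nblocks - 1 LOOP
            FOR off IN 1 .. 291 LOOP    -- max heap tuples per 8 kB page
                t := format('(%s,%s)', blk, off)::tid;
                BEGIN
                    EXECUTE format('SELECT * FROM %s WHERE ctid = $1', tab) USING t;
                EXCEPTION WHEN OTHERS THEN
                    RAISE NOTICE 'error reading %: %', t, SQLERRM;
                    RETURN NEXT t;
                END;
            END LOOP;
        END LOOP;
    END;
    $$;

    -- Usage: SELECT * FROM find_broken_tuples('messages.msgdata');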
Another idea is to attach a debugger to the backend process, and set a
breakpoint on function errfinish. Once the error is hit the breakpoint
will stop the execution before the error is reported; at that point you
can request a backtrace from the debugger. (Make sure to examine
errordata to see whether the error being hit is the multixact one.)
I suppose it's possible that it was disk or memory corruption, but I've seen
that before, and it hasn't looked like this.
Yeah, same here.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/26/2015 11:47 AM, Alvaro Herrera wrote:
Paul Smith wrote:
No, nothing like that. It was just running fine, and then suddenly (at 2am
on 23 May) it started throwing up loads of these errors. The DB server
wasn't even restarted at that point. It was just working fine, then suddenly
wasn't. (The first error was at 02:00:32 BST, then every few minutes after
that there's another one).
Another crazy thought is that the host system messed up and overwrote
part of the table with pages from some other guest system. I have never
seen a report of such a thing happening.
I have. What Linux kernel version is the host running?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 26/05/2015 19:47, Alvaro Herrera wrote:
Paul Smith wrote:
No, nothing like that. It was just running fine, and then suddenly (at 2am
on 23 May) it started throwing up loads of these errors. The DB server
wasn't even restarted at that point. It was just working fine, then suddenly
wasn't. (The first error was at 02:00:32 BST, then every few minutes after
that there's another one).
Another crazy thought is that the host system messed up and overwrote
part of the table with pages from some other guest system. I have never
seen a report of such a thing happening.
Yes, that seems unlikely to me, but you never know...
While I was trying to work out how to fix it (after my original email),
I did see a pgsql server crash:
2015-05-26 12:26:30 BST LOG: server process (PID 35493) was terminated
by signal 11: Segmentation fault
It's running in a Hyper-V guest. We had taken a backup of the VM at 00:34 on
23 May and that looks to be absolutely fine. What I have done now is restore
that backup and import the new data which arrived since that backup was
made, and it seems OK now. I still have the 'broken' installation in case
more information is needed from it. I'd try to get a raw dump of the damaged
tuple data if I knew how to find where it is in the relation file...
Well, you could try a binary search to locate the (or a) borked page
within the table. Or you could write a plpgsql function with an
EXCEPTION block that reads each tuple, and in case of error reports the
CTID of the tuple being read. (select * from tab where
ctid=generated_ctid; EXCEPT WHEN OTHERS THEN raise notice ' .. ',
generated_ctid).
In the backup, for the row which will become broken, I get:
ctid | xmin | xmax
------------+----------+----------
(18722,29) | 23862661 | 23862661
Assuming it hadn't changed in the broken version (from application
knowledge, it's unlikely to change), then on the broken version,
select heap_page_items(get_raw_page('messages.msgdata', 18722)); gives:
(1,2,2,0,,,,,,,,,)
(2,7816,1,376,21857691,21857691,11,"(18722,2)",32784,10646,24,,)
(3,4,2,0,,,,,,,,,)
(4,7568,1,248,21858126,21858126,11,"(18722,4)",32784,10646,24,,)
(5,6,2,0,,,,,,,,,)
(6,7216,1,352,22823803,23871405,41,"(18722,30)",49168,9478,24,,)
(7,8,2,0,,,,,,,,,)
(8,6968,1,248,22825206,22825206,11,"(18722,8)",32784,10646,24,,)
(9,10,2,0,,,,,,,,,)
(10,6720,1,248,22826427,22826427,11,"(18722,10)",32784,10646,24,,)
(11,12,2,0,,,,,,,,,)
(12,6336,1,380,22826899,23935627,1,"(18722,31)",49168,9478,24,,)
(13,0,3,0,,,,,,,,,)
(14,15,2,0,,,,,,,,,)
(15,5400,1,932,23862314,23862314,11,"(18722,15)",32784,10642,24,,)
(16,17,2,0,,,,,,,,,)
(17,5168,1,228,23862333,23862333,11,"(18722,17)",32784,10646,24,,)
(18,19,2,0,,,,,,,,,)
(19,4800,1,368,23862365,23862365,11,"(18722,19)",32784,10646,24,,)
(20,24,2,0,,,,,,,,,)
(21,3112,1,812,1332175474,894779760,223823413,"(1443526721,25142)",12141,18516,57,,)
(22,23,2,0,,,,,,,,,)
(23,4448,1,352,728903474,1466642284,1869042033,"(1262568257,16743)",16751,26455,65,111101101000001011010010100000101010001011110110100000101101001010000010100000000000000000000000000000001010000001110110110000001000000001001000101000010110111000000000000000001100000111100010000000000000000010110010101100101101100000000000110010110010110011000000000000000000000000000000000000000000000000000000000000000000000010100001,)
(24,4184,1,264,619804,228,8000,"(956969542,26694)",22857,29251,98,110001101100001001100110011001100001110010000010001000100000111011110100110100101001011000101100100101100110001001000010101011100101111001010110111100101100001011000110101000101101011000011100010101100100110010011100110100101001101001100110110001101000101000110110100101100100001010011110000010100011011001000010101011001111001010100110010010100111011000101110010010101111010000011010110101101000001000101100101100101000101000001110000110101110001000001100110001100001001001010010000010100101001000011110010010100110110010000110101100000101000011100110110010100011011010110110100110101111010000100110,)
(25,26,2,0,,,,,,,,,)
(26,3928,1,252,1212837441,1232350037,1668772674,"(728910135,11093)",31284,16687,70,1111011010100010100010100100110011001100100000101101001001110110000101101110011001100110101101101001110010011110010010101000110011110100000100100010111011010100011010100001001010110110101100100011001000101010111011000000111011110100110011001011000001010000101101101111010010110110100001101000001010100010011011100100101010110110001100100101011001001110001101101100011001100110,1717791858)
(27,21,2,0,,,,,,,,,)
(28,29,2,0,,,,,,,,,)
(29,2296,1,812,1229271346,1934308693,1733443957,"(1848337707,21358)",31337,21592,88,,1178686785)
(30,1944,1,352,947155032,1380218998,792031842,"(1211650927,22579)",30585,20532,80,,)
(31,1560,1,380,23935627,23935627,1,"(18722,31)",40976,9510,24,,)
(31 rows)
The data for index 29 has 1934308693 in the header (which is the
multixactid reported in the error message when trying to retrieve that row).
I'm struggling to find the definitions for the data returned by
heap_page_items, but the output from this page (for indexes between 21
and 30) looks quite different from other neighbouring pages.
If I look at the raw page data, it contains data relating to this
installation, so it doesn't look like a page from a different Hyper-V
guest. It also doesn't look like just random data from a memory error.
It actually looks like legitimate data which should possibly have been
written elsewhere. (I don't want to post it here because it contains
possibly confidential data from one of our customers).
Paul Smith wrote:
In the backup, for the row which will become broken, I get:
ctid | xmin | xmax
------------+----------+----------
(18722,29) | 23862661 | 23862661
Okay.
Assuming it hadn't changed in the broken version (from application
knowledge, it's unlikely to change), then on the broken version,
select heap_page_items(get_raw_page('messages.msgdata', 18722)); gives:
(21,3112,1,812,1332175474,894779760,223823413,"(1443526721,25142)",12141,18516,57,,)
(23,4448,1,352,728903474,1466642284,1869042033,"(1262568257,16743)",16751,26455,65,111101101000001011010010100000101010001011110110100000101101001010000010100000000000000000000000000000001010000001110110110000001000000001001000101000010110111000000000000000001100000111100010000000000000000010110010101100101101100000000000110010110010110011000000000000000000000000000000000000000000000000000000000000000000000010100001,)
(24,4184,1,264,619804,228,8000,"(956969542,26694)",22857,29251,98,110001101100001001100110011001100001110010000010001000100000111011110100110100101001011000101100100101100110001001000010101011100101111001010110111100101100001011000110101000101101011000011100010101100100110010011100110100101001101001100110110001101000101000110110100101100100001010011110000010100011011001000010101011001111001010100110010010100111011000101110010010101111010000011010110101101000001000101100101100101000101000001110000110101110001000001100110001100001001001010010000010100101001000011110010010100110110010000110101100000101000011100110110010100011011010110110100110101111010000100110,)
(26,3928,1,252,1212837441,1232350037,1668772674,"(728910135,11093)",31284,16687,70,1111011010100010100010100100110011001100100000101101001001110110000101101110011001100110101101101001110010011110010010101000110011110100000100100010111011010100011010100001001010110110101100100011001000101010111011000000111011110100110011001011000001010000101101101111010010110110100001101000001010100010011011100100101010110110001100100101011001001110001101101100011001100110,1717791858)
(29,2296,1,812,1229271346,1934308693,1733443957,"(1848337707,21358)",31337,21592,88,,1178686785)
(30,1944,1,352,947155032,1380218998,792031842,"(1211650927,22579)",30585,20532,80,,)
(31,1560,1,380,23935627,23935627,1,"(18722,31)",40976,9510,24,,)
(31 rows)
The data for index 29 has 1934308693 in the header (which is the multixactid
reported in the error message when trying to retrieve that row).
Right. The ctids on items 21, 23, 24, 26, 29 and 30 are pretty obviously
corrupted (page numbers are too large, and offset numbers are way above
normal offset numbers), as is probably everything else about them. Note
xmin should be a valid xid, but for example on item 21 it's about 1.3 billion,
which obviously is not a valid xid on your system. Items 26 and 29 even
have an OID, which the others do not (indicating a wrong HEAP_HASOID
flag).
Note the lp_off fields. It seems that all valid items have offsets
above 4096, and corrupted items have offsets below that. The theory
that somehow the system wrote a bad filesystem page (4096 bytes) on the
lower half of the Postgres page (8192 bytes) sounds rather more probable
now.
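To eyeball the raw bytes of that page, the block number maps straight to a
file offset (assuming the standard 8 kB block size and 1 GB segment files):

    -- Block 18722 of messages.msgdata: which file it is in and where.
    -- Blocks 0..131071 live in the base file from pg_relation_filepath(),
    -- later blocks in the .1, .2, ... segment files.
    SELECT pg_relation_filepath('messages.msgdata') AS base_file,
           18722 / 131072                           AS segment_no,
           (18722 % 131072) * 8192                  AS byte_offset;

The 8192 bytes starting at that offset are the page whose lower half looks
overwritten.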
I'm struggling to find the definitions for the data returned by
heap_page_items,
Try \df+ heap_page_items; or more easily, use this query instead:
select * from heap_page_items(get_raw_page('messages.msgdata', 18722));
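With the columns labelled it is easy to check both of the things above -
where each item's data starts, and whether its xmax is flagged as a
multixact (HEAP_XMAX_IS_MULTI is bit 0x1000 of t_infomask) - with something
along these lines:

    -- Normal items (lp_flags = 1): data offset, xmin/xmax, and whether the
    -- HEAP_XMAX_IS_MULTI bit is set in t_infomask.
    SELECT lp, lp_off, t_xmin, t_xmax,
           (t_infomask & x'1000'::int) <> 0 AS xmax_is_multi
    FROM   heap_page_items(get_raw_page('messages.msgdata', 18722))
    WHERE  lp_flags = 1
    ORDER BY lp_off;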
If I look at the raw page data, it contains data relating to this
installation, so it doesn't look like a page from a different Hyper-V guest.
It also doesn't look like just random data from a memory error. It actually
looks like legitimate data which should possibly have been written
elsewhere. (I don't want to post it here because it contains possibly
confidential data from one of our customers).
Sure.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services