Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

Started by Moreno Andreoover 6 years ago9 messagesgeneral
Jump to latest
#1Moreno Andreo
moreno.andreo@evolu-s.it

<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-15">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi all,<br>
ᅵᅵᅵ I'm encountering this issue in a Windows 10/Pg11.5<br>
<br>
<br>
I followed the thread @
<a class="moz-txt-link-freetext" href="https://postgrespro.com/list/thread-id/2380690&quot;&gt;https://postgrespro.com/list/thread-id/2380690&lt;/a&gt;&lt;br&gt;
<br>
but examining heap pages is far beyond my knowledge, so if any of
the gurus would spend some time on it, I would be very grateful.<br>
<br>
So, here comes the facts<br>
<br>
database0=# select datminmxid from pg_database where datname =
current_database();<br>
ᅵdatminmxid<br>
------------<br>
ᅵᅵᅵᅵᅵᅵᅵ 365<br>
(1 row)<br>
<br>
<br>
The CTID of the tuple BEFORE the failing one is 3159,51:<br>
<br>
database0=# select ctid from tablename offset 368 limit 1;<br>
ERROR:ᅵ MultiXactId 12800 has not been created yet -- apparent
wraparound<br>
database0=# select ctid from tablename offset 367 limit 1;<br>
ᅵᅵ ctid<br>
-----------<br>
ᅵ(3159,51)<br>
(1 row)<br>
<br>
Now, I started issuing the queries<br>
<pre>select * from heap_page_items(get_raw_page('tablename',3159));
select * from heap_page_items(get_raw_page('tablename',3160));

and so on for about 5 or 6 pages.
What do I need to search for?
Or is it better to \copy them to and send 'em as an attachment?
How many pages do I need to search for?
Until I see the error again?

Thanks in advance
Moreno.-
</pre>
</body>
</html>

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Moreno Andreo (#1)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

On 2019-Oct-04, Moreno Andreo wrote:

select * from heap_page_items(get_raw_page('tablename',3159));
select * from heap_page_items(get_raw_page('tablename',3160));

and so on for about 5 or 6 pages.

Please paste the output of that for pages 3159 and 3160, as well as the
output of pg_controldata.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Moreno Andreo
moreno.andreo@evolu-s.it
In reply to: Alvaro Herrera (#2)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

Il 04/10/19 17:30, Alvaro Herrera ha scritto:

On 2019-Oct-04, Moreno Andreo wrote:

select * from heap_page_items(get_raw_page('tablename',3159));
select * from heap_page_items(get_raw_page('tablename',3160));

and so on for about 5 or 6 pages.

Please paste the output of that for pages 3159 and 3160, as well as the
output of pg_controldata.

Thanks Alvaro,
ᅵᅵᅵ you can find attached the data you requested

Attachments:

3159.csvapplication/vnd.ms-excel; name=3159.csvDownload
3160.csvapplication/vnd.ms-excel; name=3160.csvDownload
pgcontroldata.logtext/plain; charset=UTF-8; name=pgcontroldata.logDownload
#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Moreno Andreo (#3)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

On 2019-Oct-04, Moreno Andreo wrote:

Il 04/10/19 17:30, Alvaro Herrera ha scritto:

On 2019-Oct-04, Moreno Andreo wrote:

select * from heap_page_items(get_raw_page('tablename',3159));
select * from heap_page_items(get_raw_page('tablename',3160));

and so on for about 5 or 6 pages.

Please paste the output of that for pages 3159 and 3160, as well as the
output of pg_controldata.

Thanks Alvaro,
��� you can find attached the data you requested

Hmm, so it is tuple (3160,31) that's giving you grief -- it has xmax=12800
t_infomask=0x1103 (HEAP_XMAX_IS_MULTI | HEAP_XMIN_COMMITTED | others)

Which is weird, since it has none of the locking bits.

... and also the valid range of multixacts as of the last checkpoint was:

NextMultiXactId dell'ultimo checkpoint: 366
oldestMultiXID dell'ultimo checkpoint: 365

so the value 12800 is certainly not in range there.

I wonder if it would work to just clear that multixact with
SELECT ... WHERE ctid='(3160,31)' FOR UPDATE

If this was in my hands, I would scan the WAL looking for the place that
last touched this page (and the latest FPI for this page, also). It
might have an explanation of what went on. Maybe use the page's LSN
(from pageinspect's page_header()) as starting point for the WAL
location that modified the page. I hope you have a WAL archive that
goes back to well before the previous checkpoint.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Moreno Andreo
moreno.andreo@evolu-s.it
In reply to: Alvaro Herrera (#4)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

Il 04/10/19 18:28, Alvaro Herrera ha scritto:

I wonder if it would work to just clear that multixact with
SELECT ... WHERE ctid='(3160,31)' FOR UPDATE

select ...what? :-) Sorry but it's totally beyond my knowledge and my
control.... after resolving the issue i'll surely go and search docs to
understand what we've done....

If this was in my hands, I would scan the WAL looking for the place that
last touched this page (and the latest FPI for this page, also). It
might have an explanation of what went on. Maybe use the page's LSN
(from pageinspect's page_header()) as starting point for the WAL
location that modified the page. I hope you have a WAL archive that
goes back to well before the previous checkpoint.

One thing I forgot to report is that this cluster is just upgraded from
a 9.1 that was crashing at least once a day (in many cases the upgrade
itself resolved the issue)
here's the log line
2019-10-03 15:11:52 CEST LOG:ᅵ server process (PID 18668) was terminated
by exception 0xC0000005
In this case probably the access violation was due to a data corruption.
These are customer machines that are really badly kept and NTFS issues
are not that rare, so I won't bother investigating what's happened but
just make the customer up & running again.

Thanks for your time
Moreno

Show quoted text
#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Moreno Andreo (#5)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

On 2019-Oct-04, Moreno Andreo wrote:

Il 04/10/19 18:28, Alvaro Herrera ha scritto:

I wonder if it would work to just clear that multixact with
SELECT ... WHERE ctid='(3160,31)' FOR UPDATE

select ...what? :-) Sorry but it's totally beyond my knowledge and my
control.... after resolving the issue i'll surely go and search docs to
understand what we've done....

This should do it:

SELECT * FROM the_broken_table WHERE <what I said above>

But of course I make no promise of it working or even having any effect
at all ...

One thing I forgot to report is that this cluster is just upgraded from a
9.1 that was crashing at least once a day (in many cases the upgrade itself
resolved the issue)
here's the log line
2019-10-03 15:11:52 CEST LOG:� server process (PID 18668) was terminated by
exception 0xC0000005
In this case probably the access violation was due to a data corruption.
These are customer machines that are really badly kept and NTFS issues are
not that rare, so I won't bother investigating what's happened but just make
the customer up & running again.

Hmm, well, it does sound like corrupted data, and if we suspect that
that's the case then there's not much we can do other than clearing the
page and moving on.

That exception code is STATUS_ACCESS_VIOLATION. Old Postgres bugs caused
that, many are fixed in current versions I think.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Moreno Andreo
moreno.andreo@evolu-s.it
In reply to: Alvaro Herrera (#6)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

Il 04/10/19 21:14, Alvaro Herrera ha scritto:

On 2019-Oct-04, Moreno Andreo wrote:

Il 04/10/19 18:28, Alvaro Herrera ha scritto:

I wonder if it would work to just clear that multixact with
SELECT ... WHERE ctid='(3160,31)' FOR UPDATE

select ...what? :-) Sorry but it's totally beyond my knowledge and my
control.... after resolving the issue i'll surely go and search docs to
understand what we've done....

This should do it:

SELECT * FROM the_broken_table WHERE <what I said above>

But of course I make no promise of it working or even having any effect
at all ...

Unfortunately, it didn't work :(

db0=# select * from failing_table where ctid='(3160,31)' for update;
ERROR:ᅵ MultiXactId 12800 has not been created yet -- apparent wraparound

Since the probability we are into corruption is very high, what if I
\copy all the table but the failing row(s) to an external file, drop and
recreate the table, and then \copy clean data back inside?

Thanks
Moreno.-

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Moreno Andreo (#7)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

On 2019-Oct-07, Moreno Andreo wrote:

Unfortunately, it didn't work :(

db0=# select * from failing_table where ctid='(3160,31)' for update;
ERROR:� MultiXactId 12800 has not been created yet -- apparent wraparound

Oh well. It was a long shot anyway ...

Since the probability we are into corruption is very high, what if I \copy
all the table but the failing row(s) to an external file, drop and recreate
the table, and then \copy clean data back inside?

Yes, that should work.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Moreno Andreo
moreno.andreo@evolu-s.it
In reply to: Alvaro Herrera (#8)
Re: Pg11 -- MultiXactId xxxx has not been created yet -- apparent wraparound

Hi Alvaro,
ᅵᅵᅵ sorry for late reply, I've been out of office.

Il 09/10/19 19:51, Alvaro Herrera ha scritto:

On 2019-Oct-07, Moreno Andreo wrote:

Unfortunately, it didn't work :(

db0=# select * from failing_table where ctid='(3160,31)' for update;
ERROR:ᅵ MultiXactId 12800 has not been created yet -- apparent wraparound

Oh well. It was a long shot anyway ...

It was a long shot, but it was worth trying

Since the probability we are into corruption is very high, what if I \copy
all the table but the failing row(s) to an external file, drop and recreate
the table, and then \copy clean data back inside?

Yes, that should work.

It did not work... I think there was some big deal with the cluster itself.
To extract these small parts of data I had to SELECT using OFFSET and LIMIT.
Well, the same query (i.e. select * from table offset 35 limit 145) run
as it is worked well, but from the moment I put it into a COPY
statement, it was messing again with multixact, even if I tried back the
only query.
It ended recovering data from backups (2 days old, and that's good news)

Thanks for your time
Moreno.-

Show quoted text