ERROR: found multixact from before relminmxid
Hi,
Some time ago, I frequently saw these errors in the logs after autovacuum
ran on some tables (PostgreSQL 9.6). VACUUM FULL or CLUSTER on these tables
shows the same error and does not complete the task (the bloat was visible
via a table-bloat query).
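(For reference, a simple way to gauge the bloat, assuming the pgstattuple
contrib module is installed, and using the table from below as an example:

db1=# CREATE EXTENSION pgstattuple;
db1=# SELECT table_len, dead_tuple_count, dead_tuple_percent, free_percent
      FROM pgstattuple('pc44t');

dead_tuple_percent plus free_percent gives a rough picture of how much of
the heap is wasted.)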
Then I did a full dump/restore into a new version (10.2) and everything was
OK for a couple of months. Now I'm getting errors like this again:
db1=# cluster pc44t;
ERROR: found multixact 134100944 from before relminmxid 192042633
Like before, the only way to make the errors disappear is to dump/reload
the whole table.
Thanks for any help.
Best regards,
Alexandre
On 04/06/2018 02:09 AM, Alexandre Arruda wrote:
That's going to be hard, unless you still have the cluster around.
This surely seems like some sort of data corruption issue, but without
being able to inspect the data pages it's nearly impossible to determine
what went wrong.
We'd also need more information about what happened to the hardware and
cluster before the issues started to appear: crashes, hardware issues,
and so on.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-04-06 9:39 GMT-03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
Hi Tomas,
The old cluster is gone, unfortunately.
This server has 128 GB of ECC RAM and dual redundant Fibre Channel HBAs
connected to a storage array with RAID 6, and there are (apparently) no
errors reported.
Yesterday I did a test with one table: some sum aggregates, count(*), etc.,
then a dump/reload, and I repeated the tests. The query results are the
same, and as for the vacuum problem, it disappeared.
Best regards,
On 04/06/2018 04:29 PM, Alexandre Arruda wrote:
I'm not sure I understand correctly. So you can reproduce the issue? If
yes, how can you share the scripts/data you use (and how large is it)?
If we could reproduce it locally, it would make the investigation much
easier.
BTW you mentioned you're using PostgreSQL 9.6 - which minor version,
specifically?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-04-06 13:11 GMT-03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
Hi Tomas,
No, I can't reproduce it. What I did was a simple way to "validate" the
current table data, to see whether a dump/reload preserves it. The old
PostgreSQL was 9.6.5. The problem has now returned on a new 10.3
installation.
Is there a way to correct these tables without a dump/reload?
I'm thinking of reinstalling the cluster with initdb --data-checksums, but
I'm afraid of a severe performance impact.
Best regards,
Alexandre
On 04/09/2018 01:49 PM, Alexandre Arruda wrote:
> Hi Tomas,
>
> No, I can't reproduce it. What I did was a simple way to "validate" the
> current table data, to see whether a dump/reload preserves it. The old
> PostgreSQL was 9.6.5. The problem has now returned on a new 10.3
> installation.
I'm confused. First you say you can't reproduce the issue, then you say it
came back on a new installation. So which is it?
> Is there a way to correct these tables without a dump/reload?
> I'm thinking of reinstalling the cluster with initdb --data-checksums,
> but I'm afraid of a severe performance impact.
The performance impact should be non-measurable for most workloads. It
might be a couple of percent on certain workloads/hardware, but that's
probably not your case.
The bigger question is whether this can actually detect the issue. If
it's due to a storage issue, then perhaps yes. But if you only see
multixact issues consistently and nothing else, it might easily be a
PostgreSQL bug (in which case the checksums will be correct).
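As a side note, you can confirm whether a running cluster has checksums
enabled; a sketch (the data directory path is whatever yours is):

SHOW data_checksums;                      -- read-only preset GUC
pg_controldata $PGDATA | grep checksum    -- "Data page checksum version",
                                          -- 0 means off

On your version, enabling them does require a fresh initdb
--data-checksums plus a dump/restore or a replication-based migration.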
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 9, 2018 at 7:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The bigger question is whether this can actually detect the issue. If
> it's due to a storage issue, then perhaps yes. But if you only see
> multixact issues consistently and nothing else, it might easily be a
> PostgreSQL bug (in which case the checksums will be correct).
You can also run amcheck. Get the version targeting earlier Postgres
releases off Github (there are packages for most Linux systems). This
can verify that the heap is consistent with indexes.
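Something like this would run the check against every index on the table
(just a sketch: it assumes all the indexes are btrees, and that the Github
build installs as the amcheck_next extension; the contrib version is
CREATE EXTENSION amcheck):

CREATE EXTENSION amcheck_next;
SELECT i.indexrelid::regclass AS index_name,
       bt_index_check(i.indexrelid)
FROM pg_index i
WHERE i.indrelid = 'pc44t'::regclass;

bt_index_check only takes an AccessShareLock, so it's safe on a production
table; bt_index_parent_check is more thorough but takes a ShareLock. With
the Github version you can also pass a second boolean argument to
bt_index_check to verify the heap against the index.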
--
Peter Geoghegan
2018-04-09 18:23 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
Hi Peter,
I ran amcheck on all the indexes of a table and only got empty returns.
I did some tests:
production=# reindex table fn06t;
REINDEX
production=# vacuum verbose fn06t;
INFO: vacuuming "public.fn06t"
ERROR: found multixact 76440919 from before relminmxid 122128619
production=# select count(*),sum(fn06vrtitu) from fn06t;
count | sum
---------+-----------------
2592363 | 4833603148.1172
(1 row)
#createdb freshdb
#pg_dump -t fn06t production | psql freshdb
freshdb=# select count(*),sum(fn06vrtitu) from fn06t;
count | sum
---------+-----------------
2592363 | 4833603148.1172
(1 row)
psql -c "select * from fn06t order by <PKEY>" production >
multixact_error_fn06t.txt
psql -c "select * from fn06t order by <PKEY>" freshdb > freshdb_fn06t.txt
diff shows no differences between the files, and md5sum produces the same
hash.
The question is: if a dump can read the table "correctly", why do
CLUSTER/VACUUM FULL stop while reading it?
Best regards,
Alexandre
On Mon, Apr 9, 2018 at 5:55 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> I ran amcheck on all the indexes of a table and only got empty returns.
Did you try doing so with the "heapallindexed" option? That's what's
really interesting here.
--
Peter Geoghegan
2018-04-09 22:16 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
production=# select bt_index_parent_check('fn06t_pkey',true);
bt_index_parent_check
-----------------------
(1 row)
production=# select bt_index_parent_check('ifn06t1',true);
bt_index_parent_check
-----------------------
(1 row)
production=# select bt_index_parent_check('ifn06t4',true);
bt_index_parent_check
-----------------------
(1 row)
(... and all the other indexes return null too)
I tried with bt_index_check too. Same results.
regards,
Alexandre
On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> (... and all the other indexes return null too)
> I tried with bt_index_check too. Same results.
That's interesting, because it tells me that you have a table that
appears to not be corrupt, despite the CLUSTER error. Also, the error
itself comes from sanity checking added to MultiXact freezing fairly
recently, in commit 699bf7d0.
You didn't say anything about regular VACUUM being broken. Do you find
that it works without any apparent issue?
I have a suspicion that this could be a subtle bug in
CLUSTER/freezing. The only heap_freeze_tuple() caller is code used by
CLUSTER, so it's not that hard to imagine a MultiXact freezing bug
that is peculiar to CLUSTER. Though I haven't thought about it in much
detail.
--
Peter Geoghegan
On April 9, 2018 7:51:19 PM PDT, Peter Geoghegan <pg@bowt.ie> wrote:
I've not followed this thread. Possibly it's the overeager check for
pg_upgraded tuples from before 9.3 that Alvaro fixed recently?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Apr 9, 2018 at 7:53 PM, Andres Freund <andres@anarazel.de> wrote:
> I've not followed this thread. Possibly it's the overeager check for
> pg_upgraded tuples from before 9.3 that Alvaro fixed recently?
I was aware of commit 477ad05e, which must be what you're referring
to. I don't think that that's what this is, since this error occurs
within heap_freeze_tuple() -- it's not the over-enforced
HEAP_XMAX_IS_LOCKED_ONLY() error within heap_prepare_freeze_tuple().
And, because this database wasn't pg_upgraded.
I should wait until tomorrow before doing any further analysis, though.
--
Peter Geoghegan
2018-04-09 23:51 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
Hi Peter,
Actually, I first noticed the problem in the logs, from autovacuum:
2018-04-10 08:22:15.385 -03 [55477] CONTEXT: automatic vacuum of
table "production.public.fn06t"
2018-04-10 08:22:16.815 -03 [55477] ERROR: found multixact 68834765
from before relminmxid 73262006
production=# vacuum analyze verbose fn06t;
INFO: vacuuming "public.fn06t"
ERROR: found multixact 76440919 from before relminmxid 122128619
On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda <adaldeia@gmail.com> wrote:
Do you think that CLUSTER was run before regular VACUUM/autovacuum
showed this error, though?
Have you noticed any data loss? Things look okay when you do your dump
+ restore, right? The problem, as far as you know, is strictly that
CLUSTER + VACUUM refuse to finish/raise these multixactid errors?
--
Peter Geoghegan
On 2018-04-10 08:31:20 -0300, Alexandre Arruda wrote:
> Hi Peter,
>
> Actually, I first noticed the problem in the logs, from autovacuum:
>
> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT: automatic vacuum of
> table "production.public.fn06t"
> 2018-04-10 08:22:16.815 -03 [55477] ERROR: found multixact 68834765
> from before relminmxid 73262006
>
> production=# vacuum analyze verbose fn06t;
> INFO: vacuuming "public.fn06t"
> ERROR: found multixact 76440919 from before relminmxid 122128619
What does the function in
/messages/by-id/20180319181723.ugaf7hfkluqyos5d@alap3.anarazel.de
say about your table?
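(If that's hard to run, a rough page-level sketch in the same spirit, using
pageinspect -- this is not the exact function from that post, and 4096 is
the HEAP_XMAX_IS_MULTI infomask bit:

CREATE EXTENSION pageinspect;
SELECT lp, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('fn06t', 0))
WHERE (t_infomask & 4096) <> 0;  -- HEAP_XMAX_IS_MULTI = 0x1000

Running that over each block and looking for t_xmax values below the
table's relminmxid would show the tuples that trigger this error.)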
Could you post pg_controldata output and
SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
?
Greetings,
Andres Freund
2018-04-10 19:53 GMT-03:00 Andres Freund <andres@anarazel.de>:
pg_control version number: 1002
Catalog version number: 201707211
Database system identifier: 6513284223567708119
Database cluster state: in production
pg_control last modified: Ter 10 Abr 2018 23:51:56 BRT
Latest checkpoint location: 183/E4849618
Prior checkpoint location: 183/E19F8A80
Latest checkpoint's REDO location: 183/E202C430
Latest checkpoint's REDO WAL file: 0000000100000183000000E2
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:494263452
Latest checkpoint's NextOID: 1816390854
Latest checkpoint's NextMultiXactId: 256298856
Latest checkpoint's NextMultiOffset: 529468697
Latest checkpoint's oldestXID: 103399998
Latest checkpoint's oldestXID's DB: 866864162
Latest checkpoint's oldestActiveXID: 494263450
Latest checkpoint's oldestMultiXid: 73262006
Latest checkpoint's oldestMulti's DB: 866864162
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Ter 10 Abr 2018 23:50:14 BRT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 600
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce:
a8eb07ea85c4bbeaf202020d036277b276bda47ef55c1456723ec2b3c40386b1
SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
-[ RECORD 1 ]-------+------------------------------------------------------
relname | fn06t
relnamespace | 2200
reltype | 866874136
reloftype | 0
relowner | 10
relam | 0
relfilenode | 1092835324
reltablespace | 0
relpages | 218797
reltuples | 2.60181e+06
relallvisible | 192741
reltoastrelid | 866874160
relhasindex | t
relisshared | f
relpersistence | p
relkind | r
relnatts | 154
relchecks | 0
relhasoids | t
relhaspkey | t
relhasrules | f
relhastriggers | t
relhassubclass | f
relrowsecurity | f
relforcerowsecurity | f
relispopulated | t
relreplident | d
relispartition | f
relfrozenxid | 211279862
relminmxid | 122128619
relacl | {postgres=arwdDxt/postgres,usuarios=arwdDxt/postgres}
reloptions |
relpartbound |
On Tue, Apr 10, 2018 at 7:54 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> pg_control version number: 1002
Andres was also asking about his check_rel() function, from
/messages/by-id/20180319181723.ugaf7hfkluqyos5d@alap3.anarazel.de.
Can you check that out as well, please? You'll need to be able to
install the pageinspect contrib module.
--
Peter Geoghegan
2018-04-10 19:09 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> Do you think that CLUSTER was run before regular VACUUM/autovacuum
> showed this error, though?
Yes, because the table was clustered in the old database and re-clustered
when it was reloaded into version 10. But the table has not been clustered
again since.
> Have you noticed any data loss? Things look okay when you do your dump
> + restore, right? The problem, as far as you know, is strictly that
> CLUSTER + VACUUM refuse to finish/raise these multixactid errors?
Apparently I did not see any data loss. A dump/reload into a fresh database
shows the same data as production. I verified this by comparing md5sums of
files generated by a select statement (ordered by pkey) in both databases
(prod and test).
The problem is that regular vacuum does not complete. CLUSTER or VACUUM
FULL was an attempt to correct the problem.
But I'm afraid some data could become corrupted over time.
On Wed, Apr 11, 2018 at 8:31 AM, Alexandre Arruda <adaldeia@gmail.com>
wrote:
> Yes, because the table was clustered in the old database and
> re-clustered when it was reloaded into version 10. But the table has
> not been clustered again since.
I wonder if we're staring at some race condition in the visibility map,
where a previous vacuum inadvertently skips a page because the
visibility-map bit is set, thus leaving behind an unfrozen multixact. The
page then becomes !all_visible again, and the next vacuum complains about
the unfrozen multixact.
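If so, the pg_visibility extension should be able to catch it; as a sketch
(table name taken from this thread):

CREATE EXTENSION pg_visibility;
SELECT * FROM pg_check_frozen('fn06t');   -- ctids of not-actually-frozen
                                          -- tuples on all-frozen pages
SELECT * FROM pg_check_visible('fn06t');  -- likewise for all-visible pages

A non-empty result would point at exactly that sort of disagreement between
the visibility map and the heap.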
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services