found xmin from before relfrozenxid on pg_catalog.pg_authid
Getting some concerning errors in one of our databases that is on 9.5.11,
on autovacuum from template0 database pg_authid and pg_auth_members. I
only saw some notes on the list about this error related to materialized
views. FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
9.5. Here is an example:
2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33
CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_authid""",,,,""
2018-03-19 12:08:33.957 CDT,,,14892,,5aafee91.3a2c,2,,2018-03-19 12:08:33
CDT,59/340953,0,ERROR,XX001,"found xmin 2906288383 from before relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_auth_members""",,,,""
Any insight would be much appreciated.
Thanks,
Jeremy
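For context on what the error compares: pg_class is per-database, so each database carries its own relfrozenxid even for a shared catalog like pg_authid, and VACUUM raises this error when a tuple's xmin appears to precede that value. A minimal sketch for looking at the current values (run from any database you can connect to; nothing here is specific to this report beyond the two table names):
SELECT relname, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE relname IN ('pg_authid', 'pg_auth_members');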
Jeremy Finzel wrote:
Getting some concerning errors in one of our databases that is on 9.5.11,
on autovacuum from template0 database pg_authid and pg_auth_members. I
only saw some notes on the list about this error related to materialized
views. FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
9.5. Here is an example:
2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33
CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_authid""",,,,""
Can you please supply output of pg_controldata?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:
Jeremy Finzel wrote:
Getting some concerning errors in one of our databases that is on 9.5.11,
on autovacuum from template0 database pg_authid and pg_auth_members. I
only saw some notes on the list about this error related to materialized
views. FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
9.5. Here is an example:
2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
12:08:33
CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_authid""",,,,""Can you please supply output of pg_controldata?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Please forgive my accidental top-post. Here:
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6351536019599012028
Database cluster state: in production
pg_control last modified: Mon 19 Mar 2018 12:56:10 PM CDT
Latest checkpoint location: 262BE/FE96240
Prior checkpoint location: 262BA/623D5E40
Latest checkpoint's REDO location: 262BA/F5499E98
Latest checkpoint's REDO WAL file: 00000001000262BA000000F5
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 16/3132524419
Latest checkpoint's NextOID: 1090653331
Latest checkpoint's NextMultiXactId: 2142
Latest checkpoint's NextMultiOffset: 5235
Latest checkpoint's oldestXID: 1829964553
Latest checkpoint's oldestXID's DB: 12376
Latest checkpoint's oldestActiveXID: 3131774441
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:1829964553
Latest checkpoint's newestCommitTsXid:3132524418
Time of latest checkpoint: Mon 19 Mar 2018 12:54:08 PM CDT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: off
max_connections setting: 2000
max_worker_processes setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: on
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Thanks,
Jeremy
Hi Jeremy, Alvaro,
On 2018-03-19 13:00:13 -0500, Jeremy Finzel wrote:
On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:Jeremy Finzel wrote:
Getting some concerning errors in one of our databases that is on 9.5.11,
on autovacuum from template0 database pg_authid and pg_auth_members. I
only saw some notes on the list about this error related to materialized
views. FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
9.5. Here is an example:
2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
12:08:33
CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_authid""",,,,""Can you please supply output of pg_controldata?
Latest checkpoint's NextXID: 16/3132524419
Latest checkpoint's NextMultiXactId: 2142
Latest checkpoint's NextMultiOffset: 5235
Latest checkpoint's oldestXID: 1829964553
Latest checkpoint's oldestXID's DB: 12376
Latest checkpoint's oldestActiveXID: 3131774441
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16400
Hm, based on these it doesn't look like multixacts were involved (based
on oldestMultiXid it's highly unlikely there's been a multixact wraparound, and
there's not much multixact usage on system tables anyway), which
suggests that there might have been actual corruption here.
Jeremy:
- which version of 9.4 and 9.5 ran on this? Do you know?
- Can you install the pageinspect extension? If so, it might be worth running something like:
CREATE EXTENSION pageinspect;
CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
RETURNS SETOF RECORD
LANGUAGE SQL
AS $$
SELECT blockno, lp, t_xmin
FROM
generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
WHERE
t_xmin IS NOT NULL -- filter out empty items
AND t_xmin != 1 -- filter out bootstrap
AND t_xmin != 2 -- filter out frozen transaction id
AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid present
AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
$$;
SELECT * FROM check_rel('pg_authid') LIMIT 100;
and then display all items for one of the affected pages like
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
Alvaro:
- Hm, we talked about code adding context for this kind of error,
right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?
Greetings,
Andres Freund
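For readers following the infomask test in check_rel(): the two bits it masks are HEAP_XMIN_COMMITTED (0x0100) and HEAP_XMIN_INVALID (0x0200); since 9.4, both bits set together mark the xmin as frozen in place rather than rewritten to the frozen transaction id. A minimal sketch of inspecting that combination directly (block 7 is only an example block number):
SELECT lp, t_xmin,
       (t_infomask & x'0300'::int) = x'0300'::int AS xmin_frozen  -- both hint bits set
FROM heap_page_items(get_raw_page('pg_authid', 7))
WHERE t_xmin IS NOT NULL;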
On Mon, Mar 19, 2018 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
Hi Jeremy, Alvaro,
On 2018-03-19 13:00:13 -0500, Jeremy Finzel wrote:
On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <
alvherre@alvh.no-ip.org>
wrote:
Jeremy Finzel wrote:
Getting some concerning errors in one of our databases that is on
9.5.11,
on autovacuum from template0 database pg_authid and
pg_auth_members. I
only saw some notes on the list about this error related to
materialized
views. FWIW, we did use pg_upgrade to upgrade this database from
9.4 to
9.5. Here is an example:
2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
12:08:33
CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
relfrozenxid
740087784",,,,,"automatic vacuum of table
""template0.pg_catalog.pg_authid""",,,,""Can you please supply output of pg_controldata?
Latest checkpoint's NextXID: 16/3132524419
Latest checkpoint's NextMultiXactId: 2142
Latest checkpoint's NextMultiOffset: 5235
Latest checkpoint's oldestXID: 1829964553
Latest checkpoint's oldestXID's DB: 12376
Latest checkpoint's oldestActiveXID: 3131774441
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16400
Hm, based on these it doesn't look like multixacts were involved (based
on oldestMultiXid it's highly unlikely there's been a multixact wraparound, and
there's not much multixact usage on system tables anyway), which
suggests that there might have been actual corruption here.
Jeremy:
- which version of 9.4 and 9.5 ran on this? Do you know?
We upgraded to 9.5.5, and today we are running 9.5.11. And actually we
upgraded from 9.3, not 9.4. We are still trying to figure out which point
release we were on at 9.3.
- Can you install the pageinspect extension? If so, it might be worth running something like:
CREATE EXTENSION pageinspect;
CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
RETURNS SETOF RECORD
LANGUAGE SQL
AS $$
SELECT blockno, lp, t_xmin
FROM
generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
WHERE
t_xmin IS NOT NULL -- filter out empty items
AND t_xmin != 1 -- filter out bootstrap
AND t_xmin != 2 -- filter out frozen transaction id
AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid present
AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
$$;
SELECT * FROM check_rel('pg_authid') LIMIT 100;
Small note - Needs to be this because != is not supported for xid:
AND NOT t_xmin = 1 -- filter out bootstrap
AND NOT t_xmin = 2 -- filter out frozen transaction id
and then display all items for one of the affected pages like
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
Alvaro:
- Hm, we talked about code adding context for this kind of error,
right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?
Greetings,
Andres Freund
The function does NOT show any issue with either of those tables.
One very interesting thing that is puzzling us: we have taken several SAN
snapshots of the live system, running the exact same version 9.5.11, and
they do NOT show the same error when we vacuum these tables. It makes us
wonder if simply a db restart would solve the issue.
We will continue to investigate but interested in your feedback about what
we have seen thus far.
Thanks,
Jeremy
On 2018-03-19 14:37:24 -0500, Jeremy Finzel wrote:
We upgraded to 9.5.5, and today we are running 9.5.11. And actually we
upgraded from 9.3, not 9.4. We are still trying to figure out which point
release we were on at 9.3.
Ok. IIRC there used to be a bug a few years back that sometimes led to
highly contended pages being skipped during vacuum, and we'd still
update relfrozenxid. IIRC it required the table to be extended at the
same time or something?
- Can you install the pageinspect extension? If so, it might be worth running something like:
CREATE EXTENSION pageinspect;
CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
RETURNS SETOF RECORD
LANGUAGE SQL
AS $$
SELECT blockno, lp, t_xmin
FROM
generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
WHERE
t_xmin IS NOT NULL -- filter out empty items
AND t_xmin != 1 -- filter out bootstrap
AND t_xmin != 2 -- filter out frozen transaction id
AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid present
AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
$$;
SELECT * FROM check_rel('pg_authid') LIMIT 100;
Small note - Needs to be this because != is not supported for xid:
AND NOT t_xmin = 1 -- filter out bootstrap
AND NOT t_xmin = 2 -- filter out frozen transaction id
Only on older releases ;). But yea, that looks right.
and then display all items for one of the affected pages like
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
Alvaro:
- Hm, we talked about code adding context for this kind of error,
right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?
Greetings,
Andres Freund
The function does NOT show any issue with either of those tables.
Uh, huh? Alvaro, do you see a bug in my query?
Greetings,
Andres Freund
FWIW, if I remove the last filter, I get these rows, and I believe row 7/57/
2906288382 is the one generating the error:
SELECT * FROM check_rel('pg_authid') LIMIT 100;
blockno | lp | xmin
---------+----+------------
7 | 4 | 2040863716
7 | 5 | 2040863716
7 | 8 | 2041172882
7 | 9 | 2041172882
7 | 12 | 2041201779
7 | 13 | 2041201779
7 | 16 | 2089742733
7 | 17 | 2090021318
7 | 18 | 2090021318
7 | 47 | 2090021898
7 | 48 | 2090021898
7 | 49 | 2102749003
7 | 50 | 2103210571
7 | 51 | 2103210571
7 | 54 | 2154640913
7 | 55 | 2163849781
7 | 56 | 2295315714
7 | 57 | 2906288382
7 | 58 | 2906329443
7 | 60 | 3131766386
8 | 1 | 2089844462
8 | 2 | 2089844462
8 | 3 | 2089844463
8 | 6 | 2089844463
8 | 9 | 2295318868
(25 rows)
Hi,
On 2018-03-19 14:53:58 -0500, Jeremy Finzel wrote:
FWIW, if I remove the last filter, I get these rows and I believe row 7/57/
2906288382 is the one generating error:
Oh, yea, that makes sense. It's wrapped around and looks like it's from
the future.
SELECT * FROM check_rel('pg_authid') LIMIT 100;
blockno | lp | xmin
---------+----+------------
7 | 4 | 2040863716
7 | 5 | 2040863716
7 | 8 | 2041172882
7 | 9 | 2041172882
7 | 12 | 2041201779
7 | 13 | 2041201779
7 | 16 | 2089742733
7 | 17 | 2090021318
7 | 18 | 2090021318
7 | 47 | 2090021898
7 | 48 | 2090021898
7 | 49 | 2102749003
7 | 50 | 2103210571
7 | 51 | 2103210571
7 | 54 | 2154640913
7 | 55 | 2163849781
7 | 56 | 2295315714
7 | 57 | 2906288382
7 | 58 | 2906329443
7 | 60 | 3131766386
8 | 1 | 2089844462
8 | 2 | 2089844462
8 | 3 | 2089844463
8 | 6 | 2089844463
8 | 9 | 2295318868
(25 rows)
Could you show the contents of those two pages with a query like I had
in an earlier email?
Greetings,
Andres Freund
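To put numbers on "wrapped around": xid comparisons are circular, so, roughly speaking, xmin A counts as being "before" relfrozenxid B whenever (A - B) modulo 2^32 is at least 2^31. A rough sketch of that arithmetic with the values from the error message (plain arithmetic only, not the server's code path):
SELECT 2906288382::bigint - 740087784 AS difference,                       -- 2166200598
       (2906288382::bigint - 740087784) >= 2147483648 AS counts_as_before; -- true: >= 2^31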
SELECT heap_page_items(get_raw_page('pg_authid', 7));
heap_page_items
--------------------------------------------------------------------------------------------------------------------------------------------
(1,4720,1,108,1897434979,0,0,"(7,1)",11,2825,32,111111111000000000000000000000000000000001011000010011111100001001111000,507769370)
(2,4608,1,108,1897442758,0,18,"(7,2)",11,2825,32,111111111000000000000000000000000000000011000011101100000010001001111000,507776451)
(3,4496,1,108,1897442758,0,20,"(7,3)",11,2825,32,111111111000000000000000000000000000000000100011101100000010001001111000,507776452)
(4,4384,1,108,2040863716,0,37,"(7,4)",11,2313,32,111111111000000000000000000000000000000000110110111011100011001011111000,525105004)
(5,4272,1,108,2040863716,0,39,"(7,5)",11,2313,32,111111111000000000000000000000000000000010110110111011100011001011111000,525105005)
(6,0,3,0,,,,,,,,,)
(7,0,3,0,,,,,,,,,)
(8,4160,1,108,2041172882,0,49,"(7,8)",11,2313,32,111111111000000000000000000000000000000001110100101011000111001011111000,525219118)
(9,4048,1,108,2041172882,0,51,"(7,9)",11,2313,32,111111111000000000000000000000000000000011110100101011000111001011111000,525219119)
(10,0,3,0,,,,,,,,,)
(11,0,3,0,,,,,,,,,)
(12,3936,1,108,2041201779,0,181,"(7,12)",11,2313,32,111111111000000000000000000000000000000011010100010111100111001011111000,525236779)
(13,3824,1,108,2041201779,0,183,"(7,13)",11,2313,32,111111111000000000000000000000000000000000110100010111100111001011111000,525236780)
(14,0,3,0,,,,,,,,,)
(15,0,3,0,,,,,,,,,)
(16,3712,1,108,2089742733,0,0,"(7,16)",11,2313,32,111111111000000000000000000000000000000001000101110011100000001111111000,532706210)
(17,3600,1,108,2090021318,0,1,"(7,17)",11,2313,32,111111111000000000000000000000000000000001001100001101001000001111111000,532753458)
(18,3488,1,108,2090021318,0,3,"(7,18)",11,2313,32,111111111000000000000000000000000000000011001100001101001000001111111000,532753459)
(19,8080,1,108,753125991,0,2,"(7,19)",11,2825,32,111111111000000000000000000000000000000000110001010111001011100001110000,236796556)
(20,7968,1,108,753125991,0,4,"(7,20)",11,2825,32,111111111000000000000000000000000000000010110001010111001011100001110000,236796557)
(21,7856,1,108,753125992,0,19,"(7,21)",11,2825,32,111111111000000000000000000000000000000011001001010111001011100001110000,236796563)
(22,7744,1,108,753125992,0,21,"(7,22)",11,2825,32,111111111000000000000000000000000000000000101001010111001011100001110000,236796564)
(23,7632,1,108,753125993,0,36,"(7,23)",11,2825,32,111111111000000000000000000000000000000001011001010111001011100001110000,236796570)
(24,7520,1,108,753125993,0,38,"(7,24)",11,2825,32,111111111000000000000000000000000000000011011001010111001011100001110000,236796571)
(25,7408,1,108,753125994,0,53,"(7,25)",11,2825,32,111111111000000000000000000000000000000010000101010111001011100001110000,236796577)
(26,7296,1,108,753125994,0,55,"(7,26)",11,2825,32,111111111000000000000000000000000000000001000101010111001011100001110000,236796578)
(27,7184,1,108,753125995,0,70,"(7,27)",11,2825,32,111111111000000000000000000000000000000000010101010111001011100001110000,236796584)
(28,7072,1,108,753125995,0,72,"(7,28)",11,2825,32,111111111000000000000000000000000000000010010101010111001011100001110000,236796585)
(29,6960,1,108,753125997,0,87,"(7,29)",11,2825,32,111111111000000000000000000000000000000011001101010111001011100001110000,236796595)
(30,6848,1,108,753125997,0,89,"(7,30)",11,2825,32,111111111000000000000000000000000000000000101101010111001011100001110000,236796596)
(31,6736,1,108,753125998,0,104,"(7,31)",11,2825,32,111111111000000000000000000000000000000001011101010111001011100001110000,236796602)
(32,6624,1,108,753125998,0,106,"(7,32)",11,2825,32,111111111000000000000000000000000000000011011101010111001011100001110000,236796603)
(33,6512,1,108,753125999,0,121,"(7,33)",11,2825,32,111111111000000000000000000000000000000010000011010111001011100001110000,236796609)
(34,6400,1,108,753125999,0,123,"(7,34)",11,2825,32,111111111000000000000000000000000000000001000011010111001011100001110000,236796610)
(35,6288,1,108,753126000,0,138,"(7,35)",11,2825,32,111111111000000000000000000000000000000000010011010111001011100001110000,236796616)
(36,6176,1,108,753126000,0,140,"(7,36)",11,2825,32,111111111000000000000000000000000000000010010011010111001011100001110000,236796617)
(37,6064,1,108,753126001,0,155,"(7,37)",11,2825,32,111111111000000000000000000000000000000011110011010111001011100001110000,236796623)
(38,5952,1,108,753126001,0,157,"(7,38)",11,2825,32,111111111000000000000000000000000000000000001011010111001011100001110000,236796624)
(39,5840,1,108,753126002,0,172,"(7,39)",11,2825,32,111111111000000000000000000000000000000001101011010111001011100001110000,236796630)
(40,5728,1,108,753126002,0,174,"(7,40)",11,2825,32,111111111000000000000000000000000000000011101011010111001011100001110000,236796631)
(41,5616,1,108,753126003,0,189,"(7,41)",11,2825,32,111111111000000000000000000000000000000010111011010111001011100001110000,236796637)
(42,5504,1,108,753126003,0,191,"(7,42)",11,2825,32,111111111000000000000000000000000000000001111011010111001011100001110000,236796638)
(43,5392,1,108,753126004,0,206,"(7,43)",11,2825,32,111111111000000000000000000000000000000000100111010111001011100001110000,236796644)
(44,5280,1,108,753126004,0,208,"(7,44)",11,2825,32,111111111000000000000000000000000000000010100111010111001011100001110000,236796645)
(45,5168,1,108,753126005,0,223,"(7,45)",11,2825,32,111111111000000000000000000000000000000011010111010111001011100001110000,236796651)
(46,5056,1,108,753126005,0,225,"(7,46)",11,2825,32,111111111000000000000000000000000000000000110111010111001011100001110000,236796652)
(47,3376,1,108,2090021898,0,18,"(7,47)",11,2313,32,111111111000000000000000000000000000000001000111001101001000001111111000,532753634)
(48,3264,1,108,2090021898,0,20,"(7,48)",11,2313,32,111111111000000000000000000000000000000000100111001101001000001111111000,532753636)
(49,3152,1,108,2102749003,0,0,"(7,49)",11,2313,32,111111111000000000000000000000000000000001111011101110010000110100000100,548445662)
(50,3040,1,108,2103210571,0,107,"(7,50)",11,2313,32,111111111000000000000000000000000000000000010110101100100100110100000100,548556136)
(51,2928,1,108,2103210571,0,109,"(7,51)",11,2313,32,111111111000000000000000000000000000000010010110101100100100110100000100,548556137)
(52,0,3,0,,,,,,,,,)
(53,0,3,0,,,,,,,,,)
(54,2816,1,108,2154640913,0,0,"(7,54)",11,2313,32,111111111000000000000000000000000000000010111111010010000000100010000100,554701565)
(55,2704,1,108,2163849781,0,0,"(7,55)",11,2313,32,111111111000000000000000000000000000000000110000001011101001100010000100,555316236)
(56,2592,1,108,2295315714,0,0,"(7,56)",11,2313,32,111111111000000000000000000000000000000000100101010100001000010001000100,572590756)
(57,2480,1,108,2906288382,0,0,"(7,57)",11,2313,32,111111111000000000000000000000000000000000001000101101000100111001111100,1047670032)
(58,2368,1,108,2906329443,0,0,"(7,58)",11,2313,32,111111111000000000000000000000000000000001111001101101000100111001111100,1047670174)
(59,60,2,0,,,,,,,,,)
(60,2224,1,144,3131766386,0,0,"(7,60)",32779,10507,32,111111111100000000000000000000000000000000001101010001010011111100000010,1090298544)
(61,0,0,0,,,,,,,,,)
(62,0,0,0,,,,,,,,,)
(63,0,0,0,,,,,,,,,)
(64,0,0,0,,,,,,,,,)
(65,0,0,0,,,,,,,,,)
(66,0,0,0,,,,,,,,,)
(67,0,0,0,,,,,,,,,)
(68,0,0,0,,,,,,,,,)
(69,4944,1,108,1034607755,0,1,"(7,69)",11,2825,32,111111111000000000000000000000000000000000010111011000010010010100001000,279217896)
(70,4832,1,108,1034607755,0,3,"(7,70)",11,2825,32,111111111000000000000000000000000000000010010111011000010010010100001000,279217897)
(70 rows)
Does the fact that a snapshot does not have this issue suggest it could be
memory-related corruption and a db restart could clear it up?
Thanks,
Jeremy
Hi,
On 2018-03-19 15:37:51 -0500, Jeremy Finzel wrote:
Does the fact that a snapshot does not have this issue suggest it could be
memory-related corruption and a db restart could clear it up?
Could you show the page from the snapshot? I suspect it might just be a
problem that's temporarily not visible as corrupted.
Greetings,
Andres Freund
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
SELECT heap_page_items(get_raw_page('pg_authid', 7));
Can you post this?
SELECT * FROM page_header(get_raw_page('pg_authid', 7));
--
Peter Geoghegan
On Mon, Mar 19, 2018 at 3:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
SELECT heap_page_items(get_raw_page('pg_authid', 7));
Can you post this?
SELECT * FROM page_header(get_raw_page('pg_authid', 7));
--
Peter Geoghegan
@Peter :
staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
      lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
----------------+----------+-------+-------+-------+---------+----------+---------+-----------
 262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
(1 row)
@Andres :
This is from the snapshot (on 9.5.12, but we didn't have the error on a
9.5.11 snap either):
heap_page_items
-----------------------------------------------------------------------------------------------------------------------------------------
(1,0,0,0,,,,,,,,,)
(2,0,0,0,,,,,,,,,)
(3,0,0,0,,,,,,,,,)
(4,0,0,0,,,,,,,,,)
(5,0,0,0,,,,,,,,,)
(6,8080,1,108,3137434815,0,0,"(7,6)",11,10505,32,111111111000000000000000000000000000000001111110111100001000010010000001,2166427518)
(7,7936,1,144,3137434816,0,0,"(7,7)",11,10507,32,111111111100000000000000000000000000000001110111000001110101101100001011,3504005358)
(8,0,0,0,,,,,,,,,)
(9,0,0,0,,,,,,,,,)
(10,7792,1,144,3137434817,0,0,"(7,10)",11,10507,32,111111111100000000000000000000000000000001010110001101000011011111101000,401353834)
(11,7680,1,108,3137434818,0,0,"(7,11)",11,10505,32,111111111000000000000000000000000000000001100011010010010001000001100001,2248708806)
(12,0,0,0,,,,,,,,,)
(13,0,0,0,,,,,,,,,)
(14,7568,1,108,3137434819,0,0,"(7,14)",11,10505,32,111111111000000000000000000000000000000001110011110011011011100010100101,2770187214)
(15,7456,1,108,3137434820,0,0,"(7,15)",11,10505,32,111111111000000000000000000000000000000011110001010001010011110010100001,2235343503)
(16,0,0,0,,,,,,,,,)
(17,0,0,0,,,,,,,,,)
(18,0,0,0,,,,,,,,,)
(19,0,0,0,,,,,,,,,)
(20,0,0,0,,,,,,,,,)
(21,0,0,0,,,,,,,,,)
(22,0,0,0,,,,,,,,,)
(23,0,0,0,,,,,,,,,)
(24,0,0,0,,,,,,,,,)
(25,0,0,0,,,,,,,,,)
(26,0,0,0,,,,,,,,,)
(27,0,0,0,,,,,,,,,)
(28,0,0,0,,,,,,,,,)
(29,0,0,0,,,,,,,,,)
(30,0,0,0,,,,,,,,,)
(31,0,0,0,,,,,,,,,)
(32,0,0,0,,,,,,,,,)
(33,0,0,0,,,,,,,,,)
(34,0,0,0,,,,,,,,,)
(35,0,0,0,,,,,,,,,)
(36,0,0,0,,,,,,,,,)
(37,0,0,0,,,,,,,,,)
(38,0,0,0,,,,,,,,,)
(39,0,0,0,,,,,,,,,)
(40,0,0,0,,,,,,,,,)
(41,0,0,0,,,,,,,,,)
(42,0,0,0,,,,,,,,,)
(43,0,0,0,,,,,,,,,)
(44,0,0,0,,,,,,,,,)
(45,0,0,0,,,,,,,,,)
(46,0,0,0,,,,,,,,,)
(47,0,0,0,,,,,,,,,)
(48,0,0,0,,,,,,,,,)
(49,0,0,0,,,,,,,,,)
(50,0,0,0,,,,,,,,,)
(51,0,0,0,,,,,,,,,)
(52,7344,1,108,3137434821,0,0,"(7,52)",11,10505,32,111111111000000000000000000000000000000011011011111110001010010101000001,2191859675)
(53,7232,1,108,3137434822,0,0,"(7,53)",11,10505,32,111111111000000000000000000000000000000001101011010111100110011011100100,661027542)
(54,0,0,0,,,,,,,,,)
(55,0,0,0,,,,,,,,,)
(56,0,0,0,,,,,,,,,)
(57,0,0,0,,,,,,,,,)
(58,0,0,0,,,,,,,,,)
(59,0,0,0,,,,,,,,,)
(60,0,0,0,,,,,,,,,)
(61,7120,1,108,3137434823,0,0,"(7,61)",11,10505,32,111111111000000000000000000000000000000000010111010110000101010111010100,732568296)
(62,6976,1,144,3137434824,0,0,"(7,62)",11,10507,32,111111111100000000000000000000000000000010100100001001001010110000010100,674571301)
(63,6864,1,108,3137434825,0,0,"(7,63)",11,10505,32,111111111000000000000000000000000000000010011011011111011100001001010001,2319695577)
(64,6720,1,144,3137434826,0,0,"(7,64)",11,10507,32,111111111100000000000000000000000000000001000010011001111011100100101000,345892418)
(65,6608,1,108,3137434827,0,0,"(7,65)",11,10505,32,111111111000000000000000000000000000000001000001010100010010101011001010,1398049410)
(66,6496,1,108,3137434828,0,0,"(7,66)",11,10505,32,111111111000000000000000000000000000000000101000111100110101001000101010,1414188820)
(67,6384,1,108,3137434829,0,0,"(7,67)",11,10505,32,111111111000000000000000000000000000000000110101011101111011001110101001,2513301164)
(68,0,0,0,,,,,,,,,)
(69,0,0,0,,,,,,,,,)
(70,0,0,0,,,,,,,,,)
(70 rows)
Thanks,
Jeremy
On Mon, Mar 19, 2018 at 1:55 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
@Peter :
staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
      lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
----------------+----------+-------+-------+-------+---------+----------+---------+-----------
 262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
(1 row)
Thanks.
That looks normal. I wonder if the contents of that page look
consistent with the rest of the table on manual inspection,
though. I recently saw system catalog corruption on a 9.5 instance
where an entirely different relation's page ended up in pg_attribute
and pg_depend. They were actually pristine index pages from an
application index. I still have no idea why this happened.
This is very much a guess, but it can't hurt to check if the contents
of the tuples themselves are actually sane by inspecting them with
"SELECT * FROM pg_authid". heap_page_items() doesn't actually care
about the shape of the tuples in the page, so this might have been
missed.
--
Peter Geoghegan
The data all looks fine. I even forced an index scan on both indexes,
which also looks fine.
Thanks,
Jeremy
Any suggestions as to what I can do from this point? Is it feasible that a
server restart would fix this, or are you saying it would just mask the
problem? I can't reproduce it on a snapshot, which is quite odd.
FWIW, maybe this is obvious, but the pages on the streamers we have look
identical to the master.
Thanks,
Jeremy
A server restart and upgrade to 9.5.12 (done at the same time), as expected,
made the issue go away. That still doesn't give us any answers as to what
happened or whether it would happen again! Thanks for the feedback.
Jeremy
On Wed, Mar 21, 2018 at 1:38 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
A server restart and upgrade to 9.5.12 (done at the same time), as expected, made
the issue go away. That still doesn't give us any answers as to what happened or
whether it would happen again! Thanks for the feedback.
You may still want to use amcheck to look for problems. The version on
Github works with 9.5, and there are Redhat and Debian pgdg packages.
See:
https://github.com/petergeoghegan/amcheck
The "heapallindexed" option will be of particular interest to you -
that option verifies that the table has matching rows for a target
index (in addition to testing the structure of a target B-Tree index
itself). This is probably the best general test for corruption that is
available. There is a fair chance that this will reveal new
information.
--
Peter Geoghegan
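As a sketch, a targeted run against just the two catalogs from this thread could look like the following, assuming the two-argument bt_index_check(index, heapallindexed) form from the Github amcheck that Peter describes; the index names are the standard catalog index names:
-- pg_authid indexes
SELECT bt_index_check('pg_authid_rolname_index'::regclass, true);
SELECT bt_index_check('pg_authid_oid_index'::regclass, true);
-- pg_auth_members indexes
SELECT bt_index_check('pg_auth_members_role_member_index'::regclass, true);
SELECT bt_index_check('pg_auth_members_member_role_index'::regclass, true);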
Thank you for the recommendation. I ran both amcheck functions on all 4
indexes of those 2 tables with heapallindexed = true, but no issues were
found.
Thanks,
Jeremy
I admit I'm pretty surprised by this whole episode. I have no useful
advice to offer here.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
Thank you for the recommendation. I ran both amcheck functions on all 4
indexes of those 2 tables with heapallindexed = true, but no issues were
found.
Probably wouldn't hurt to run it against all indexes, if you can make
time for that. You can generalize from the example query that calls
the bt_index_check() function: set
"heapallindexed=>i.indisprimary" and remove "n.nspname =
'pg_catalog'", as well as "LIMIT 10". This will test tables and
indexes from all schemas, which might be interesting.
--
Peter Geoghegan
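A sketch of that generalized query, adapted from the example in the amcheck documentation with the edits Peter describes (heapallindexed only for primary-key indexes, no pg_catalog filter, no LIMIT):
SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;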
I am running this on a SAN snapshot of our production system. I assume
that this will give me a valid check for file-system-level corruption. I
am going to kick it off and see if I find anything interesting.
Thanks,
Jeremy
On Thu, Mar 22, 2018 at 2:24 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
I am running this on a SAN snapshot of our production system. I assume that
this will give me a valid check for file-system-level corruption. I am
going to kick it off and see if I find anything interesting.
It might. Note that SAN snapshots might have corruption hazards,
though. Postgres expects crash consistency across all filesystems, so
you might run into trouble if you had a separate filesystem for WAL,
for example. I know that LVM snapshots only provide a consistent view
of a single logical volume, even though many LVM + Postgres setups
will involve multiple logical volumes. This makes it possible for a
small inconsistency across logical volumes to corrupt data.
I don't know anything about your SAN snapshotting, but this is at
least something to consider.
--
Peter Geoghegan
Hi everyone,
I just got the same issue on 9.6.8:
2018-05-15 11:52:01 MSK 33558 @ from [vxid:317/92895305 txid:0] [] ERROR:
found xmin 2808837517 from before relfrozenxid 248712603
2018-05-15 11:52:01 MSK 33558 @ from [vxid:317/92895305 txid:0] []
CONTEXT: automatic vacuum of table "template0.pg_catalog.pg_authid"
Additional details:
1) bt_index_check - no errors on either index.
2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum
freeze - doesn't help.
3) Six rows affected in total:
template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where
xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where
relname='pg_authid');
oid | ctid | xmin | xmax | cmin | cmax
-----------+--------+------------+------+------+------
183671986 | (0,90) | 3039161773 | 0 | 0 | 0
183106699 | (1,48) | 2576823237 | 0 | 0 | 0
183921770 | (1,50) | 3265971811 | 0 | 0 | 0
183921914 | (1,52) | 3266122344 | 0 | 0 | 0
187988173 | (1,58) | 4258893789 | 0 | 0 | 0
182424977 | (1,62) | 2808837517 | 0 | 0 | 0
across two pages in total.
4) template1=# select relfrozenxid from pg_catalog.pg_class where relname =
'pg_authid';
relfrozenxid
--------------
2548304492
5) The rows themselves look valid and correspond to actual database
users.
7) No database/server crash has happened in the last few years, and there
are no disk errors/problems.
I feel it could be related to the vacuum skip-locked-pages patch + freeze +
shared catalog combination, but I cannot prove it yet.
Looking for a possible course of action.
Probably the simplest fix is to drop and recreate these 6 affected users,
but for now I'm willing to spend some time researching this issue.
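A hypothetical sketch of that last workaround (the role name is invented);
dropping and re-creating a role writes fresh pg_authid and pg_auth_members
tuples with a current xmin:

-- any objects owned by or granted to the role have to be dealt with first
DROP ROLE broken_role;
CREATE ROLE broken_role LOGIN;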
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
On 2018-05-15 11:06:38 +0200, Maxim Boguk wrote:
Hi everyone,
I just got the same issue on 9.6.8:
2018-05-15 11:52:01 MSK 33558 @ from [vxid:317/92895305 txid:0] [] ERROR:
found xmin 2808837517 from before relfrozenxid 248712603
2018-05-15 11:52:01 MSK 33558 @ from [vxid:317/92895305 txid:0] []
CONTEXT: automatic vacuum of table "template0.pg_catalog.pg_authid"

Additional details:
1) bt_index_check - no errors on either index.
2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum
freeze - doesn't help.
Yea, too late.
3) Six rows affected in total:
template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where
xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where
relname='pg_authid');
oid | ctid | xmin | xmax | cmin | cmax
-----------+--------+------------+------+------+------
183671986 | (0,90) | 3039161773 | 0 | 0 | 0
183106699 | (1,48) | 2576823237 | 0 | 0 | 0
183921770 | (1,50) | 3265971811 | 0 | 0 | 0
183921914 | (1,52) | 3266122344 | 0 | 0 | 0
187988173 | (1,58) | 4258893789 | 0 | 0 | 0
182424977 | (1,62) | 2808837517 | 0 | 0 | 0
across two pages in total.

4) template1=# select relfrozenxid from pg_catalog.pg_class where relname =
'pg_authid';
relfrozenxid
--------------
2548304492

5) The rows themselves look valid and correspond to actual database
users.
7) No database/server crash has happened in the last few years, and there
are no disk errors/problems.

I feel it could be related to the vacuum skip-locked-pages patch + freeze +
shared catalog combination, but I cannot prove it yet.
Looking for a possible course of action.
Probably the simplest fix is to drop and recreate these 6 affected users,
but for now I'm willing to spend some time researching this issue.
Could you use pageinspect to get the infomasks for the affected tuples?
Greetings,
Andres Freund
Hi Andres,
Looking for a possible course of action.
Probably the simplest fix is to drop and recreate these 6 affected users,
but for now I'm willing to spend some time researching this issue.

Could you use pageinspect to get the infomasks for the affected tuples?
Greetings,
Andres Freund
For example:
postgres=# vacuum pg_catalog.pg_authid;
ERROR: found xmin 2894889518 from before relfrozenxid 248712603
select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
xmin::text::bigint=2894889518;
ctid | xmin | xmax | cmin | cmax
--------+------------+------+------+------
(1,26) | 2894889518 | 0 | 0 | 0
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
where t_ctid::text='(1,26)';
-[ RECORD 1
]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lp | 26
lp_off | 4656
lp_flags | 1
lp_len | 144
t_xmin | 2894889518
t_xmax | 0
t_field3 | 0
t_ctid | (1,26)
t_infomask2 | 32779
t_infomask | 10507
t_hoff | 32
t_bits | 1111111111000000
t_oid | 189787727
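As a sketch, the xmin-related bits of t_infomask can also be decoded
directly in SQL, using the flag values from
src/include/access/htup_details.h (HEAP_XMIN_COMMITTED = 0x0100,
HEAP_XMIN_INVALID = 0x0200, both together meaning frozen,
HEAP_XMAX_INVALID = 0x0800):

SELECT lp, t_xmin,
       (t_infomask & 256) <> 0  AS xmin_committed,   -- 0x0100
       (t_infomask & 512) <> 0  AS xmin_invalid,     -- 0x0200
       (t_infomask & 768) = 768 AS xmin_frozen,      -- 0x0300 (both bits)
       (t_infomask & 2048) <> 0 AS xmax_invalid      -- 0x0800
FROM heap_page_items(get_raw_page('pg_authid', 1))
WHERE lp = 26;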
Any new role created in the DB is instantly affected by this issue.
At the same time:
select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
2863429136
So it's interesting where the value in "from before relfrozenxid 248712603"
comes from.
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
Hi,
On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:
For sample:
postgres=# vacuum pg_catalog.pg_authid;
ERROR: found xmin 2894889518 from before relfrozenxid 248712603

select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
xmin::text::bigint=2894889518;
ctid | xmin | xmax | cmin | cmax
--------+------------+------+------+------
(1,26) | 2894889518 | 0 | 0 | 0

postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
where t_ctid::text='(1,26)';
-[ RECORD 1
]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lp | 26
lp_off | 4656
lp_flags | 1
lp_len | 144
t_xmin | 2894889518
t_xmax | 0
t_field3 | 0
t_ctid | (1,26)
t_infomask2 | 32779
t_infomask | 10507
t_hoff | 32
t_bits | 1111111111000000
t_oid | 189787727
So this row has, if I didn't screw up the decoding, the following infomask
bits set:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED
So it's not been previously frozen, which I was wondering about.
Which indeed makes it
Any new role created in the DB is instantly affected by this issue.
What's txid_current()?
At the same time:
select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
2863429136

So it's interesting where the value in "from before relfrozenxid 248712603"
comes from.
Hm. That's indeed odd. Could you get a backtrace of the error with "bt
full"?
Greetings,
Andres Freund
On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
What's txid_current()?

Hm. That's indeed odd. Could you get a backtrace of the error with "bt
full"?
select txid_current();
txid_current
--------------
41995913769
About gdb bt - it's tricky because it is the mission-critical master db of
a huge project.
I'll try to promote a backup replica and check whether the issue persists
there, and if yes - we will have our playground for a while, but it will
require some time to arrange.
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
Hi,
On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
2863429136
select txid_current();
txid_current
--------------
41995913769
So that's an xid of 3341208114, if you leave the epoch out. What's
autovacuum_freeze_max_age set to in that cluster?
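(A quick sketch of that conversion: the epoch is the upper 32 bits of the
64-bit txid_current() value, so stripping it is a modulo operation.)

SELECT txid_current() % (2^32)::bigint AS xid_without_epoch;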
Can you show pg_controldata output, and relminmxid from that cluster?
I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error: 248712603
About gdb bt - it's tricky because it is the mission-critical master db of
a huge project.
I'll try to promote a backup replica and check whether the issue persists
there, and if yes - we will have our playground for a while, but it will
require some time to arrange.
You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that... I'd appreciate it if
you set up the playground, because this seems like something that'll
reappear.
Greetings,
Andres Freund
On Tue, May 22, 2018 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de>
wrote:
select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
2863429136

select txid_current();
txid_current
--------------
41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
autovacuum_freeze_max_age set to in that cluster?
postgres=# show autovacuum_freeze_max_age;
autovacuum_freeze_max_age
---------------------------
200000000
(default value I think)
Can you show pg_controldata output, and
relminmxid from that cluster?
postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata -D
/var/lib/postgresql/9.6/main
pg_control version number: 960
Catalog version number: 201608131
Database system identifier: 6469368654711450114
Database cluster state: in production
pg_control last modified: Tue 22 May 2018 10:20:14 PM MSK
Latest checkpoint location: CCB5/F9C37950
Prior checkpoint location: CCB0/43F316B0
Latest checkpoint's REDO location: CCB1/6706BD88
Latest checkpoint's REDO WAL file: 000000010000CCB100000067
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 9:3341161759
Latest checkpoint's NextOID: 190071899
Latest checkpoint's NextMultiXactId: 59416233
Latest checkpoint's NextMultiOffset: 215588532
Latest checkpoint's oldestXID: 2814548646
Latest checkpoint's oldestXID's DB: 16400
Latest checkpoint's oldestActiveXID: 3341161759
Latest checkpoint's oldestMultiXid: 54264778
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:2814548646
Latest checkpoint's newestCommitTsXid:3341161758
Time of latest checkpoint: Tue 22 May 2018 10:05:16 PM MSK
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 2000
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: on
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
postgres=# select datname,datfrozenxid,datminmxid from pg_database order by
datname;
datname | datfrozenxid | datminmxid
-----------+--------------+------------
** | 2815939794 | 54265194
postgres | 2863429136 | 54280819
template0 | 3148297669 | 59342106
template1 | 2816765546 | 59261794
I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error: 248712603

About gdb bt - it's tricky because it is the mission-critical master db of
a huge project.
I'll try to promote a backup replica and check whether the issue persists
there, and if yes - we will have our playground for a while, but it will
require some time to arrange.

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that... I'd appreciate it if
you set up the playground, because this seems like something that'll
reappear.
My gdb kung-fu is exceptionally rusty now, so I'd better play in a sandbox.
And if the error doesn't exist on a freshly promoted replica, that will give
us useful information as well.
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
All -
We just noticed that this exact same problem has now returned for us. It
has exactly the same symptoms as my original post (including the fact that
a snapshot does not have the issue).
I am open to all suggestions as to troubleshooting this further!
Thank you,
Jeremy
About gdb bt - it's tricky because it is the mission-critical master db of
a huge project.
I'll try to promote a backup replica and check whether the issue persists
there, and if yes - we will have our playground for a while, but it will
require some time to arrange.

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that... I'd appreciate it if
you set up the playground, because this seems like something that'll
reappear.

Greetings,
Andres Freund
OK, this issue is reproducible on the promoted replica.
So now I have my playground.
Will provide gdb info in a few hours.
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
bt full :
#0 errmsg_internal (fmt=0x555b62e6eb70 "found xmin %u from before
relfrozenxid %u") at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/utils/error/elog.c:828
edata = <optimized out>
__func__ = "errmsg_internal"
#1 0x0000555b62ad1cb9 in heap_prepare_freeze_tuple (tuple=<optimized out>,
relfrozenxid=relfrozenxid@entry=248720453,
relminmxid=relminmxid@entry=53644256,
cutoff_xid=3485221679,
cutoff_multi=<optimized out>, frz=frz@entry=0x555b640d1988,
totally_frozen_p=0x7ffca32c0e90 "\001")
at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/access/heap/heapam.c:6685
changed = <optimized out>
freeze_xmax = 0 '\000'
xid = <optimized out>
totally_frozen = <optimized out>
__func__ = "heap_prepare_freeze_tuple"
#2 0x0000555b62bfd2b5 in lazy_scan_heap (aggressive=0 '\000', nindexes=2,
Irel=0x555b64095948, vacrelstats=<optimized out>, options=26,
onerel=0x555b64029498)
at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:1090
tuple_totally_frozen = 1 '\001'
itemid = 0x7f2b7bc5d4fc
buf = 18138842
page = <optimized out>
offnum = 26
maxoff = 26
hastup = 1 '\001'
nfrozen = 0
freespace = <optimized out>
all_frozen = 1 '\001'
tupgone = 0 '\000'
prev_dead_count = 0
all_visible_according_to_vm = 0 '\000'
all_visible = 1 '\001'
has_dead_tuples = 0 '\000'
visibility_cutoff_xid = 3490221678
relname = 0x555b640296a8 "pg_authid"
relminmxid = 53644256
tups_vacuumed = 0
indstats = 0x555b64095928
relfrozenxid = 248720453
skipping_blocks = 0 '\000'
frozen = 0x555b640d1988
initprog_val = {1, 2, 582}
vacuumed_pages = 0
num_tuples = 80
nkeep = 0
ru0 = {tv = {tv_sec = 1527175061, tv_usec = 739743}, ru = {ru_utime
= {tv_sec = 0, tv_usec = 60000}, ru_stime = {tv_sec = 1, tv_usec = 416000},
{ru_maxrss = 9704,
__ru_maxrss_word = 9704}, {ru_ixrss = 0, __ru_ixrss_word =
0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word =
0}, {ru_minflt = 33982,
__ru_minflt_word = 33982}, {ru_majflt = 0, __ru_majflt_word =
0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word
= 0}, {ru_oublock = 8,
__ru_oublock_word = 8}, {ru_msgsnd = 0, __ru_msgsnd_word =
0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0,
__ru_nsignals_word = 0}, {ru_nvcsw = 38446,
__ru_nvcsw_word = 38446}, {ru_nivcsw = 2, __ru_nivcsw_word =
2}}}
vmbuffer = 8763411
empty_pages = 0
nunused = 0
i = <optimized out>
next_unskippable_block = <optimized out>
buf = {data = 0x1 <error: Cannot access memory at address 0x1>, len
= -1557393520, maxlen = 32764, cursor = -1557393616}
nblocks = 2
blkno = <optimized out>
tuple = {t_len = 144, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 1},
ip_posid = 26}, t_tableOid = 1260, t_data = 0x7f2b7bc5e6b0}
initprog_index = {0, 1, 5}
#3 lazy_vacuum_rel (onerel=onerel@entry=0x555b64029498,
options=options@entry=1, params=params@entry=0x7ffca32c11b0,
bstrategy=<optimized out>)
at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:253
vacrelstats = <optimized out>
Irel = 0x555b64095948
nindexes = 2
ru0 = {tv = {tv_sec = 93850993708032, tv_usec = 23936}, ru =
{ru_utime = {tv_sec = 128, tv_usec = 93851007694584}, ru_stime = {tv_sec =
0, tv_usec = 0}, {ru_maxrss = 140723046059424,
__ru_maxrss_word = 140723046059424}, {ru_ixrss =
93850989549501, __ru_ixrss_word = 93850989549501}, {ru_idrss = 16,
__ru_idrss_word = 16}, {ru_isrss = 93851007694584,
__ru_isrss_word = 93851007694584}, {ru_minflt =
140723046059472, __ru_minflt_word = 140723046059472}, {ru_majflt =
93850989559601, __ru_majflt_word = 93850989559601}, {
ru_nswap = 93850994010032, __ru_nswap_word = 93850994010032},
{ru_inblock = 139833259949608, __ru_inblock_word = 139833259949608},
{ru_oublock = 93850994010032,
__ru_oublock_word = 93850994010032}, {ru_msgsnd =
93851008239032, __ru_msgsnd_word = 93851008239032}, {ru_msgrcv =
140723046059904, __ru_msgrcv_word = 140723046059904}, {
ru_nsignals = 93850989313526, __ru_nsignals_word =
93850989313526}, {ru_nvcsw = 345176855416, __ru_nvcsw_word = 345176855416},
{ru_nivcsw = 140723046059551,
__ru_nivcsw_word = 140723046059551}}}
starttime = 0
secs = 1527175061
usecs = 144
read_rate = <optimized out>
write_rate = <optimized out>
aggressive = 0 '\000'
scanned_all_unfrozen = <optimized out>
xidFullScanLimit = 3440221679
mxactFullScanLimit = 4204418425
new_rel_pages = <optimized out>
new_rel_tuples = <optimized out>
new_rel_allvisible = 2737573328
new_live_tuples = <optimized out>
new_frozen_xid = <optimized out>
new_min_multi = <optimized out>
__func__ = "lazy_vacuum_rel"
#4 0x0000555b62bfa54d in vacuum_rel (relid=relid@entry=1260,
relation=relation@entry=0x555b64047100, options=options@entry=1,
params=params@entry=0x7ffca32c11b0)
at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:1391
lmode = 4
onerel = 0x555b64029498
onerelid = {relId = 1260, dbId = 0}
toast_relid = 0
save_userid = 10
save_sec_context = 0
save_nestlevel = 2
__func__ = "vacuum_rel"
#5 0x0000555b62bfb545 in vacuum (options=1, relation=0x555b64047100,
relid=relid@entry=0, params=params@entry=0x7ffca32c11b0, va_cols=0x0,
bstrategy=<optimized out>, bstrategy@entry=0x0,
isTopLevel=1 '\001') at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:303
relid = 1260
cur = 0x555b640998b8
save_exception_stack = 0x7ffca32c1590
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {93851008735448,
-4513934124599105488, 93850994009984, 93851008619616, 93851008619616,
93851008398480, -4513934124655728592, -7776366411743385552},
__mask_was_saved = 0, __saved_mask = {__val = {0, 0, 0, 0, 1,
1, 1, 3473119737632784380, 139833388464563, 140723046060384, 12,
140723046060912, 93850990837571,
4049721212982268568, 1, 140723046060432}}}}
stmttype = <optimized out>
in_outer_xact = 0 '\000'
use_own_xacts = 1 '\001'
relations = 0x555b640998d8
in_vacuum = 1 '\001'
__func__ = "vacuum"
#6 0x0000555b62bfb8ca in ExecVacuum (vacstmt=vacstmt@entry=0x555b64047150,
isTopLevel=isTopLevel@entry=1 '\001')
at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:121
params = {freeze_min_age = -1, freeze_table_age = -1,
multixact_freeze_min_age = -1, multixact_freeze_table_age = -1,
is_wraparound = 0 '\000', log_min_duration = -1}
#7 0x0000555b62d27787 in standard_ProcessUtility
(parsetree=0x555b64047150, queryString=0x555b64046688 "vacuum
pg_catalog.pg_authid;", context=<optimized out>, params=0x0,
dest=0x555b64047490, completionTag=0x7ffca32c19c0 "") at
/build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/tcop/utility.c:655
stmt = 0x555b64047150
isTopLevel = <optimized out>
__func__ = "standard_ProcessUtility"
--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/ <http://www.postgresql-consulting.com/>
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"
Hmm .. surely
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..5d7fa1fb45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
xid = HeapTupleHeaderGetXmin(tuple);
xmin_frozen = ((xid == FrozenTransactionId) ||
HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))
{
if (TransactionIdPrecedes(xid, relfrozenxid))
ereport(ERROR,
??
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
Hmm .. surely
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..5d7fa1fb45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
xid = HeapTupleHeaderGetXmin(tuple);
xmin_frozen = ((xid == FrozenTransactionId) ||
HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))
{
if (TransactionIdPrecedes(xid, relfrozenxid))
ereport(ERROR,

??
I don't think that's necessary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).
Greetings,
Andres Freund
On 2018-05-24 13:30:54 -0700, Andres Freund wrote:
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
Hmm .. surely
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..5d7fa1fb45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
xid = HeapTupleHeaderGetXmin(tuple);
xmin_frozen = ((xid == FrozenTransactionId) ||
HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))
{
if (TransactionIdPrecedes(xid, relfrozenxid))
ereport(ERROR,

??
I don't think that's necessary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).
FWIW, even if that weren't the case: a) there'd be a lot more wrong with
this routine imo. b) some of the tuples affected clearly weren't
frozen...
Greetings,
Andres Freund
On 2018-May-24, Andres Freund wrote:
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
Hmm .. surely
xid = HeapTupleHeaderGetXmin(tuple);
xmin_frozen = ((xid == FrozenTransactionId) ||
HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))
I don't think that's necessary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).
Ah, yeah ... I probably thought about this when writing it and removed
it for that reason.
BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
confusing, by failing to return true if the xmin is numerically
FrozenXid (which it'll be if the database was pg_upgraded). I wonder
about this one in HeapTupleSatisfiesMVCC:
else
{
/* xmin is committed, but maybe not according to our snapshot */
if (!HeapTupleHeaderXminFrozen(tuple) &&
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false; /* treat as still in progress */
}
I think this is not a bug only because XidInMVCCSnapshot does this
/* Any xid < xmin is not in-progress */
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
which makes it return false for FrozenXid, but seems more of an accident
than explicitly designed.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-May-24, Andres Freund wrote:
FWIW, even if that weren't the case: a) there'd be a lot more wrong with
this routine imo. b) some of the tuples affected clearly weren't
frozen...
Right.
BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..? Maybe the fact that they have
separate relfrozenxid (!?) in different databases?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
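Since pg_class itself is not shared, every database carries its own
relfrozenxid and relminmxid row for the shared catalogs; a sketch for
checking whether they diverge is to run something like this in each
database in turn:

SELECT current_database(), relname, relfrozenxid, relminmxid
FROM pg_class
WHERE relname IN ('pg_authid', 'pg_auth_members');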
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..? Maybe the fact that they have
separate relfrozenxid (!?) in different databases?
Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.
Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
Greetings,
Andres Freund
BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..? Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
We have had this problem twice and both times on both pg_authid and
pg_auth_members. Thanks,
Jeremy
On 2018-May-24, Andres Freund wrote:
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..? Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.
Hmm ... is that because they read the values on different databases?
Are you referring to the reports by Maxim Boguk? I see one value from
template1, another value from template0.
Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
Ah, so deleting the relcache file makes the problem go away? That's
definitely pretty strange. I see no reason for the value in relcache to
become out of step with the catalogued value in the same database ... I
don't think we transmit values of one database to another in any way.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
On 2018-May-24, Andres Freund wrote:
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..? Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.

Hmm ... is that because they read the values on different databases?
Are you referring to the reports by Maxim Boguk? I see one value from
template1, another value from template0.
I was referring to
/messages/by-id/20180522193007.4bi5oluqb7c72oq2@alap3.anarazel.de
but you're right, it's possible that that's just caused by time passing
or different databases.
Greetings,
Andres Freund
Hi,
Moving discussion to -hackers. Tom, I think you worked most with this
code, your input would be appreciated.
Original discussion is around:
http://archives.postgresql.org/message-id/20180524211311.tnswfnjwnii54htx%40alvherre.pgsql
On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
On 2018-May-24, Andres Freund wrote:
Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com

Ah, so deleting the relcache file makes the problem go away? That's
definitely pretty strange. I see no reason for the value in relcache to
become out of step with the catalogued value in the same database ... I
don't think we transmit values of one database to another in any way.
I can reproduce the issue. As far as I can tell we just don't ever
actually update nailed relcache entries in the normal course, leaving
the "physical address" aside. VACUUM will, via
vac_update_relstats() -> heap_inplace_update() -> CacheInvalidateHeapTuple(),
send out an invalidation. But invalidation processing, in my case in
another session, will essentially ignore most of that due to:
static void
RelationClearRelation(Relation relation, bool rebuild)
...
/*
* Never, never ever blow away a nailed-in system relation, because we'd
* be unable to recover. However, we must redo RelationInitPhysicalAddr
* in case it is a mapped relation whose mapping changed.
*
* If it's a nailed-but-not-mapped index, then we need to re-read the
* pg_class row to see if its relfilenode changed. We do that immediately
* if we're inside a valid transaction and the relation is open (not
* counting the nailed refcount). Otherwise just mark the entry as
* possibly invalid, and it'll be fixed when next opened.
*/
if (relation->rd_isnailed)
{
RelationInitPhysicalAddr(relation);
if (relation->rd_rel->relkind == RELKIND_INDEX ||
relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
{
relation->rd_isvalid = false; /* needs to be revalidated */
if (relation->rd_refcnt > 1 && IsTransactionState())
RelationReloadIndexInfo(relation);
}
return;
}
Which basically means that once running we'll never update the relcache
data for nailed entries. That's unproblematic for most relcache fields,
but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.
This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
vacuums despite being required. And, as reported in this thread, it'll lead
to wrong errors being raised during vacuum because relfrozenxid is just
some random value from the past. I suspect this might also be
co-responsible for a bunch of planning issues for queries involving the
catalog, because the planner will use wrong relcache data until the next
time the init file is thrown away?
This looks like a very longstanding bug to me. I'm not yet quite sure
what the best way to deal with this is. I suspect we might get away
with just looking up a new version of the pg_class tuple and copying
rd_rel over?
Greetings,
Andres Freund
On 2018-05-24 16:46:24 -0400, Alvaro Herrera wrote:
On 2018-May-24, Andres Freund wrote:
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
Hmm .. surely
xid = HeapTupleHeaderGetXmin(tuple);
xmin_frozen = ((xid == FrozenTransactionId) ||
HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))

I don't think that's necessary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).

Ah, yeah ... I probably thought about this when writing it and removed
it for that reason.

BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
confusing, by failing to return true if the xmin is numerically
FrozenXid (which it'll be if the database was pg_upgraded). I wonder
about this one in HeapTupleSatisfiesMVCC:
I suggest raising this on -hackers. I agree that it's unfortunate.
Greetings,
Andres Freund
On Fri, May 25, 2018 at 1:38 PM, Andres Freund <andres@anarazel.de> wrote:
BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
confusing, by failing to return true if the xmin is numerically
FrozenXid (which it'll be if the database was pg_upgraded). I wonder
about this one in HeapTupleSatisfiesMVCC:

I suggest raising this on -hackers. I agree that it's unfortunate.
I wonder if BootstrapTransactionId also needs to be considered here.
--
Peter Geoghegan
Andres Freund <andres@anarazel.de> writes:
Moving discussion to -hackers. Tom, I think you worked most with this
code, your input would be appreciated.
Yeah, the assumption in the relcache is that the only part of a nailed
catalog's relcache entry that really needs to be updated intrasession is
the relfilenode mapping. For nailed indexes, we allow updating of some
additional fields, and I guess what has to happen here is that we teach
the code to update some additional fields for nailed tables too.
regards, tom lane
On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Moving discussion to -hackers. Tom, I think you worked most with this
code, your input would be appreciated.

Yeah, the assumption in the relcache is that the only part of a nailed
catalog's relcache entry that really needs to be updated intrasession is
the relfilenode mapping.
Paging through the changes to relcache.c and vacuum[lazy].c it looks to
me like that hasn't been true in a long time, right?
For nailed indexes, we allow updating of some additional fields, and I
guess what has to happen here is that we teach the code to update some
additional fields for nailed tables too.
Yea, it seems like we could just get a new version of the pg_class tuple
if in the right state, and memcpy() it into place. Not sure if there are
any other issues...
BTW, and I guess this mostly goes to Alvaro, I don't understand why that
code accepts relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX?
That seems like something we'll hopefully never support.
Greetings,
Andres Freund
On 2018-05-25 15:05:31 -0700, Andres Freund wrote:
On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
For nailed indexes, we allow updating of some additional fields, and I
guess what has to happen here is that we teach the code to update some
additional fields for nailed tables too.

Yea, it seems like we could just get a new version of the pg_class tuple
if in the right state, and memcpy() it into place. Not sure if there are
any other issues...
That part isn't too hard. I've a patch that appears to address the
issue, and isn't *too* ugly.
We don't really have a way to force .init file removal / update for
shared relations however. Otherwise we'll just continue to read old data
from .init files at startup. And there'll commonly not be any
outstanding invalidation. Thus it appears to me that we need to extend
RelcacheInitFileInval to also support the shared file. That's WAL
logged, but it looks like we can just add a flag like
XACT_COMPLETION_UPDATE_RELCACHE_FILE without breaking the WAL format.
Does anybody see a way to not have to remove the .init file?
Greetings,
Andres Freund
On 2018-05-26 13:45:06 -0700, Andres Freund wrote:
On 2018-05-25 15:05:31 -0700, Andres Freund wrote:
On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
For nailed indexes, we allow updating of some additional fields, and I
guess what has to happen here is that we teach the code to update some
additional fields for nailed tables too.

Yea, it seems like we could just get a new version of the pg_class tuple
if in the right state, and memcpy() it into place. Not sure if there are
any other issues...

That part isn't too hard. I've a patch that appears to address the
issue, and isn't *too* ugly.

We don't really have a way to force .init file removal / update for
shared relations however. Otherwise we'll just continue to read old data
from .init files at startup. And there'll commonly not be any
outstanding invalidation. Thus it appears to me that we need to extend
RelcacheInitFileInval to also support the shared file. That's WAL
logged, but it looks like we can just add a flag like
XACT_COMPLETION_UPDATE_RELCACHE_FILE without breaking the WAL format.

Does anybody see a way to not have to remove the .init file?
Just to be clear: We already remove the non-shared relcache init file
when a non-shared table in it is changed, which I presume is the reason
this issue hasn't bitten us in a much bigger way. While the lack of
proper invalidations means that already running sessions will see the
wrong values and make wrong decisions, the fact that the non-shared file
will regularly be removed has reduced the impact quite a bit.
Greetings,
Andres Freund
On May 26, 2018, at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
Does anybody see a way to not have to remove the .init file?
How about only keeping the critical info for being able to find relations
in the .init files, and then fully populating the cache by doing a normal
lookup? Since there are multiple entries needed to bootstrap things, I
guess there'd need to be a flag in relcache to indicate that an existing
entry wasn't fully formed.
On May 27, 2018 9:39:49 AM PDT, "Nasby, Jim" <nasbyj@amazon.com> wrote:
On May 26, 2018, at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
Does anybody see a way to not have to remove the .init file?
How about only keeping the critical info for being able to find
relations in the .init files, and then fully populating the cache by
doing a normal lookup? Since there are multiple entries needed to
bootstrap things, I guess there'd need to be a flag in relcache to
indicate that an existing entry wasn't fully formed.
Then the cache wouldn't have any benefits, no? It's been a while, but last time I checked it does make quite a measurable performance difference in a new backend.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
On May 27, 2018 9:39:49 AM PDT, "Nasby, Jim" <nasbyj@amazon.com> wrote:
How about only keeping the critical info for being able to find
relations in the .init files, and then fully populate the cache by
doing a normal lookup?
Then the cache wouldn't have any benefits, no? It's been a while, but last time I checked it does make quite a measurable performance difference in a new backend.
Yeah, we don't want to lose the performance benefit. But I don't think
there's any need for special magic here: we just have to accept the fact
that there's a need to flush that cache sometimes. In normal use it
shouldn't happen often enough to be a performance problem.
FWIW, I'm not on board with "memcpy the whole row". I think the right
thing is more like what we do in RelationReloadIndexInfo, ie copy over
the specific fields that we expect to be mutable.
regards, tom lane
Hi,
On 2018-05-27 13:22:21 -0400, Tom Lane wrote:
But I don't think there's any need for special magic here: we just
have to accept the fact that there's a need to flush that cache
sometimes. In normal use it shouldn't happen often enough to be a
performance problem.
Yea, it's not that problematic. We already remove the local init
file. I started out trying to write a version of invalidation that also
WAL logs shared inval, but that turns out to be hard to do without
breaking compatibility. So I think we should just always unlink the
shared one - that seems to work well.
FWIW, I'm not on board with "memcpy the whole row". I think the right
thing is more like what we do in RelationReloadIndexInfo, ie copy over
the specific fields that we expect to be mutable.
But that's what RelationReloadIndexInfo() etc do?
relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
I don't think we need to modify anything outside of rd_rel at this
point?
I've a patch that seems to work, that mostly needs some comment
polishing.
Greetings,
Andres Freund
Hi,
We were working on this issue and wondering whether we could actually make
the pg_class (rd_rel) part of the relcache entry upgradable.
To achieve this we can allocate the Form_pg_class structures (for shared
relations... a small number) in shared memory.
We would not need the global pg_internal.init file, as a new backend during
bootup would be set to point at the already stored Form_pg_class structure.
Thanks,
Nishant
Hi,
On 2018-05-27 13:00:06 -0700, Andres Freund wrote:
I've a patch that seems to work, that mostly needs some comment
polishing.
Attached is what I currently have. Still needs some more work, but I
think it's more than good enough to review the approach. Basically the
approach consists out of two changes:
1) Send init file removals for shared nailed relations as well.
This fixes that the shared init file contains arbitrarily outdated
information for relfrozenxid etc. Leading to e.g. the pg_authid
errors we've seen in some recent threads. Only applies to
new connections.
2) Reread RelationData->rd_rel for nailed relations when invalidated.
This ensures that already built relcache entries for nailed relations
are updated. Currently they never are. This currently doesn't cause
*that* frequently an issue for !shared entries, because for those the
init file gets zapped regularly, and autovacuum workers usually don't
live that long. But it's still a significant correctness issue for
both shared an non shared relations.
FWIW, I wonder if this isn't critical enough to make us consider having
a point release earlier..
Greetings,
Andres Freund
Attachments:
0001-WIP-Ensure-relcache-entries-for-nailed-relations-are.patch (text/x-diff; charset=us-ascii)
From c7121eec1e76b6f9b2cfa5d7d7f5f4b7a4f8be96 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 May 2018 12:38:22 -0700
Subject: [PATCH] WIP: Ensure relcache entries for nailed relations are
accurate.
Author: Andres Freund
Reviewed-By:
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com
https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com
Backpatch:
---
src/backend/utils/cache/inval.c | 31 +++--
src/backend/utils/cache/relcache.c | 198 ++++++++++++++++++++---------
src/include/storage/standbydefs.h | 2 +-
3 files changed, 156 insertions(+), 75 deletions(-)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 955f5c7fdc5..0d6100fb082 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -521,12 +521,11 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
(void) GetCurrentCommandId(true);
/*
- * If the relation being invalidated is one of those cached in the local
- * relcache init file, mark that we need to zap that file at commit. Same
- * is true when we are invalidating whole relcache.
+ * If the relation being invalidated is one of those cached in a relcache
+ * init file, mark that we need to zap that file at commit. Same is true
+ * when we are invalidating whole relcache.
*/
- if (OidIsValid(dbId) &&
- (RelationIdIsInInitFile(relId) || relId == InvalidOid))
+ if (relId == InvalidOid || RelationIdIsInInitFile(relId))
transInvalInfo->RelcacheInitFileInval = true;
}
@@ -881,18 +880,26 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
if (RelcacheInitFileInval)
{
+ elog(trace_recovery(DEBUG4), "removing relcache init files for database %u",
+ dbid);
+
/*
- * RelationCacheInitFilePreInvalidate requires DatabasePath to be set,
- * but we should not use SetDatabasePath during recovery, since it is
+ * RelationCacheInitFilePreInvalidate, when the invalidation message
+ * is for a specific database, requires DatabasePath to be set, but we
+ * should not use SetDatabasePath during recovery, since it is
* intended to be used only once by normal backends. Hence, a quick
* hack: set DatabasePath directly then unset after use.
*/
- DatabasePath = GetDatabasePath(dbid, tsid);
- elog(trace_recovery(DEBUG4), "removing relcache init file in \"%s\"",
- DatabasePath);
+ if (OidIsValid(dbid))
+ DatabasePath = GetDatabasePath(dbid, tsid);
+
RelationCacheInitFilePreInvalidate();
- pfree(DatabasePath);
- DatabasePath = NULL;
+
+ if (OidIsValid(dbid))
+ {
+ pfree(DatabasePath);
+ DatabasePath = NULL;
+ }
}
SendSharedInvalidMessages(msgs, nmsgs);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3dfb1b8fbe2..aa4427724d5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -250,6 +250,7 @@ static void RelationDestroyRelation(Relation relation, bool remember_tupdesc);
static void RelationClearRelation(Relation relation, bool rebuild);
static void RelationReloadIndexInfo(Relation relation);
+static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
static void AtEOXact_cleanup(Relation relation, bool isCommit);
@@ -286,7 +287,7 @@ static void IndexSupportInitialize(oidvector *indclass,
static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
StrategyNumber numSupport);
static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
-static void unlink_initfile(const char *initfilename);
+static void unlink_initfile(const char *initfilename, int elevel);
static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
PartitionDesc partdesc2);
@@ -1931,7 +1932,16 @@ RelationIdGetRelation(Oid relationId)
RelationReloadIndexInfo(rd);
else
RelationClearRelation(rd, true);
- Assert(rd->rd_isvalid);
+
+ /*
+ * Normally entries need to be valid here, but before the relcache
+ * has been initialized, not enough infrastructure exists to
+ * perform pg_class lookups. The structure of such entries doesn't
+ * change, but we still want to update the rd_rel entry. So
+ * rd_isvalid = false is left in place for a later lookup.
+ */
+ Assert(rd->rd_isvalid ||
+ (rd->rd_isnailed && !criticalRelcachesBuilt));
}
return rd;
}
@@ -2135,6 +2145,92 @@ RelationReloadIndexInfo(Relation relation)
relation->rd_isvalid = true;
}
+/*
+ * RelationReloadNailed - reload minimal information for nailed relations.
+ *
+ * The structure of a nailed relation can never change (which is good, because
+ * we rely on knowing their structure to be able to read catalog content). But
+ * some parts, e.g. pg_class.relfrozenxid, are still important to have
+ * accurate content for. Therefore those need to be reloaded upon arrival of
+ * invalidations.
+ */
+static void
+RelationReloadNailed(Relation relation)
+{
+ Assert(relation->rd_isnailed);
+
+ /*
+ * Redo RelationInitPhysicalAddr in case it is a mapped relation whose
+ * mapping changed.
+ */
+ RelationInitPhysicalAddr(relation);
+
+ if (relation->rd_rel->relkind == RELKIND_INDEX)
+ {
+ /*
+ * If it's a nailed-but-not-mapped index, then we need to re-read the
+ * pg_class row to see if its relfilenode changed. We do that immediately
+ * if we're inside a valid transaction and the relation is open (not
+ * counting the nailed refcount). Otherwise just mark the entry as
+ * possibly invalid, and it'll be fixed when next opened.
+ */
+ relation->rd_isvalid = false; /* needs to be revalidated */
+
+ if (relation->rd_refcnt > 1 && IsTransactionState())
+ RelationReloadIndexInfo(relation);
+ }
+ else
+ {
+ /*
+ * If the relcache infrastructure has been bootstrapped, and we're in
+ * a transaction, reload the pg_class part of the relcache entry. We
+ * can't easily do so if relcaches aren't yet built, but that's fine
+ * because at that stage the to-be-reloaded attributes (like
+ * relfrozenxid) aren't accessed. To ensure the entry will later be
+ * revalidated, we leave it in invalid state, but allow use
+ * (cf. RelationIdGetRelation()).
+ *
+ * We do that immediately if we're inside a valid transaction and the
+ * relation is open (not counting the nailed refcount). Otherwise just
+ * mark the entry as possibly invalid, and it'll be fixed when next
+ * opened.
+ */
+ if (criticalRelcachesBuilt && relation->rd_refcnt > 1 && IsTransactionState())
+ {
+ HeapTuple pg_class_tuple;
+ Form_pg_class relp;
+
+ /*
+ * NB: Mark the entry as valid before starting to scan, to avoid
+ * self-recursion when looking up pg_class.
+ */
+ relation->rd_isvalid = true;
+
+ pg_class_tuple = ScanPgRelation(RelationGetRelid(relation),
+ true,
+ false);
+ relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
+ memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
+ heap_freetuple(pg_class_tuple);
+
+ /*
+ * Again mark as valid, to protect against concurrently arriving
+ * invalidations.
+ */
+ relation->rd_isvalid = true;
+ }
+ else
+ {
+ /*
+ * If currently the entry can't be updated, leave it as invalid. Note
+ * that we'll allow access to nailed entries marked invalid, as the
+ * structure didn't change.
+ */
+ relation->rd_isvalid = false;
+ }
+ }
+}
+
/*
* RelationDestroyRelation
*
@@ -2250,27 +2346,12 @@ RelationClearRelation(Relation relation, bool rebuild)
RelationCloseSmgr(relation);
/*
- * Never, never ever blow away a nailed-in system relation, because we'd
- * be unable to recover. However, we must redo RelationInitPhysicalAddr
- * in case it is a mapped relation whose mapping changed.
- *
- * If it's a nailed-but-not-mapped index, then we need to re-read the
- * pg_class row to see if its relfilenode changed. We do that immediately
- * if we're inside a valid transaction and the relation is open (not
- * counting the nailed refcount). Otherwise just mark the entry as
- * possibly invalid, and it'll be fixed when next opened.
+ * Treat nailed-in system relations separately, they always need to be
+ * accessible, so we can't blow them away.
*/
if (relation->rd_isnailed)
{
- RelationInitPhysicalAddr(relation);
-
- if (relation->rd_rel->relkind == RELKIND_INDEX ||
- relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
- {
- relation->rd_isvalid = false; /* needs to be revalidated */
- if (relation->rd_refcnt > 1 && IsTransactionState())
- RelationReloadIndexInfo(relation);
- }
+ RelationReloadNailed(relation);
return;
}
@@ -5907,24 +5988,26 @@ write_item(const void *data, Size len, FILE *fp)
/*
* Determine whether a given relation (identified by OID) is one of the ones
- * we should store in the local relcache init file.
+ * we should store in a relcache init file.
*
* We must cache all nailed rels, and for efficiency we should cache every rel
* that supports a syscache. The former set is almost but not quite a subset
- * of the latter. Currently, we must special-case TriggerRelidNameIndexId,
- * which RelationCacheInitializePhase3 chooses to nail for efficiency reasons,
- * but which does not support any syscache.
- *
- * Note: this function is currently never called for shared rels. If it were,
- * we'd probably also need a special case for DatabaseNameIndexId, which is
- * critical but does not support a syscache.
+ * of the latter. The special cases are relations where
+ * RelationCacheInitializePhase2/3 chooses to nail for efficiency reasons, but
+ * which do not support any syscache.
*/
bool
RelationIdIsInInitFile(Oid relationId)
{
- if (relationId == TriggerRelidNameIndexId)
+ if (relationId == SharedSecLabelRelationId ||
+ relationId == TriggerRelidNameIndexId ||
+ relationId == DatabaseNameIndexId ||
+ relationId == SharedSecLabelObjectIndexId)
{
- /* If this Assert fails, we don't need this special case anymore. */
+ /*
+ * If this Assert fails, we don't need the applicable special case
+ * anymore.
+ */
Assert(!RelationSupportsSysCache(relationId));
return true;
}
@@ -5994,38 +6077,30 @@ RelationHasUnloggedIndex(Relation rel)
* We take the lock and do the unlink in RelationCacheInitFilePreInvalidate,
* then release the lock in RelationCacheInitFilePostInvalidate. Caller must
* send any pending SI messages between those calls.
- *
- * Notice this deals only with the local init file, not the shared init file.
- * The reason is that there can never be a "significant" change to the
- * relcache entry of a shared relation; the most that could happen is
- * updates of noncritical fields such as relpages/reltuples. So, while
- * it's worth updating the shared init file from time to time, it can never
- * be invalid enough to make it necessary to remove it.
*/
void
RelationCacheInitFilePreInvalidate(void)
{
- char initfilename[MAXPGPATH];
+ char localinitfname[MAXPGPATH];
+ char sharedinitfname[MAXPGPATH];
- snprintf(initfilename, sizeof(initfilename), "%s/%s",
- DatabasePath, RELCACHE_INIT_FILENAME);
+ if (DatabasePath)
+ snprintf(localinitfname, sizeof(localinitfname), "%s/%s",
+ DatabasePath, RELCACHE_INIT_FILENAME);
+ snprintf(sharedinitfname, sizeof(sharedinitfname), "global/%s",
+ RELCACHE_INIT_FILENAME);
LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
- if (unlink(initfilename) < 0)
- {
- /*
- * The file might not be there if no backend has been started since
- * the last removal. But complain about failures other than ENOENT.
- * Fortunately, it's not too late to abort the transaction if we can't
- * get rid of the would-be-obsolete init file.
- */
- if (errno != ENOENT)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove cache file \"%s\": %m",
- initfilename)));
- }
+ /*
+ * The files might not be there if no backend has been started since the
+ * last removal. But complain about failures other than ENOENT with
+ * ERROR. Fortunately, it's not too late to abort the transaction if we
+ * can't get rid of the would-be-obsolete init file.
+ */
+ if (DatabasePath)
+ unlink_initfile(localinitfname, ERROR);
+ unlink_initfile(sharedinitfname, ERROR);
}
void
@@ -6051,13 +6126,9 @@ RelationCacheInitFileRemove(void)
struct dirent *de;
char path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY)];
- /*
- * We zap the shared cache file too. In theory it can't get out of sync
- * enough to be a problem, but in data-corruption cases, who knows ...
- */
snprintf(path, sizeof(path), "global/%s",
RELCACHE_INIT_FILENAME);
- unlink_initfile(path);
+ unlink_initfile(path, LOG);
/* Scan everything in the default tablespace */
RelationCacheInitFileRemoveInDir("base");
@@ -6097,7 +6168,7 @@ RelationCacheInitFileRemoveInDir(const char *tblspcpath)
/* Try to remove the init file in each database */
snprintf(initfilename, sizeof(initfilename), "%s/%s/%s",
tblspcpath, de->d_name, RELCACHE_INIT_FILENAME);
- unlink_initfile(initfilename);
+ unlink_initfile(initfilename, LOG);
}
}
@@ -6105,12 +6176,15 @@ RelationCacheInitFileRemoveInDir(const char *tblspcpath)
}
static void
-unlink_initfile(const char *initfilename)
+unlink_initfile(const char *initfilename, int elevel)
{
if (unlink(initfilename) < 0)
{
/* It might not be there, but log any error other than ENOENT */
if (errno != ENOENT)
- elog(LOG, "could not remove cache file \"%s\": %m", initfilename);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove cache file \"%s\": %m",
+ initfilename)));
}
}
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index bb6144805a7..17b74ca3b61 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -64,7 +64,7 @@ typedef struct xl_invalidations
{
Oid dbId; /* MyDatabaseId */
Oid tsId; /* MyDatabaseTableSpace */
- bool relcacheInitFileInval; /* invalidate relcache init file */
+ bool relcacheInitFileInval; /* invalidate relcache init files */
int nmsgs; /* number of shared inval msgs */
SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
} xl_invalidations;
--
2.17.0.rc1.dirty
Hi,
(please don't top post)
On 2018-05-28 15:07:52 +0000, Nishant, Fnu wrote:
We were working on this issue and thinking about whether we could actually
make the pg_class (rd_rel) part of the relcache entry updatable.
Right, that's necessary. See the patch I just sent.
To achieve this we can allocate Form_pg_class structures (for shared
relations… a small number) on shared memory.
But why would this be necessary / a good idea? Even if we decided it
were, it seems like it'd end up being quite invasive. But I doubt it's
a good plan, because relcache entries want / need to be updated
differently in the transaction that does the changes (as it needs to see
the effect of catalog changes before commit) than other sessions (which
only should see them after commit).
Greetings,
Andres Freund
Hi,
To achieve this we can allocate Form_pg_class structures (for shared
relations… a small number) on shared memory.
But why would this be necessary / a good idea? Even if we decided it
were, it seems like it'd end up being quite invasive. But I doubt it's
a good plan, because relcache entries want / need to be updated
differently in the transaction that does the changes (as it needs to see
the effect of catalog changes before commit) than other sessions (which
only should see them after commit).
It would be a good idea because we can avoid maintaining the init file
(creation/deletion/update) for the period the engine is running, and we do
not need to invalidate other backends' caches.
For a transaction which changes the catalog, we can allocate a new private
Form_pg_class structure, do operations on it, and update the shared structure
post commit.
The routine would roughly be:
1) A backend has a relcache entry for a shared rel whose rd_rel part points to
a shared memory structure.
Rel->rd_rel stores a shared memory pointer.
2) To update the entry, it allocates a new private structure, memcpys the
shared content into it, and points rd_rel to the private memory.
Rel->rd_rel now stores a pointer to a process-specific structure.
3) Once the transaction commits, we copy the private memory content back to
the shared memory area and point rd_rel to shared memory again.
4) Free the private memory.
5) Other backends do not do any invalidation but still get the latest updated
values.
Why is this a good idea?
Let's take an example (assuming we have 1000 postgres backends running).
With the shared memory scheme:
Operation-wise, for a transaction, we allocate/free once (private memory allocation) and memcpy data to and fro (from shared to private and back to shared)...
Overall memory footprint: 1 shared copy, and 1 private copy only when updating.
No file creation/deletion/update.
With the current file scheme:
Operation-wise, for a transaction, we use the private cache, but we need to invalidate 1000 other caches (at least 1000 memcpys and allocate/frees) and may involve reading pages of pg_class back in.
Overall memory footprint: 1000 private copies.
We have to create/delete/update the init file and synchronize around it.
Having said that, we may not need to worry about a transaction for updating all values... (I think relfrozenxid can be updated by a CAS operation; still thinking on it).
-Nishant
On 2018-05-29 18:06:12 +0000, Nishant, Fnu wrote:
Hi,
To achieve this we can allocate Form_pg_class structures (for shared
relations… a small number) on shared memory.

But why would this be necessary / a good idea? Even if we decided it
were, it seems like it'd end up being quite invasive. But I doubt it's
a good plan, because relcache entries want / need to be updated
differently in the transaction that does the changes (as it needs to see
the effect of catalog changes before commit) than other sessions (which
only should see them after commit).

It would be a good idea because we can avoid maintaining the init file
(creation/deletion/update) for the period the engine is running, and we do
not need to invalidate other backends' caches.
a) That's a major change. Shouldn't be considered for a bugfix.
b) This is going to have locking issues / lock contention.
Why is this a good idea?
Let's take an example (assuming we have 1000 postgres backends running).
With the shared memory scheme:
Operation-wise, for a transaction, we allocate/free once (private memory allocation) and memcpy data to and fro (from shared to private and back to shared)...
Overall memory footprint: 1 shared copy, and 1 private copy only when updating.
No file creation/deletion/update.
I don't buy that nailed relations are a meaningful part of that
problem. They hardly ever change. And a shared cache is a much bigger
issue.
Greetings,
Andres Freund
I added an Assert(DatabasePath != NULL) to
RelationCacheInitFilePreInvalidate() and didn't see a single crash when
running the tests. I thought that adding a "VACUUM FREEZE pg_class"
in the recovery tests (where there is a standby) ought to do it, but it
doesn't. What's the deal there?
Here are some proposed changes. Some of these comment edits are WIP :-)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
fixup.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 0d6100fb08..8a8620943f 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -872,6 +872,8 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
int nmsgs, bool RelcacheInitFileInval,
Oid dbid, Oid tsid)
{
+ Assert(InRecovery);
+
if (nmsgs <= 0)
return;
@@ -884,12 +886,13 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
dbid);
/*
- * RelationCacheInitFilePreInvalidate, when the invalidation message
- * is for a specific database, requires DatabasePath to be set, but we
- * should not use SetDatabasePath during recovery, since it is
- * intended to be used only once by normal backends. Hence, a quick
- * hack: set DatabasePath directly then unset after use.
+ * When the invalidation message is for a specific database,
+ * RelationCacheInitFilePreInvalidate requires DatabasePath to be set,
+ * but we're not allowed to use SetDatabasePath during recovery (since
+ * it is intended to be used by normal backends). Hence, a quick hack:
+ * set DatabasePath directly then unset after use.
*/
+ Assert(!DatabasePath); /* don't clobber an existing value */
if (OidIsValid(dbid))
DatabasePath = GetDatabasePath(dbid, tsid);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index aa4427724d..71b2212cbd 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1934,6 +1934,11 @@ RelationIdGetRelation(Oid relationId)
RelationClearRelation(rd, true);
/*
+ * Normal entries are valid by now -- except nailed ones when
+ * loaded before relcache initialization. There isn't enough
+ * infrastructure yet to do pg_class lookups, but we need their
+ * rd_rel entries to be updated, so we let these through.
+ */
* Normally entries need to be valid here, but before the relcache
* has been initialized, not enough infrastructure exists to
* perform pg_class lookups. The structure of such entries doesn't
@@ -2346,8 +2351,7 @@ RelationClearRelation(Relation relation, bool rebuild)
RelationCloseSmgr(relation);
/*
- * Treat nailed-in system relations separately, they always need to be
- * accessible, so we can't blow them away.
+ * We cannot blow away nailed-in relations, so treat them especially.
*/
if (relation->rd_isnailed)
{
@@ -5942,7 +5946,8 @@ write_relcache_init_file(bool shared)
* wrote out was up-to-date.)
*
* This mustn't run concurrently with the code that unlinks an init file
- * and sends SI messages, so grab a serialization lock for the duration.
+ * and sends SI messages (see RelationCacheInitFilePreInvalidate), so grab
+ * a serialization lock for the duration.
*/
LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
@@ -6061,6 +6066,10 @@ RelationHasUnloggedIndex(Relation rel)
* changed one or more of the relation cache entries that are kept in the
* local init file.
*
+ * When DatabasePath is set, both the init file for that database and the
+ * shared (global) init files are to be removed; otherwise only the latter is.
+ * This is useful during recovery (XXX really?)
+ *
* To be safe against concurrent inspection or rewriting of the init file,
* we must take RelCacheInitLock, then remove the old init file, then send
* the SI messages that include relcache inval for such relations, and then
@@ -6180,9 +6189,9 @@ unlink_initfile(const char *initfilename, int elevel)
{
if (unlink(initfilename) < 0)
{
- /* It might not be there, but log any error other than ENOENT */
+ /* It might not be there, but report any error other than ENOENT */
if (errno != ENOENT)
- ereport(ERROR,
+ ereport(elevel,
(errcode_for_file_access(),
errmsg("could not remove cache file \"%s\": %m",
initfilename)));
On 2018-May-29, Alvaro Herrera wrote:
I added an Assert(DatabasePath != NULL) to
RelationCacheInitFilePreInvalidate() and didn't see a single crash when
running the tests. I thought that adding a "VACUUM FREEZE pg_class"
in the recovery tests (where there is a standby) ought to do it, but it
doesn't. What's the deal there?
Sorry, that was dumb -- the assert obviously is hit if I remove the code
to set DatabasePath beforehand :-)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-05-29 19:14:51 -0400, Alvaro Herrera wrote:
I added an Assert(DatabasePath != NULL) to
RelationCacheInitFilePreInvalidate() and didn't see a single crash when
running the tests. I thought that adding a "VACUUM FREEZE pg_class"
in the recovery tests (where there is a standby) ought to do it, but it
doesn't. What's the deal there?

Here are some proposed changes. Some of these comment edits are WIP :-)
I'm a bit confused by these changes - there seem to be some that look
like a borked diff? And a number of others look pretty unrelated?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 0d6100fb08..8a8620943f 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -872,6 +872,8 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
int nmsgs, bool RelcacheInitFileInval,
Oid dbid, Oid tsid)
{
+ Assert(InRecovery);
+
Idk, seems unrelated.
if (nmsgs <= 0)
return;

@@ -884,12 +886,13 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
dbid);

/*
- * RelationCacheInitFilePreInvalidate, when the invalidation message
- * is for a specific database, requires DatabasePath to be set, but we
- * should not use SetDatabasePath during recovery, since it is
- * intended to be used only once by normal backends. Hence, a quick
- * hack: set DatabasePath directly then unset after use.
+ * When the invalidation message is for a specific database,
+ * RelationCacheInitFilePreInvalidate requires DatabasePath to be set,
+ * but we're not allowed to use SetDatabasePath during recovery (since
+ * it is intended to be used by normal backends). Hence, a quick hack:
+ * set DatabasePath directly then unset after use.
 */
+ Assert(!DatabasePath); /* don't clobber an existing value */
 if (OidIsValid(dbid))
 DatabasePath = GetDatabasePath(dbid, tsid);
Same.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index aa4427724d..71b2212cbd 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1934,6 +1934,11 @@ RelationIdGetRelation(Oid relationId)
RelationClearRelation(rd, true);

/*
+ * Normal entries are valid by now -- except nailed ones when
+ * loaded before relcache initialization. There isn't enough
+ * infrastructure yet to do pg_class lookups, but we need their
+ * rd_rel entries to be updated, so we let these through.
+ */
 * Normally entries need to be valid here, but before the relcache
 * has been initialized, not enough infrastructure exists to
 * perform pg_class lookups. The structure of such entries doesn't

@@ -2346,8 +2351,7 @@ RelationClearRelation(Relation relation, bool rebuild)
RelationCloseSmgr(relation);
This sure looks like it's a syntax error? So I guess you might not have
staged the removal parts of the diff?
/*
- * Treat nailed-in system relations separately, they always need to be
- * accessible, so we can't blow them away.
+ * We cannot blow away nailed-in relations, so treat them especially.
 */
The former seems just as accurate, and is basically just the already
existing comment?
if (relation->rd_isnailed)
{

@@ -5942,7 +5946,8 @@ write_relcache_init_file(bool shared)
 * wrote out was up-to-date.)
 *
 * This mustn't run concurrently with the code that unlinks an init file
- * and sends SI messages, so grab a serialization lock for the duration.
+ * and sends SI messages (see RelationCacheInitFilePreInvalidate), so grab
+ * a serialization lock for the duration.
 */
LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
Unrelated?
@@ -6061,6 +6066,10 @@ RelationHasUnloggedIndex(Relation rel)
 * changed one or more of the relation cache entries that are kept in the
 * local init file.
 *
+ * When DatabasePath is set, both the init file for that database and the
+ * shared (global) init files are to be removed; otherwise only the latter is.
+ * This is useful during recovery (XXX really?)
+ *
I'm confused?
* To be safe against concurrent inspection or rewriting of the init file,
* we must take RelCacheInitLock, then remove the old init file, then send
* the SI messages that include relcache inval for such relations, and then

@@ -6180,9 +6189,9 @@ unlink_initfile(const char *initfilename, int elevel)
{
if (unlink(initfilename) < 0)
{
- /* It might not be there, but log any error other than ENOENT */
+ /* It might not be there, but report any error other than ENOENT */
if (errno != ENOENT)
- ereport(ERROR,
+ ereport(elevel,
(errcode_for_file_access(),
errmsg("could not remove cache file \"%s\": %m",
initfilename)));
Included the elevel inclusion. Can't parse the difference between log
and report here?
Greetings,
Andres Freund
Hi,
On 2018-05-28 12:52:06 -0700, Andres Freund wrote:
Hi,
On 2018-05-27 13:00:06 -0700, Andres Freund wrote:
I've a patch that seems to work, that mostly needs some comment
polishing.

Attached is what I currently have. Still needs some more work, but I
think it's more than good enough to review the approach. Basically the
approach consists of two changes:

1) Send init file removals for shared nailed relations as well.
This fixes that the shared init file contains arbitrarily outdated
information for relfrozenxid etc. Leading to e.g. the pg_authid
errors we've seen in some recent threads. Only applies to
new connections.

2) Reread RelationData->rd_rel for nailed relations when invalidated.
This ensures that already built relcache entries for nailed relations
are updated. Currently they never are. This currently doesn't cause
*that* frequently an issue for !shared entries, because for those the
init file gets zapped regularly, and autovacuum workers usually don't
live that long. But it's still a significant correctness issue for
both shared and non-shared relations.
Here's a more polished v2 version of the patch. I, locally, did the
work to backpatch the change. Besides trivialities there's two
nontrivial changes:
- In earlier versions there are no global invalidations. I've inquired and
complained about what exactly they're for in
http://archives.postgresql.org/message-id/20180611231634.w2rgtlzxaw4loefk%40alap3.anarazel.de
In earlier branches we just don't do the global thing. That seems
unproblematic to me.
- The bigger issue is that in 9.3, and just there, commit
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8de3e410faa06ab20ec1aa6d0abb0a2c040261ba
does not yet exist.
That means back then we performed reloads even outside a
transaction. I don't feel confident about invoking additional catalog
reloads in the new situations, so I kept the IsTransactionState()
checks in RelationReloadNailed(). That seems less risky, but possibly
somebody wants to argue the other way round?
There's some minor other conflicts, but they're all pretty obvious.
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.
FWIW, I wonder if this isn't critical enough to make us consider having
a point release earlier..
Still think this is something we should seriously consider.
- Andres
Attachments:
v2-0001-Fix-bugs-in-vacuum-of-shared-rels-by-keeping-thei.patch (text/x-diff; charset=us-ascii)
From d4fc41f841951e74e298811ad5ad9b872d14d607 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 May 2018 12:38:22 -0700
Subject: [PATCH v2] Fix bugs in vacuum of shared rels, by keeping their
relcache entries current.
When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.
This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point. After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
ERROR: found xmin ... from before relfrozenxid ...
and
ERROR: found multixact ... from before relminmxid ...
To be caused by this bug the errors have to occur on system catalog
tables.
The two bugs are:
1) Invalidations for nailed relations were ignored, based on the
theory that the relcache entry for such tables doesn't
change. Which is largely true, except for fields like relfrozenxid
etc. This means that changes to relations vacuumed in other
sessions weren't picked up by already existing sessions. Luckily
autovacuum doesn't have particularly longrunning sessions.
2) For shared *and* nailed relations, the shared relcache init file
was never invalidated while running. That means that for such
tables (e.g. pg_authid, pg_database) it's not just already existing
sessions that are affected, but even new connections are as well.
That explains why the reports usually were about pg_authid et. al.
To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad. The fix for 2) is simpler,
simply always remove both the shared and local init files.
Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com
https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com
Backpatch: 9.3-
---
src/backend/utils/cache/inval.c | 32 +++--
src/backend/utils/cache/relcache.c | 187 +++++++++++++++++++----------
src/include/storage/standbydefs.h | 2 +-
3 files changed, 146 insertions(+), 75 deletions(-)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 955f5c7fdc5..16d86a29390 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -521,12 +521,12 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
(void) GetCurrentCommandId(true);
/*
- * If the relation being invalidated is one of those cached in the local
- * relcache init file, mark that we need to zap that file at commit. Same
- * is true when we are invalidating whole relcache.
+ * If the relation being invalidated is one of those cached in a relcache
+ * init file, mark that we need to zap that file at commit. For simplicity
+ * invalidations for a specific database always invalidate the shared file
+ * as well. Also zap when we are invalidating whole relcache.
*/
- if (OidIsValid(dbId) &&
- (RelationIdIsInInitFile(relId) || relId == InvalidOid))
+ if (relId == InvalidOid || RelationIdIsInInitFile(relId))
transInvalInfo->RelcacheInitFileInval = true;
}
@@ -881,18 +881,26 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
if (RelcacheInitFileInval)
{
+ elog(trace_recovery(DEBUG4), "removing relcache init files for database %u",
+ dbid);
+
/*
- * RelationCacheInitFilePreInvalidate requires DatabasePath to be set,
- * but we should not use SetDatabasePath during recovery, since it is
+ * RelationCacheInitFilePreInvalidate, when the invalidation message
+ * is for a specific database, requires DatabasePath to be set, but we
+ * should not use SetDatabasePath during recovery, since it is
* intended to be used only once by normal backends. Hence, a quick
* hack: set DatabasePath directly then unset after use.
*/
- DatabasePath = GetDatabasePath(dbid, tsid);
- elog(trace_recovery(DEBUG4), "removing relcache init file in \"%s\"",
- DatabasePath);
+ if (OidIsValid(dbid))
+ DatabasePath = GetDatabasePath(dbid, tsid);
+
RelationCacheInitFilePreInvalidate();
- pfree(DatabasePath);
- DatabasePath = NULL;
+
+ if (OidIsValid(dbid))
+ {
+ pfree(DatabasePath);
+ DatabasePath = NULL;
+ }
}
SendSharedInvalidMessages(msgs, nmsgs);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3dfb1b8fbe2..7fed394f552 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -250,6 +250,7 @@ static void RelationDestroyRelation(Relation relation, bool remember_tupdesc);
static void RelationClearRelation(Relation relation, bool rebuild);
static void RelationReloadIndexInfo(Relation relation);
+static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
static void AtEOXact_cleanup(Relation relation, bool isCommit);
@@ -286,7 +287,7 @@ static void IndexSupportInitialize(oidvector *indclass,
static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
StrategyNumber numSupport);
static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
-static void unlink_initfile(const char *initfilename);
+static void unlink_initfile(const char *initfilename, int elevel);
static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
PartitionDesc partdesc2);
@@ -1931,7 +1932,16 @@ RelationIdGetRelation(Oid relationId)
RelationReloadIndexInfo(rd);
else
RelationClearRelation(rd, true);
- Assert(rd->rd_isvalid);
+
+ /*
+ * Normally entries need to be valid here, but before the relcache
+ * has been initialized, not enough infrastructure exists to
+ * perform pg_class lookups. The structure of such entries doesn't
+ * change, but we still want to update the rd_rel entry. So
+ * rd_isvalid = false is left in place for a later lookup.
+ */
+ Assert(rd->rd_isvalid ||
+ (rd->rd_isnailed && !criticalRelcachesBuilt));
}
return rd;
}
@@ -2135,6 +2145,81 @@ RelationReloadIndexInfo(Relation relation)
relation->rd_isvalid = true;
}
+/*
+ * RelationReloadNailed - reload minimal information for nailed relations.
+ *
+ * The structure of a nailed relation can never change (which is good, because
+ * we rely on knowing their structure to be able to read catalog content). But
+ * some parts, e.g. pg_class.relfrozenxid, are still important to have
+ * accurate content for. Therefore those need to be reloaded after the arrival
+ * of invalidations.
+ */
+static void
+RelationReloadNailed(Relation relation)
+{
+ Assert(relation->rd_isnailed);
+
+ /*
+ * Redo RelationInitPhysicalAddr in case it is a mapped relation whose
+ * mapping changed.
+ */
+ RelationInitPhysicalAddr(relation);
+
+ /* flag as needing to be revalidated */
+ relation->rd_isvalid = false;
+
+ /*
+ * Can only reread catalog contents if in a transaction. If the relation
+ * is currently open (not counting the nailed refcount), do so
+ * immediately. Otherwise we've already marked the entry as possibly
+ * invalid, and it'll be fixed when next opened.
+ */
+ if (!IsTransactionState() || relation->rd_refcnt <= 1)
+ return;
+
+ if (relation->rd_rel->relkind == RELKIND_INDEX)
+ {
+ /*
+ * If it's a nailed-but-not-mapped index, then we need to re-read the
+ * pg_class row to see if its relfilenode changed.
+ */
+ RelationReloadIndexInfo(relation);
+ }
+ else
+ {
+ /*
+ * Reload a non-index entry. We can't easily do so if relcaches
+ * aren't yet built, but that's fine because at that stage the
+ * attributes that need to be current (like relfrozenxid) aren't yet
+ * accessed. To ensure the entry will later be revalidated, we leave
+ * it in invalid state, but allow use (cf. RelationIdGetRelation()).
+ */
+ if (criticalRelcachesBuilt)
+ {
+ HeapTuple pg_class_tuple;
+ Form_pg_class relp;
+
+ /*
+ * NB: Mark the entry as valid before starting to scan, to avoid
+ * self-recursion when re-building pg_class.
+ */
+ relation->rd_isvalid = true;
+
+ pg_class_tuple = ScanPgRelation(RelationGetRelid(relation),
+ true, false);
+ relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
+ memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
+ heap_freetuple(pg_class_tuple);
+
+ /*
+ * Again mark as valid, to protect against concurrently arriving
+ * invalidations.
+ */
+ relation->rd_isvalid = true;
+ }
+ }
+}
+
/*
* RelationDestroyRelation
*
@@ -2250,27 +2335,12 @@ RelationClearRelation(Relation relation, bool rebuild)
RelationCloseSmgr(relation);
/*
- * Never, never ever blow away a nailed-in system relation, because we'd
- * be unable to recover. However, we must redo RelationInitPhysicalAddr
- * in case it is a mapped relation whose mapping changed.
- *
- * If it's a nailed-but-not-mapped index, then we need to re-read the
- * pg_class row to see if its relfilenode changed. We do that immediately
- * if we're inside a valid transaction and the relation is open (not
- * counting the nailed refcount). Otherwise just mark the entry as
- * possibly invalid, and it'll be fixed when next opened.
+ * Treat nailed-in system relations separately, they always need to be
+ * accessible, so we can't blow them away.
*/
if (relation->rd_isnailed)
{
- RelationInitPhysicalAddr(relation);
-
- if (relation->rd_rel->relkind == RELKIND_INDEX ||
- relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
- {
- relation->rd_isvalid = false; /* needs to be revalidated */
- if (relation->rd_refcnt > 1 && IsTransactionState())
- RelationReloadIndexInfo(relation);
- }
+ RelationReloadNailed(relation);
return;
}
@@ -5907,24 +5977,26 @@ write_item(const void *data, Size len, FILE *fp)
/*
* Determine whether a given relation (identified by OID) is one of the ones
- * we should store in the local relcache init file.
+ * we should store in a relcache init file.
*
* We must cache all nailed rels, and for efficiency we should cache every rel
* that supports a syscache. The former set is almost but not quite a subset
- * of the latter. Currently, we must special-case TriggerRelidNameIndexId,
- * which RelationCacheInitializePhase3 chooses to nail for efficiency reasons,
- * but which does not support any syscache.
- *
- * Note: this function is currently never called for shared rels. If it were,
- * we'd probably also need a special case for DatabaseNameIndexId, which is
- * critical but does not support a syscache.
+ * of the latter. The special cases are relations where
+ * RelationCacheInitializePhase2/3 chooses to nail for efficiency reasons, but
+ * which do not support any syscache.
*/
bool
RelationIdIsInInitFile(Oid relationId)
{
- if (relationId == TriggerRelidNameIndexId)
+ if (relationId == SharedSecLabelRelationId ||
+ relationId == TriggerRelidNameIndexId ||
+ relationId == DatabaseNameIndexId ||
+ relationId == SharedSecLabelObjectIndexId)
{
- /* If this Assert fails, we don't need this special case anymore. */
+ /*
+ * If this Assert fails, we don't need the applicable special case
+ * anymore.
+ */
Assert(!RelationSupportsSysCache(relationId));
return true;
}
@@ -5994,38 +6066,30 @@ RelationHasUnloggedIndex(Relation rel)
* We take the lock and do the unlink in RelationCacheInitFilePreInvalidate,
* then release the lock in RelationCacheInitFilePostInvalidate. Caller must
* send any pending SI messages between those calls.
- *
- * Notice this deals only with the local init file, not the shared init file.
- * The reason is that there can never be a "significant" change to the
- * relcache entry of a shared relation; the most that could happen is
- * updates of noncritical fields such as relpages/reltuples. So, while
- * it's worth updating the shared init file from time to time, it can never
- * be invalid enough to make it necessary to remove it.
*/
void
RelationCacheInitFilePreInvalidate(void)
{
- char initfilename[MAXPGPATH];
+ char localinitfname[MAXPGPATH];
+ char sharedinitfname[MAXPGPATH];
- snprintf(initfilename, sizeof(initfilename), "%s/%s",
- DatabasePath, RELCACHE_INIT_FILENAME);
+ if (DatabasePath)
+ snprintf(localinitfname, sizeof(localinitfname), "%s/%s",
+ DatabasePath, RELCACHE_INIT_FILENAME);
+ snprintf(sharedinitfname, sizeof(sharedinitfname), "global/%s",
+ RELCACHE_INIT_FILENAME);
LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
- if (unlink(initfilename) < 0)
- {
- /*
- * The file might not be there if no backend has been started since
- * the last removal. But complain about failures other than ENOENT.
- * Fortunately, it's not too late to abort the transaction if we can't
- * get rid of the would-be-obsolete init file.
- */
- if (errno != ENOENT)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove cache file \"%s\": %m",
- initfilename)));
- }
+ /*
+ * The files might not be there if no backend has been started since the
+ * last removal. But complain about failures other than ENOENT with
+ * ERROR. Fortunately, it's not too late to abort the transaction if we
+ * can't get rid of the would-be-obsolete init file.
+ */
+ if (DatabasePath)
+ unlink_initfile(localinitfname, ERROR);
+ unlink_initfile(sharedinitfname, ERROR);
}
void
@@ -6051,13 +6115,9 @@ RelationCacheInitFileRemove(void)
struct dirent *de;
char path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY)];
- /*
- * We zap the shared cache file too. In theory it can't get out of sync
- * enough to be a problem, but in data-corruption cases, who knows ...
- */
snprintf(path, sizeof(path), "global/%s",
RELCACHE_INIT_FILENAME);
- unlink_initfile(path);
+ unlink_initfile(path, LOG);
/* Scan everything in the default tablespace */
RelationCacheInitFileRemoveInDir("base");
@@ -6097,7 +6157,7 @@ RelationCacheInitFileRemoveInDir(const char *tblspcpath)
/* Try to remove the init file in each database */
snprintf(initfilename, sizeof(initfilename), "%s/%s/%s",
tblspcpath, de->d_name, RELCACHE_INIT_FILENAME);
- unlink_initfile(initfilename);
+ unlink_initfile(initfilename, LOG);
}
}
@@ -6105,12 +6165,15 @@ RelationCacheInitFileRemoveInDir(const char *tblspcpath)
}
static void
-unlink_initfile(const char *initfilename)
+unlink_initfile(const char *initfilename, int elevel)
{
if (unlink(initfilename) < 0)
{
/* It might not be there, but log any error other than ENOENT */
if (errno != ENOENT)
- elog(LOG, "could not remove cache file \"%s\": %m", initfilename);
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not remove cache file \"%s\": %m",
+ initfilename)));
}
}
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index bb6144805a7..17b74ca3b61 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -64,7 +64,7 @@ typedef struct xl_invalidations
{
Oid dbId; /* MyDatabaseId */
Oid tsId; /* MyDatabaseTableSpace */
- bool relcacheInitFileInval; /* invalidate relcache init file */
+ bool relcacheInitFileInval; /* invalidate relcache init files */
int nmsgs; /* number of shared inval msgs */
SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
} xl_invalidations;
--
2.17.0.rc1.dirty
Hi,
On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.
Done.
- Andres
On Fri, May 25, 2018 at 3:37 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
Moving discussion to -hackers. Tom, I think you worked most with this
code, your input would be appreciated.

Original discussion is around:
http://archives.postgresql.org/message-id/20180524211311.tnswfnjwnii54htx%40alvherre.pgsql

On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
On 2018-May-24, Andres Freund wrote:
Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
ah, so deleting the relcache file makes the problem go away? That's
definitely pretty strange. I see no reason for the value in relcache to
become out of step with the catalogued value in the same database ... I
don't think we transmit in any way values of one database to another.

I can reproduce the issue. As far as I can tell we just don't ever
actually update nailed relcache entries in the normal course, leaving
the "physical address" aside. VACUUM will, via
vac_update_relstats() -> heap_inplace_update() ->
CacheInvalidateHeapTuple(),
send out an invalidation. But invalidation, in my case another session,
will essentially ignore most of that due to:

static void
RelationClearRelation(Relation relation, bool rebuild)
...
	/*
	 * Never, never ever blow away a nailed-in system relation, because we'd
	 * be unable to recover. However, we must redo RelationInitPhysicalAddr
	 * in case it is a mapped relation whose mapping changed.
	 *
	 * If it's a nailed-but-not-mapped index, then we need to re-read the
	 * pg_class row to see if its relfilenode changed. We do that immediately
	 * if we're inside a valid transaction and the relation is open (not
	 * counting the nailed refcount). Otherwise just mark the entry as
	 * possibly invalid, and it'll be fixed when next opened.
	 */
	if (relation->rd_isnailed)
	{
		RelationInitPhysicalAddr(relation);

		if (relation->rd_rel->relkind == RELKIND_INDEX ||
			relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
		{
			relation->rd_isvalid = false; /* needs to be revalidated */
			if (relation->rd_refcnt > 1 && IsTransactionState())
				RelationReloadIndexInfo(relation);
		}
		return;
	}

Which basically means that once running we'll never update the relcache
data for nailed entries. That's unproblematic for most relcache fields,
but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.

This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
vacuums despite being required. And it'll lead, triggering this thread,
to wrong errors being raised during vacuum because relfrozenxid just is
some random value from the past. I suspect this might also be
co-responsible for a bunch of planning issues for queries involving the
catalog, because the planner will use wrong relcache data until the next
time the init file is thrown away?

This looks like a very longstanding bug to me. I'm not yet quite sure
what the best way to deal with this is. I suspect we might get away
with just looking up a new version of the pg_class tuple and copying
rd_rel over?

Greetings,
Andres Freund
I have a question related to this - and specifically, preventing the error
until we have a patch :). We are encountering this error every few weeks
on one very high transaction db, and have to restart to fix it.
If I read you correctly, the cache may never be invalidated for these
catalogs even if I manually VACUUM them? I was thinking if I routinely run
VACUUM FREEZE on these tables in every database I might avoid the issue.
But given the cause of the issue, would that just make no difference and I
will still hit the error eventually?
Thanks,
Jeremy
Hello friends.
On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.

Done.
Sorry to bother about this, but do you have any plan to do the minor
release before planned due to this bug?
There seem to be many users affected by this. And worse, many
users may not even have noticed they have the problem, which can cause
`age(datfrozenxid)` to keep increasing until it reaches 2.1 billion and the
system goes down.
In my case, I have a server whose `age(datfrozenxid)` is already at 1.9
billion, and I expect it to reach 2.1 billion in about 14 days.
Fortunately, I have a monitoring system over `age(datfrozenxid)`, which is why
I found the issue on one of my servers.
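A minimal sketch of that kind of monitoring, assuming nothing beyond the
standard pg_database catalog, is a query like:

-- Per-database transaction ID age; trouble starts as this nears ~2.1 billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;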
I'm pondering what is the best option to avoid a forced shutdown of this
server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if
deleting the `global/pg_internal.init` file is really the way to go, and
the details, is it safe? Should I stop the server, delete, start?)
Best regards,
--
Matheus de Oliveira
On Tue, Jun 19, 2018 at 8:26 AM Matheus de Oliveira <
matioli.matheus@gmail.com> wrote:
Hello friends.
On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.

Done.
Sorry to bother about this, but do you have any plan to do the minor
release before planned due to this bug?

There seem to be many users affected by this. And worse, many
users may not even have noticed they have the problem, which can cause
`age(datfrozenxid)` to keep increasing until it reaches 2.1 billion and the
system goes down.

In my case, I have a server whose `age(datfrozenxid)` is already at 1.9
billion, and I expect it to reach 2.1 billion in about 14 days.
Fortunately, I have a monitoring system over `age(datfrozenxid)`, which is why
I found the issue on one of my servers.

I'm pondering what is the best option to avoid a forced shutdown of this
server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if
deleting the `global/pg_internal.init` file is really the way to go, and
the details, is it safe? Should I stop the server, delete, start?)

Best regards,
--
Matheus de Oliveira

Restarting the database has fixed the error on these pg_catalog tables,
allowing us to vacuum them and avoid wraparound.
We first noticed a restart fixed the issue because SAN snapshots did not
have the error. The only real difference being shared memory, nothing at the
disk level.
Jeremy
Hi,
On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
Hello friends.
On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.

Done.
Sorry to bother about this, but do you have any plan to do the minor
release before planned due to this bug?
Unclear at this point. There's been discussion about it, without coming
to a conclusion.
I'm pondering what is the best option to avoid a forced shutdown of this
server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if
deleting the `global/pg_internal.init` file is really the way to go, and
the details, is it safe? Should I stop the server, delete, start?)
You should first make sure it's actually this problem - which tables are
holding back the xmin horizon? After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.
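As a sketch of that first check, assuming only the standard pg_class columns
and run from inside each suspect database, something like this lists the
relations whose relfrozenxid is oldest (the problem catalogs show up at the
top):

SELECT oid::regclass AS relation, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')  -- tables, matviews, TOAST tables
ORDER BY xid_age DESC
LIMIT 10;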
Greetings,
Andres Freund
Hi,
On 2018-06-19 08:25:33 -0500, Jeremy Finzel wrote:
I have a question related to this - and specifically, preventing the error
until we have a patch :).
The patch has been committed to postgres, although no release has been
made including it yet.
We are encountering this error every few weeks on one very high
transaction db, and have to restart to fix it.
If I read you correctly, the cache may never be invalidated for these
catalogs even if I manually VACUUM them? I was thinking if I routinely run
VACUUM FREEZE on these tables in every database I might avoid the
issue.
Correct, that won't help. In fact, it'll quite possibly make it worse.
But given the cause of the issue, would that just make no difference and I
will still hit the error eventually?
Yes. You might be able to get away with just removing the
pg_internal.init files before and after an explicit VACUUM on shared
system tables.
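For the VACUUM part of that workaround, a sketch (assuming superuser access;
removing the pg_internal.init files themselves happens at the filesystem
level, outside SQL) could be:

-- After removing global/pg_internal.init and reconnecting, vacuum the
-- shared catalogs named in the error messages; remove the init file
-- again afterwards.
VACUUM VERBOSE pg_catalog.pg_authid;
VACUUM VERBOSE pg_catalog.pg_auth_members;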
Greetings,
Andres Freund
On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
Hello friends.
On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de>
wrote:
On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.

Done.
Sorry to bother about this, but do you have any plan to do the minor
release before planned due to this bug?

Unclear at this point. There's been discussion about it, without coming
to a conclusion.
Ok. Thank you for the information.
I really hope you decide to release it soon; I'm very worried about the
users who have hit the bug but haven't noticed the issue.
I'm pondering what is the best option to avoid a forced shutdown of this
server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if
deleting the `global/pg_internal.init` file is really the way to go, and
the details, is it safe? Should I stop the server, delete, start?)

You should first make sure it's actually this problem - which tables are
holding back the xmin horizon?
How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the
following message is logged every minute (which I believe is due to
`autovacuum_naptime`, which is using the default of 1 minute):
ERROR: found xmin 3460221635 from before relfrozenxid 1245633870
CONTEXT: automatic vacuum of table "template0.pg_catalog.pg_authid"
ERROR: found xmin 3460221635 from before relfrozenxid 1245633870
CONTEXT: automatic vacuum of table
"template0.pg_catalog.pg_auth_members"
Do you need controldata or more info to validate it?
After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.
I'll try that and get back to you if it worked or not. Thank you for the
confirmation.
And thank you for all clarifications.
Best regards,
--
Matheus de Oliveira
Hi,
On 2018-06-19 17:05:48 -0300, Matheus de Oliveira wrote:
You should first make sure it's actually this problem - which tables are
holding back the xmin horizon?

How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the
following message is logged every minute (which I belive is due to
`autovacuum_naptime`, which is using the default of 1 minute):
Yes, that sounds like the issue. Basically just wanted the table names:
ERROR: found xmin 3460221635 from before relfrozenxid 1245633870
CONTEXT: automatic vacuum of table "template0.pg_catalog.pg_authid"
which indeed are shared relations.
Greetings,
Andres Freund
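A quick way to confirm that the catalogs involved are indeed shared, assuming
only the standard pg_class columns, is something like:

-- relisshared marks catalogs whose storage is shared across all databases;
-- the relfrozenxid shown is the connected database's bookkeeping for them.
SELECT relname, relisshared, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relname IN ('pg_authid', 'pg_auth_members', 'pg_database');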
Andres Freund <andres@anarazel.de> writes:
You should first make sure it's actually this problem - which tables are
holding back the xmin horizon? After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.
Thanks for the clarification! I also have this problem; BTW, autovacuum does
not work at all:
# select max(last_autovacuum) from pg_stat_user_tables;
max
-------------------------------
2018-06-06 00:48:47.813841+03
all its workers stopped with these messages:
ERROR: found xmin 982973690 from before relfrozenxid 2702858737
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_authid"
ERROR: found xmin 982973690 from before relfrozenxid 2702858761
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_auth_members"
and it does not try to vacuum other tables.
--
Sergey Burladyan
Hi,
On 2018-06-20 15:05:59 +0300, Sergey Burladyan wrote:
Andres Freund <andres@anarazel.de> writes:
You should first make sure it's actually this problem - which tables are
holding back the xmin horizon? After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.

Thanks for the clarification! I also have this problem; BTW, autovacuum does
not work at all:
# select max(last_autovacuum) from pg_stat_user_tables;
max
-------------------------------
2018-06-06 00:48:47.813841+03

all its workers stopped with these messages:
ERROR: found xmin 982973690 from before relfrozenxid 2702858737
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_authid"
ERROR: found xmin 982973690 from before relfrozenxid 2702858761
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_auth_members"

and it does not try to vacuum other tables.
Yea, that's this bug. I'd remove global/pg_internal.init, reconnect,
and manually vacuum.
Greetings,
Andres Freund
Hello again...
On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
...
After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.
Just to let you know: I applied the workaround on the affected server and
it seems to have worked fine; so far no error.
Thank you a lot for all the information and help.
Best regards,
--
Matheus de Oliveira
Hackers, felt like reporting this relevant tidbit in case of interest...
My site is among the few who've hit this bug.
We recently observed a case of pg_database having joined pg_authid and
pg_auth_members in the bad-xmin, unable-to-vacuum shared catalog group,
however...
pg_database then started working again without any intervention on our
part, so apparently due to some relevant system activity.
We did later do a restart to correct the problem for those other 2
catalogs, but will try the global init file hack mentioned below later
if/when we face this anew before running a fixed Pg version.
Many thanks!
Matheus de Oliveira <matioli.matheus@gmail.com> writes:
Hello again...
On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de>
wrote:...
After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think
of a
case where it's problematic, even without stopping the server.
Just to let you know: I applied the workaround on the affected
server and it seems to have worked fine; so far no error.

Thank you a lot for all the information and help.
Best regards,
--
Matheus de Oliveira
--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
Hello,
We found the exact same issue on a production database at TripAdvisor. We have no new information to add for debugging. Bumping this to up the priority on a patch version release getting out.
Thanks,
-Kyle Samson
On Wed, Aug 8, 2018 at 10:23 AM, Kyle Samson <kysamson@tripadvisor.com> wrote:
We found the exact same issue on a production database at TripAdvisor. We have no new information to add for debugging. Bumping this to up the priority on a patch version release getting out.
There is a batch of point releases that were just stamped + tagged.
They should be released shortly.
--
Peter Geoghegan
On Wed, Aug 8, 2018 at 10:27:59AM -0700, Peter Geoghegan wrote:
On Wed, Aug 8, 2018 at 10:23 AM, Kyle Samson <kysamson@tripadvisor.com> wrote:
We found the exact same issue on a production database at TripAdvisor. We have no new information to add for debugging. Bumping this to up the priority on a patch version release getting out.
There is a batch of point releases that were just stamped + tagged.
They should be released shortly.
They have been released:
https://www.postgresql.org/about/news/1878/
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +