does this look more like a possible bug or more like a possible hardware problem...? (long)

Started by Frank van Vugt · March 2004 · 9 messages · pgsql-general
#1 Frank van Vugt
ftm.van.vugt@foxi.nl

Hi,

I'm using a 9500 line script to create an initial database. After applying a
number of changes, I was repeatedly executing this script while hunting for
typos. The script itself takes care of dropping the various objects and every
now and then (when delta t << autovacuum t) a manual vacuum full was
performed to clean the system tables.
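
The cycle described above would look roughly like this (a sketch only, using the script path and database name that appear in the logs below):

```shell
# Rebuild cycle: the script itself drops and recreates every object.
psql -d megafox_trial -f /home/data/megadump.sql

# Every now and then, manually compact the system catalogs bloated
# by the repeated drop/create churn:
psql -d megafox_trial -c 'VACUUM FULL;'
```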

At one point, I arrived at the following situation:

db=# select version();
version
------------------------------------------------------------------------
PostgreSQL 7.4.1 on i586-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.3

Message in psql terminal:

psql:/home/data/megadump.sql:5169: NOTICE: CREATE TABLE / PRIMARY KEY will
create implicit index "remainder_type_pkey" for table "remainder_type"
psql:/home/data/megadump.sql:5169: NOTICE: CREATE TABLE / UNIQUE will create
implicit index "remainder_type_abbreviation_key" for table "remainder_type"
psql:/home/data/megadump.sql:5169: WARNING: specified item offset is too
large
psql:/home/data/megadump.sql:5169: PANIC: failed to add item to the page for
"pg_attribute_relid_attnum_index"
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:/home/data/megadump.sql:5169: connection to server was lost
vugtf@katrien:~>

(so psql did indeed quit)

The specific part in the script contains nothing strange:

CREATE TABLE "remainder_type" (
"id" integer primary key check ("id" > 0),
"abbreviation" varchar(20) not null unique,
"descr" varchar(30) not null
);
GRANT SELECT on remainder_type to PUBLIC;
GRANT ALL on remainder_type to "postgres";

The server logs are mentioning this:

2004-03-09 23:11:34 [6516] LOG: statement: CREATE TABLE "remainder_type" (
"id" integer primary key check ("id" > 0),
"abbreviation" varchar(20) not null unique,
"descr" varchar(30) not null
);
2004-03-09 23:11:34 [6516] NOTICE: CREATE TABLE / PRIMARY KEY will create
implicit index "remainder_type_pkey" for table "remainder_type"
2004-03-09 23:11:34 [6516] NOTICE: CREATE TABLE / UNIQUE will create implicit
index "remainder_type_abbreviation_key" for table "remainder_type"
2004-03-09 23:11:34 [6516] WARNING: specified item offset is too large
2004-03-09 23:11:34 [6516] PANIC: failed to add item to the page for
"pg_attribute_relid_attnum_index"
STATEMENT: CREATE TABLE "remainder_type" (
"id" integer primary key check ("id" > 0),
"abbreviation" varchar(20) not null unique,
"descr" varchar(30) not null
);
2004-03-09 23:11:34 [833] LOG: server process (PID 6516) was terminated by
signal 6
2004-03-09 23:11:34 [833] LOG: terminating any other active server processes
2004-03-09 23:11:34 [8074] WARNING: terminating connection because of crash
of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat
your command.
2004-03-09 23:11:34 [833] LOG: all server processes terminated;
reinitializing
2004-03-09 23:11:34 [8075] LOG: database system was interrupted at 2004-03-09
23:11:31 CET
2004-03-09 23:11:35 [8075] LOG: checkpoint record is at 1/9FAAB5C
2004-03-09 23:11:35 [8075] LOG: redo record is at 1/9E803DC; undo record is
at 0/0; shutdown FALSE
2004-03-09 23:11:35 [8075] LOG: next transaction ID: 354187; next OID:
8355508
2004-03-09 23:11:35 [8075] LOG: database system was not properly shut down;
automatic recovery in progress
2004-03-09 23:11:35 [8075] LOG: redo starts at 1/9E803DC
2004-03-09 23:11:46 [8075] LOG: record with zero length at 1/B27335C
2004-03-09 23:11:46 [8075] LOG: redo done at 1/B273338
2004-03-09 23:11:55 [8075] LOG: database system is ready

After this, a vacuum full verbose analyse seems to end prematurely with:

INFO: vacuuming "public.stock_delivery"
INFO: "stock_delivery": found 0 removable, 0 nonremovable row versions in 0
pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "stock_delivery_pkey" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO: index "stock_delivery_delivery_id_key" now contains 0 row versions in 1
pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "stock_delivery_idx1" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: analyzing "public.stock_delivery"
INFO: "stock_delivery": 0 pages, 0 rows sampled, 0 estimated total rows
ERROR: catalog is missing 6 attribute(s) for relid 8349771

Looking at the pg_autovacuum log (see the timestamps), it seems this could have
clashed with the script:

[2004-03-09 11:11:08 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_depend"
[2004-03-09 11:11:29 PM] table name:
megafox_trial."pg_catalog"."pg_depend"
[2004-03-09 11:11:29 PM] relid: 16599; relisshared: 0
[2004-03-09 11:11:29 PM] reltuples: 6124; relpages: 153
[2004-03-09 11:11:29 PM] curr_analyze_count: 668446; cur_delete_count:
332899
[2004-03-09 11:11:29 PM] ins_at_last_analyze: 668446; del_at_last_vacuum:
332899
[2004-03-09 11:11:29 PM] insert_threshold: 6624; delete_threshold
13248
[2004-03-09 11:11:29 PM] Performing: VACUUM ANALYZE
"pg_catalog"."pg_attribute"
[2004-03-09 11:11:35 PM] Can not refresh statistics information from the
database megafox_trial.
[2004-03-09 11:11:35 PM] The error is [server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
]
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select
oid,reltuples,relpages from pg_class where oid=1249) to database
megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] table name:
megafox_trial."pg_catalog"."pg_attribute"
[2004-03-09 11:11:35 PM] relid: 1249; relisshared: 0
[2004-03-09 11:11:35 PM] reltuples: 3595; relpages: 214
[2004-03-09 11:11:35 PM] curr_analyze_count: 721978; cur_delete_count:
361920
[2004-03-09 11:11:35 PM] ins_at_last_analyze: 707819; del_at_last_vacuum:
346312
[2004-03-09 11:11:35 PM] insert_threshold: 4095; delete_threshold
11400
[2004-03-09 11:11:35 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_class"
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (VACUUM
ANALYZE "pg_catalog"."pg_class") to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select
oid,reltuples,relpages from pg_class where oid=1259) to database
megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] table name:
megafox_trial."pg_catalog"."pg_class"
[2004-03-09 11:11:35 PM] relid: 1259; relisshared: 0
[2004-03-09 11:11:35 PM] reltuples: 885; relpages: 117
[2004-03-09 11:11:35 PM] curr_analyze_count: 439206; cur_delete_count:
373969
[2004-03-09 11:11:35 PM] ins_at_last_analyze: 430423; del_at_last_vacuum:
366316
[2004-03-09 11:11:35 PM] insert_threshold: 1385; delete_threshold
2770
[2004-03-09 11:11:35 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_attrdef"
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (VACUUM
ANALYZE "pg_catalog"."pg_attrdef") to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select
oid,reltuples,relpages from pg_class where oid=16384) to database
megafox_trial
[2004-03-09 11:11:35 PM] The error is []

There's nothing critical here and I expect a drop/create database to solve it,
but I'll wait for a reply just in case.

--
Best,

Frank.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Frank van Vugt (#1)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> At one point, I arrived at the following situation:
>
> psql:/home/data/megadump.sql:5169: WARNING: specified item offset is too
> large
> psql:/home/data/megadump.sql:5169: PANIC: failed to add item to the page for
> "pg_attribute_relid_attnum_index"

Looks like a bug to me. Can you create a reproducible test case?

regards, tom lane

#3 Frank van Vugt
ftm.van.vugt@foxi.nl
In reply to: Tom Lane (#2)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Hi Tom,

> > At one point, I arrived at the following situation:
> > psql:/home/data/megadump.sql:5169: WARNING: specified item offset is too
> > large
> > psql:/home/data/megadump.sql:5169: PANIC: failed to add item to the page
> > for "pg_attribute_relid_attnum_index"
>
> Looks like a bug to me. Can you create a reproducible test case?

Nope.....

Trying the same script on a newly created database doesn't show the problem.
However, I do still have the database that came up with this message, which
is now failing a vacuum full verbose analyse with:

(psql)
INFO: vacuuming "public.user_pref"
INFO: "user_pref": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "user_pref_pkey" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO: index "user_pref_idx1" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: analyzing "public.user_pref"
INFO: "user_pref": 0 pages, 0 rows sampled, 0 estimated total rows
ERROR: catalog is missing 6 attribute(s) for relid 8349771

(log)
2004-03-10 10:10:00 [8207] LOG: statement: vacuum full verbose analyse;
2004-03-10 10:11:43 [8207] ERROR: catalog is missing 6 attribute(s) for relid
8349771

The server itself has no particular messages in its log that hint on what went
on, except for the earlier mentioned 'possible clash' of a running
pg_autovacuum. By the way, I have stopped pg_autovacuum by now to avoid any
more changes on this particular database.

--
Best,

Frank.

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Frank van Vugt (#3)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Frank van Vugt <ftm.van.vugt@foxi.nl> writes:

>> At one point, I arrived at the following situation:
>> psql:/home/data/megadump.sql:5169: WARNING: specified item offset is too
>> large
>> psql:/home/data/megadump.sql:5169: PANIC: failed to add item to the page
>> for "pg_attribute_relid_attnum_index"
>
> Trying the same script on a newly created database doesn't show the problem.
> However, I do still have the database that came up with this message, which
> is now failing a vacuum full verbose analyse with:
>
> ERROR: catalog is missing 6 attribute(s) for relid 8349771

This is consistent with the idea that pg_attribute_relid_attnum_index is
corrupted. I would suggest saving a copy of that file for postmortem
analysis and then trying to REINDEX pg_attribute. (Depending on which
PG version you are running, that may require running a standalone
backend. See the REINDEX man page.)
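
On 7.4 the standalone-backend route would go roughly like this (a sketch; $PGDATA, the copy destination, and <relfilenode> are placeholders, not values from the thread):

```shell
# 1. Stop the postmaster and save the suspect index file for
#    postmortem analysis (its relfilenode comes from pg_class;
#    7847604 is this database's directory under base/).
pg_ctl stop -D $PGDATA
cp $PGDATA/base/7847604/<relfilenode> /tmp/pg_attr_index.save

# 2. Reindex from a standalone backend; -O allows system-catalog
#    changes and -P makes the backend ignore system indexes.
postgres -D $PGDATA -O -P megafox_trial
# at the backend prompt:
#   REINDEX TABLE pg_attribute;
```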

If REINDEX makes the problem go away, would you send me the corrupted
index file off-list?

regards, tom lane

#5 Frank van Vugt
ftm.van.vugt@foxi.nl
In reply to: Tom Lane (#4)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Hi,

> > ERROR: catalog is missing 6 attribute(s) for relid 8349771
>
> This is consistent with the idea that pg_attribute_relid_attnum_index is
> corrupted. I would suggest saving a copy of that file for postmortem
> analysis and then trying to REINDEX pg_attribute.

Uhm...... looking for a file named 8349771 located in the directory base/<nr>
where nr corresponds to the proper db, using oid2name.....

We seem to have a small problem: there is no such file.....

I'll hold reindexing in case you first want to look elsewhere.

> If REINDEX makes the problem go away, would you send me the corrupted
> index file off-list?

I happily would, but you already know its content ;-)

Here's a log of what I did:

base/7847604# ls 82*
8240054 8240056 8240058 8240059 8240600 8240602 8261384 8261386
8261388 8261398

base/7847604# ls 83*
8351687 8352194 8352623 8352754 8353080 8353478 8353581 8353741
<cut>

The last few existing entries from the 82* range seem to be regular
user-tables and corresponding indexes, likewise for the first few in the 83*
range

But when I try the particular oid:

# /usr/local/pgsql/bin/oid2name -d megafox_trial -U ... -P ... -o 8349771
Tablename of oid 8349771 from database "megafox_trial":
---------------------------------
8349771 = article_property_tree

and its reverse:

# /usr/local/pgsql/bin/oid2name -d megafox_trial -U ... -P ...
-t article_property_tree
Oid of table article_property_tree from database "megafox_trial":
---------------------------------
VERY scary: more than one table with that name found!!

--
Best,

Frank.

#6 Frank van Vugt
ftm.van.vugt@foxi.nl
In reply to: Tom Lane (#4)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Some additional info:

# select * from pg_class where relname = 'article_property_tree';
-[ RECORD 1 ]--
+-------------------------------------------------------------------
relname        | article_property_tree
relnamespace   | 2200
reltype        | 8349772
relowner       | 1000
relam          | 0
relfilenode    | 8349771
relpages       | 4
reltuples      | 299
reltoastrelid  | 8349775
reltoastidxid  | 0
relhasindex    | t
relisshared    | f
relkind        | r
relnatts       | 6
relchecks      | 1
reltriggers    | 3
relukeys       | 0
relfkeys       | 0
relrefs        | 0
relhasoids     | t
relhaspkey     | t
relhasrules    | f
relhassubclass | f
relacl         | {vugtf=a*r*w*d*R*x*t*/vugtf,=arwdRxt/vugtf,postgres=arwdRxt/
vugtf}
-[ RECORD 2 ]--
+-------------------------------------------------------------------
relname        | article_property_tree
relnamespace   | 2200
reltype        | 8354495
relowner       | 1000
relam          | 0
relfilenode    | 8354494
relpages       | 4
reltuples      | 299
reltoastrelid  | 8354498
reltoastidxid  | 0
relhasindex    | t
relisshared    | f
relkind        | r
relnatts       | 6
relchecks      | 1
reltriggers    | 5
relukeys       | 0
relfkeys       | 0
relrefs        | 0
relhasoids     | t
relhaspkey     | t
relhasrules    | f
relhassubclass | f
relacl         | {vugtf=a*r*w*d*R*x*t*/vugtf,=arwdRxt/vugtf,postgres=arwdRxt/
vugtf}

--
Best,

Frank.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Frank van Vugt (#5)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Frank van Vugt <ftm.van.vugt@foxi.nl> writes:

>> This is consistent with the idea that pg_attribute_relid_attnum_index is
>> corrupted. I would suggest saving a copy of that file for postmortem
>> analysis and then trying to REINDEX pg_attribute.
>
> Uhm...... looking for a file named 8349771 located in the directory base/<nr>
> where nr corresponds to the proper db, using oid2name.....
> We seem to have a small problem: there is no such file.....

I'm sorry, I meant to say save a copy of
pg_attribute_relid_attnum_index. The correct filename for it can be
found via
select relfilenode from pg_class where relname = 'pg_attribute_relid_attnum_index';
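
Combined with the database OID, that pins down the file on disk (a sketch; run in the affected database):

```sql
-- Directory under $PGDATA/base/ for this database:
SELECT oid FROM pg_database WHERE datname = 'megafox_trial';

-- Filename of the index within that directory:
SELECT relfilenode FROM pg_class
 WHERE relname = 'pg_attribute_relid_attnum_index';
```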

> Oid of table article_property_tree from database "megafox_trial":
> ---------------------------------
> VERY scary: more than one table with that name found!!

Do you have more than one such table (presumably in different schemas)?
If so this isn't scary at all. I don't believe oid2name has been fixed
to be schema-aware :-(
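
Until oid2name is schema-aware, a direct catalog query shows which schemas any duplicates live in (a sketch):

```sql
-- List every plain table named article_property_tree with its schema:
SELECT n.nspname AS schema, c.oid, c.relfilenode
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relname = 'article_property_tree'
   AND c.relkind = 'r';
```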

regards, tom lane

#8 Frank van Vugt
ftm.van.vugt@foxi.nl
In reply to: Tom Lane (#7)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

> I'm sorry, I meant to say save a copy of
> pg_attribute_relid_attnum_index. The correct filename for it can be
> found via
> select relfilenode from pg_class where relname =
> 'pg_attribute_relid_attnum_index';

Got it, made a backup of the entire database as well. Since the db wasn't
filled yet, both files are fairly small when bzipped (<0.5MB and < 5MB) and
both can be made available if you want.

Issuing 'reindex table pg_attribute' did not solve the problem, though....

>> Oid of table article_property_tree from database "megafox_trial":
>> ---------------------------------
>> VERY scary: more than one table with that name found!!
>
> Do you have more than one such table (presumably in different schemas)?
> If so this isn't scary at all. I don't believe oid2name has been fixed
> to be schema-aware :-(

Nope, all user-tables are in public at the moment.

Also:

megafox_trial=# select * from article_property_tree;
ERROR: catalog is missing 6 attribute(s) for relid 8349771

--
Best,

Frank.

#9 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#2)
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)

Tom Lane wrote:

> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> At one point, I arrived at the following situation:
>>
>> psql:/home/data/megadump.sql:5169: WARNING: specified item offset is too
>> large
>> psql:/home/data/megadump.sql:5169: PANIC: failed to add item to the page for
>> "pg_attribute_relid_attnum_index"
>
> Looks like a bug to me. Can you create a reproducible test case?

I have seen one occurrence of that during a Slony test run on the log
table. I think it can be reproduced (though not reliably) with a high
frequency of inserts, selects and deletes of growing keys combined with a
very high frequency of vacuums at the same time.

Concurrency seems to be an issue here.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #