Serious Crash last Friday

Started by Henrik Steffenalmost 24 years ago21 messagesgeneral
Jump to latest
#1Henrik Steffen
steffen@city-map.de

Hello all,

on Friday we experienced a very very worrying crash of our postgresql
server.

Our system runs on a Intel Pentium IV, 1.6 GHz, 1 GB RAM with latest
Postgresql-Server
7.2.1 (redhat rpm) - on rather heavy load

There are 109 user tables in one database, the largest tables contain 60
columns and
approx. 400.000 rows. it's a dedicated database-only machine.

Well, the crash was indicated as follows: One of my employees complained
that she couldn't
work anymore (via webinterface). The error-message was due to an error in
the
employee-table. This particular table has a unique row for employee-numbers.
Suddenly
there were 11 entries for the same employee. Even my name was included
twice, and
another employee still working on friday afternoon was also included 3
times. Note:
This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

Taking a closer look, I found additional tables, with non-unique values in
UNIQUE columns.

When trying to delete unique values by using the OIDs, I found out, that
even the OIDs
were the same!!!! Taking a yet closer look, I found out by querying
pg_tables that
there were duplicates of some tables. Then there was the message: "Backend
message type
0x44 arrived while idle"

I was running VACUUM and VACUUM FULL a hundred times - but it failed to
repair these
errors. It didn't even succeed in running VACUUM on all tables: VACUUM
complained something
about "UNIQUE" (I didn't write down the exact error message though).

Then I tried to DUMP as much as I could, then I stopped the database, moved
the db-folder to
a different location, did a new initdb and restored the whole system.
Unfortunately
there was one table I couldn't dump at all and I had to use the 15 hours old
backup copy.

But, please correct me if I am wrong, this should never actually happen,
shouldn't it?

Anyone had any of these problems before? I will see if this happens again -
and if it
does I will have to think about using a different backend-server. I'll don't
have to
explain to you, that a database server that corrupts data, is completely
useless.

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:

Hello all,

on Friday we experienced a very very worrying crash of our postgresql
server.

Sound like the CTIDs are out of whack or something. If you're really
desperate you can try the program here, it may be able to dump something.
http://svana.org/kleptog/pgsql/pgfsck.html

Well, the crash was indicated as follows: One of my employees complained
that she couldn't
work anymore (via webinterface). The error-message was due to an error in
the
employee-table. This particular table has a unique row for employee-numbers.
Suddenly
there were 11 entries for the same employee. Even my name was included
twice, and
another employee still working on friday afternoon was also included 3
times. Note:
This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

What DB version is this. Could it be XID wraparound?

Taking a closer look, I found additional tables, with non-unique values in
UNIQUE columns.

When trying to delete unique values by using the OIDs, I found out, that
even the OIDs
were the same!!!! Taking a yet closer look, I found out by querying
pg_tables that
there were duplicates of some tables. Then there was the message: "Backend
message type
0x44 arrived while idle"

Try the CTIDs, they will be unique.

I was running VACUUM and VACUUM FULL a hundred times - but it failed to
repair these
errors. It didn't even succeed in running VACUUM on all tables: VACUUM
complained something
about "UNIQUE" (I didn't write down the exact error message though).

Please post the message exactly as printed out.

Then I tried to DUMP as much as I could, then I stopped the database, moved
the db-folder to
a different location, did a new initdb and restored the whole system.
Unfortunately
there was one table I couldn't dump at all and I had to use the 15 hours old
backup copy.

But, please correct me if I am wrong, this should never actually happen,
shouldn't it?

Never, that's why it would be helpful to know what went wrong.

Anyone had any of these problems before? I will see if this happens again -
and if it
does I will have to think about using a different backend-server. I'll don't
have to
explain to you, that a database server that corrupts data, is completely
useless.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#3Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

Hi Martijn,

cute little program you pointed me to, thank you. So I am not the only
one expiriencing problems on certain SELECTs sometimes. That's another
very annoying thing about postgresql. Had it several times by now and
always tried to find the corrupted tuples by hand...

ok, but back to the crash of last friday:

What DB version is this. Could it be XID wraparound?

it's postgres 7.2.1

what actually is XID wraparound, and how can I find out if I have it?

Try the CTIDs, they will be unique.

ah, this was also new for me - have been working with oids sometimes,
but never heard of ctids before, thanks again.

Please post the message exactly as printed out.

This is what I can see from /var/log/messages:
(these messages were often repeated:)

XLogFlush: request D/39CC9F8 is not satisfied - flushed only to D/39A4354

(some messages are in German, I'll try to translate them:)

Can't create "Unique"-Index, because table contains duplicated values

This happened while vacuuming:
Duplicated value cannot be inserted in "Unique"-Index pg_class_relname_index
Duplicated value cannot be inserted in "Unique"-Index
pg_statistic_relid_att_index
Duplicated value cannot be inserted in "Unique"-Index pg_class_oid_index

Looks like these system-tables have been corrupted, too

As i mentioned before, I copied the complete data-directory to a different
location, so
someone could have a look at the complete corrupted data.

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 9:43 AM
Subject: Re: [GENERAL] Serious Crash last Friday

On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:

Hello all,

on Friday we experienced a very very worrying crash of our postgresql
server.

Sound like the CTIDs are out of whack or something. If you're really
desperate you can try the program here, it may be able to dump something.
http://svana.org/kleptog/pgsql/pgfsck.html

Well, the crash was indicated as follows: One of my employees complained
that she couldn't
work anymore (via webinterface). The error-message was due to an error

in

the
employee-table. This particular table has a unique row for

employee-numbers.

Suddenly
there were 11 entries for the same employee. Even my name was included
twice, and
another employee still working on friday afternoon was also included 3
times. Note:
This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

What DB version is this. Could it be XID wraparound?

Taking a closer look, I found additional tables, with non-unique values

in

UNIQUE columns.

When trying to delete unique values by using the OIDs, I found out, that
even the OIDs
were the same!!!! Taking a yet closer look, I found out by querying
pg_tables that
there were duplicates of some tables. Then there was the message:

"Backend

message type
0x44 arrived while idle"

Try the CTIDs, they will be unique.

I was running VACUUM and VACUUM FULL a hundred times - but it failed to
repair these
errors. It didn't even succeed in running VACUUM on all tables: VACUUM
complained something
about "UNIQUE" (I didn't write down the exact error message though).

Please post the message exactly as printed out.

Then I tried to DUMP as much as I could, then I stopped the database,

moved

the db-folder to
a different location, did a new initdb and restored the whole system.
Unfortunately
there was one table I couldn't dump at all and I had to use the 15 hours

old

backup copy.

But, please correct me if I am wrong, this should never actually happen,
shouldn't it?

Never, that's why it would be helpful to know what went wrong.

Anyone had any of these problems before? I will see if this happens

again -

and if it
does I will have to think about using a different backend-server. I'll

don't

Show quoted text

have to
explain to you, that a database server that corrupts data, is completely
useless.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#4Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

Hello,

trying pgfsck on my corrupted employee table from friday it gave me about 85
lines complaining
about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"

the table had 184 rows, out of which 85 were corrupt ??

trying pgfsck on the current employee table of today (after new initdb etc.)
with 184 rows,
I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how can
this be???

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 9:43 AM
Subject: Re: [GENERAL] Serious Crash last Friday

On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:

Hello all,

on Friday we experienced a very very worrying crash of our postgresql
server.

Sound like the CTIDs are out of whack or something. If you're really
desperate you can try the program here, it may be able to dump something.
http://svana.org/kleptog/pgsql/pgfsck.html

Well, the crash was indicated as follows: One of my employees complained
that she couldn't
work anymore (via webinterface). The error-message was due to an error

in

the
employee-table. This particular table has a unique row for employee-numb

ers.

Suddenly
there were 11 entries for the same employee. Even my name was included
twice, and
another employee still working on friday afternoon was also included 3
times. Note:
This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

What DB version is this. Could it be XID wraparound?

Taking a closer look, I found additional tables, with non-unique values

in

UNIQUE columns.

When trying to delete unique values by using the OIDs, I found out, that
even the OIDs
were the same!!!! Taking a yet closer look, I found out by querying
pg_tables that
there were duplicates of some tables. Then there was the message:

"Backend

message type
0x44 arrived while idle"

Try the CTIDs, they will be unique.

I was running VACUUM and VACUUM FULL a hundred times - but it failed to
repair these
errors. It didn't even succeed in running VACUUM on all tables: VACUUM
complained something
about "UNIQUE" (I didn't write down the exact error message though).

Please post the message exactly as printed out.

Then I tried to DUMP as much as I could, then I stopped the database,

moved

the db-folder to
a different location, did a new initdb and restored the whole system.
Unfortunately
there was one table I couldn't dump at all and I had to use the 15 hours

old

backup copy.

But, please correct me if I am wrong, this should never actually happen,
shouldn't it?

Never, that's why it would be helpful to know what went wrong.

Anyone had any of these problems before? I will see if this happens

again -

and if it
does I will have to think about using a different backend-server. I'll

don't

Show quoted text

have to
explain to you, that a database server that corrupts data, is completely
useless.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#5Martijn van Oosterhout
kleptog@svana.org
In reply to: Henrik Steffen (#4)
Re: Serious Crash last Friday

On Mon, Jun 17, 2002 at 10:39:21AM +0200, Henrik Steffen wrote:

Hello,

trying pgfsck on my corrupted employee table from friday it gave me about 85
lines complaining
about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"

the table had 184 rows, out of which 85 were corrupt ??

Hmm, it depends mostly on whether the parsed was greater or less than the
length. If it's less than it is something to do with additional padding
being added at the end, which I havn't worked out yet. I'm thinking of
supressing the warning where the remaining looks like padding.

If it's greater, there is a problem.

trying pgfsck on the current employee table of today (after new initdb etc.)
with 184 rows,
I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how can
this be???

Same or different tables?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#6Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

same employee table (but after initdb, and restored from fridays dump)

how can i find out the correct length of the tuples?

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 11:27 AM
Subject: Re: [GENERAL] Serious Crash last Friday

On Mon, Jun 17, 2002 at 10:39:21AM +0200, Henrik Steffen wrote:

Hello,

trying pgfsck on my corrupted employee table from friday it gave me

about 85

lines complaining
about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"

the table had 184 rows, out of which 85 were corrupt ??

Hmm, it depends mostly on whether the parsed was greater or less than the
length. If it's less than it is something to do with additional padding
being added at the end, which I havn't worked out yet. I'm thinking of
supressing the warning where the remaining looks like padding.

If it's greater, there is a problem.

trying pgfsck on the current employee table of today (after new initdb

etc.)

with 184 rows,
I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how

can

Show quoted text

this be???

Same or different tables?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#7Martijn van Oosterhout
kleptog@svana.org
In reply to: Henrik Steffen (#6)
Re: Serious Crash last Friday

On Mon, Jun 17, 2002 at 11:38:00AM +0200, Henrik Steffen wrote:

same employee table (but after initdb, and restored from fridays dump)

how can i find out the correct length of the tuples?

Tuple incorrect length (parsed data=xxxxxx, length=xxx)

"parsed data" is what was processed
"length" is the amount allocated in the tuple header

so, "parsed data" <= "length" is good, though hopefully it's not too much
less.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#8Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

so, "parsed data" <= "length" is good, though hopefully it's not too much
less.

well,

parsed data == 825242056
length == 84

I fear, it's toooooo much less.

BTW: The count of rows printed by the pgfsck increases permanently.
In my last mail, there were only 814 rows, 2 hours ago, there were 1200
rows, now there are 1600 rows !!! What does this mean? In the table, there
are only 185 rows like before.

There is always about 800000000 parsed data, and about 100-150 of length.

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 12:02 PM
Subject: Re: [GENERAL] Serious Crash last Friday

Show quoted text

On Mon, Jun 17, 2002 at 11:38:00AM +0200, Henrik Steffen wrote:

same employee table (but after initdb, and restored from fridays dump)

how can i find out the correct length of the tuples?

Tuple incorrect length (parsed data=xxxxxx, length=xxx)

"parsed data" is what was processed
"length" is the amount allocated in the tuple header

so, "parsed data" <= "length" is good, though hopefully it's not too much
less.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#9Martijn van Oosterhout
kleptog@svana.org
In reply to: Henrik Steffen (#8)
Re: Serious Crash last Friday

On Mon, Jun 17, 2002 at 01:44:22PM +0200, Henrik Steffen wrote:

so, "parsed data" <= "length" is good, though hopefully it's not too much
less.

well,

parsed data == 825242056
length == 84

I fear, it's toooooo much less.

That's wierd. Something is really wierd here, as that number is 0x313031C8,
which has '101' in text format. What datatypes are you using. Seems more
like a bug in my program.

BTW: The count of rows printed by the pgfsck increases permanently.
In my last mail, there were only 814 rows, 2 hours ago, there were 1200
rows, now there are 1600 rows !!! What does this mean? In the table, there
are only 185 rows like before.

There is always about 800000000 parsed data, and about 100-150 of length.

Are you using -a? It will print all rows, including deleted ones. That will
obviously continue to increase during usage. Are you doing this on a live
database?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#10Henrik Steffen
steffen@city-map.de
In reply to: Martijn van Oosterhout (#9)
Re: Serious Crash last Friday

yes, I was doing it on a living database... not good?

I was not using -a

this is structure dump of the employee table:

/* --------------------------------------------------------
phpPgAdmin 2.4-1 DB Dump
http://www.sourceforge.net/projects/phppgadmin/
Host: db.city-map.de:5432
Database : "kunden"
Table : "mitarbeiter"
2002-06-17 16:06:15
-------------------------------------------------------- */

CREATE TABLE "mitarbeiter" (
"miano" char(6),
"name" text,
"email" text,
"titel" text,
"telefon" text,
"pos" int4,
"benutzername" text,
"lastlogindate" date,
"lastlogintime" time,
"sprache" int4,
"prov1" int2,
"prov2" int2,
"region" char(4)
);
CREATE UNIQUE INDEX "mitarbeiter_miano_idx" ON "mitarbeiter" ("miano");

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

#11Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

well, I don't work with transactions at all...

and I can't see no dupes...

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Tuesday, June 18, 2002 1:25 AM
Subject: Re: [GENERAL] Serious Crash last Friday

On Mon, Jun 17, 2002 at 05:28:23PM +0200, Henrik Steffen wrote:

yes, I was doing it on a living database... not good?

I was not using -a

It doesn't really matter if you're doing it on a live database, it just
means that it won't see stuff that is in WAL that has not been committed

to

the main storage.

If you're not using -a then something else is going on. Do you have many
uncomitted transactions? The program is not particularly intelligent about
which rows are currently active, so you will probably see some duplicates.
Look at the oids shown on the right of each tuple. Also, I've just

uploaded

Show quoted text

a new version which deals with dates and times much better.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#12Henrik Steffen
steffen@city-map.de
In reply to: Martijn van Oosterhout (#2)
Re: Serious Crash last Friday

ok, but I don't use any explicit transactions using begin, end, commit or
rollback

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Andrew Sullivan" <andrew@libertyrms.info>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Tuesday, June 18, 2002 4:34 PM
Subject: Re: [GENERAL] Serious Crash last Friday

Show quoted text

On Tue, Jun 18, 2002 at 11:15:04AM +0200, Henrik Steffen wrote:

well, I don't work with transactions at all...

Yes you do. Everything in Postres is in a transaction. So if
someone hasn't finished something, you have an uncommitted
transaction.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
+1 416 646 3304 x110
#13Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#1)
Re: Serious Crash last Friday

Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
properly again.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.
But shouldn't this be unnecessary ?

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Henrik Steffen" <steffen@city-map.de>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "pg" <pgsql-general@postgresql.org>
Sent: Tuesday, June 18, 2002 11:15 AM
Subject: Re: [GENERAL] Serious Crash last Friday

well, I don't work with transactions at all...

and I can't see no dupes...

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Tuesday, June 18, 2002 1:25 AM
Subject: Re: [GENERAL] Serious Crash last Friday

On Mon, Jun 17, 2002 at 05:28:23PM +0200, Henrik Steffen wrote:

yes, I was doing it on a living database... not good?

I was not using -a

It doesn't really matter if you're doing it on a live database, it just
means that it won't see stuff that is in WAL that has not been committed

to

the main storage.

If you're not using -a then something else is going on. Do you have many
uncomitted transactions? The program is not particularly intelligent

about

which rows are currently active, so you will probably see some

duplicates.

Show quoted text

Look at the oids shown on the right of each tuple. Also, I've just

uploaded

a new version which deals with dates and times much better.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

#14Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#13)
Re: Serious Crash last Friday

Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
properly again.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.

But shouldn't this be unnecessary ?

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

#15Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#14)
Re: Serious Crash last Friday

Dear Philippe,

I have always been using the VACUUM ANALYZE on a daily basis.

I recently changed this cronjob into "VACUUM FULL ANALYZE" - which
didn't help either.

What seems to be helpful is, to drop all user indexes and recreate them
on a daily basis (before, I did this only on a weekly basis once every
sunday)

What exactly is XID wraparound? How can I make sure I am using it? Is it
anything
I shouldn't use? Or should I change anything to make my system run more
stable?

Any help highly appreciated

thanks

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 9:06 AM
Subject: RE: [GENERAL] Serious Crash last Friday

Henrik,

I think it was Martijn who asked at a certain point if you made sure there
was no ... (XID ?) wraparound. As it's a heavily used database, couldn't
this be the cause ? i.e., did you try since then to regularly perform the
vacuum analyze (as I see this, on a daily basis) ?

Regards,

Philippe Bertin
Software Development Engineer Avionics
---------------------------------------------------

Show quoted text

-----Original Message-----
From: Henrik Steffen [SMTP:steffen@city-map.de]
Sent: donderdag 20 juni 2002 8:54
To: pg
Subject: Re: [GENERAL] Serious Crash last Friday

Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query
worked
properly again.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.

But shouldn't this be unnecessary ?

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

#16Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#15)
Re: Serious Crash last Friday

Hi,

thanks for this piece of information. Do you know, where and how I can see
which XID is the current? From this I could tell if there more than 2M
transactions a day.

I don't use any triggers. But on our heavily loaded website, there are about
60.000 pageviews on an average day, and each page view triggers maybe 10 -
15 database
lookups.

so we might get rather close to 2 M a day on certain days with more
pageviews.

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

#17Henrik Steffen
steffen@city-map.de
In reply to: Henrik Steffen (#16)
Re: Serious Crash last Friday

not only lookups, but also inserts, updates and deletions, of course...

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 11:33 AM
Subject: RE: [GENERAL] Serious Crash last Friday

Hello,

If you're only having lookups, I still doubt that this kind of problem would
be able to occur (however, never be surprised of anything :) How you can
see the current ID, I can't tell, but I think that was also in the same
text. I suggest you have a look into the official PostgreSQL- documentation,
probably the Admin's guide, or the User's guide, because I think it was
there I saw this explanation.

Kind regards,

Philippe.

Show quoted text

-----Original Message-----
From: Henrik Steffen [SMTP:steffen@city-map.de]
Sent: donderdag 20 juni 2002 11:29
To: Bertin, Philippe
Subject: Re: [GENERAL] Serious Crash last Friday

Hi,

thanks for this piece of information. Do you know, where and how I can see
which XID is the current? From this I could tell if there more than 2M
transactions
a day.

I don't use any triggers. But on our heavily loaded website, there are
about
60.000 pageviews a day, and each page view triggers maybe 10 - 15 database
lookups.
so we might get rather close to 2 M a day on certain days.

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 10:52 AM
Subject: RE: [GENERAL] Serious Crash last Friday

Henrik,

Not that I know so much about it (I don't even remember if it's really
called 'XID'- raparound). But it comes to something like : every action on
the database gets an ID. This ID is contained in 32 bits, so 4M; only half
of the range can be used, so 2M. Every 2M actions, the ID wraps around. If
that happens, your data may get corrupt. A vacuum analyze resets these
ID's
...

I don't remember where I read this. I'll look around as from when I have
time, and mail you back the reference for it. I think it'll be around this
evening (Belgian time) when I'll find the time for doing so ... Maybe
Martijn could give you some hints ?

However, if you indicate that you're doing a vacuum analyze on a daily
basis, I doubt that that can be the reason (unless you would have a lot of
triggers on your tables that increment the number of actions exponentially
by recursively calling each other ?)

Kind regards,

Philippe Bertin.

-----Original Message-----
From: Henrik Steffen [SMTP:steffen@city-map.de]
Sent: donderdag 20 juni 2002 9:34
To: Bertin, Philippe
Cc: pg
Subject: Re: [GENERAL] Serious Crash last Friday

Dear Philippe,

I have always been using the VACUUM ANALYZE on a daily basis.

I recently changed this cronjob into "VACUUM FULL ANALYZE" - which
didn't help either.

What seems to be helpful is, to drop all user indexes and recreate them
on a daily basis (before, I did this only on a weekly basis once every
sunday)

What exactly is XID wraparound? How can I make sure I am using it? Is it
anything
I shouldn't use? Or should I change anything to make my system run more
stable?

Any help highly appreciated

thanks

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 9:06 AM
Subject: RE: [GENERAL] Serious Crash last Friday

Henrik,

I think it was Martijn who asked at a certain point if you made sure

there

was no ... (XID ?) wraparound. As it's a heavily used database, couldn't
this be the cause ? i.e., did you try since then to regularly perform

the

vacuum analyze (as I see this, on a daily basis) ?

Regards,

Philippe Bertin
Software Development Engineer Avionics
---------------------------------------------------

-----Original Message-----
From: Henrik Steffen [SMTP:steffen@city-map.de]
Sent: donderdag 20 juni 2002 8:54
To: pg
Subject: Re: [GENERAL] Serious Crash last Friday

Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some

SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code

7:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated

them.

Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query
worked
properly again.

I have now created a cronjob that will drop and recreate all indexes

on

a

daily basis.

But shouldn't this be unnecessary ?

Mit freundlichem Gru�

Henrik Steffen
Gesch�ftsf�hrer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline: +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

---------------------------(end of

broadcast)---------------------------

TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

#18Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Henrik Steffen (#14)
Re: Serious Crash last Friday

On Thu, 20 Jun 2002, Henrik Steffen wrote:

Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

As a question, what does the log say and did it leave a core in the
database directory that you can get a backtrace from?

#19scott.marlowe
scott.marlowe@ihs.com
In reply to: Henrik Steffen (#14)
Re: Serious Crash last Friday

On Thu, 20 Jun 2002, Henrik Steffen wrote:

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Look at that error message again. It says SOME OTHER backend died
abnormally, and your query was terminated because of it. I.e. the
following query was NOT the problem it simply couldn't run because some
other query caused a backend to abort.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

Yes it is.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
properly again.

It would likely have worked fine without all that, since it wasn't the
cause of the backend crash.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.

But shouldn't this be unnecessary ?

Correct. Someday, someone will step up to the plate and fix the problem
with btrees growing and growing and not reusing dead space.

Til then the solution is to reindex heavily updated indexes during nightly
maintenance.

A few questions. Have you done any really heavy testing on your server to
make sure it has no problems with its hardware or anything? I've seen
machines with memory errors or bad blocks on the hard drive slip into
production and wreak havoc due to slow corruption of a database.

Try compiling the linux kernel with a -j 10 switch (i.e. 10 seperate
threads, eats up tons of memory) and see if you get sig 11 errors. Also,
check your hard drives for bad blocks as well (badblock is the command,
and it can run in a "save a block before write testing it then put the
data back in it" mode that lets you find all the bad blocks on your hard
drives.

Bad blocks are the primary reason I always try to run my database on RAID1
or RAID5 software raid as a minimum on Linux, since a bad block will cause
the affected drive to be marked offline, and not affect your data
integrity.

--
"Force has no place where there is need of skill.", "Haste in every
business brings failures.", "This is the bitterest pain among men, to have
much knowledge but no power." -- Herodotus

#20Alvaro Herrera
alvherre@atentus.com
In reply to: scott.marlowe (#19)
Re: Serious Crash last Friday

Scott Marlowe dijo:

On Thu, 20 Jun 2002, Henrik Steffen wrote:

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.

But shouldn't this be unnecessary ?

[...]

A few questions. Have you done any really heavy testing on your server to
make sure it has no problems with its hardware or anything? I've seen
machines with memory errors or bad blocks on the hard drive slip into
production and wreak havoc due to slow corruption of a database.

Also, if you are on ix86 hardware, try running memtest86 for a few hours
on it (www.teresaudio.com/memtest86).

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"La realidad se compone de muchos sue�os, todos ellos diferentes,
pero en cierto aspecto, parecidos..." (Yo, hablando de sue�os er�ticos)

#21Andrew Sullivan
andrew@libertyrms.info
In reply to: Henrik Steffen (#13)